Find and Deal with Duplicate Files

Within any one system, there may be hundreds of duplicate files that do nothing but take up space.  These range from JPG images and MP3s to program files and the occasional folder backup.  However, finding them can be quite time consuming.

Finding the right tool for the job is almost as difficult as finding the duplicate files in the first place.  The programs available are usually fairly limited and come at a price.  What happened to free software?

Whilst out on the net, I came across a really neat bit of utility software that runs from the command line.  The program's author wanted to do the same thing as me: find duplicate files and use the hard link feature of the NTFS file system to keep files in the same location while recovering the wasted space.

Like me, he tried all of the shareware programs, but none of them was able to hardlink files, so in his frustration he decided to write his own freeware utility. He called it finddupe.

Finddupe.exe is a powerful utility with options to just find duplicate files, hard link them, or delete the duplicates.  It does this by reading the first 32 KB of each file to produce a CRC value.  The CRC values are then used to track down candidate duplicates.  When candidates are found, the utility is clever enough to scan the whole file and generate a full-file CRC value.  This keeps files that match in their first 32 KB but differ further in from being treated as duplicates.
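
To make the two-pass idea concrete, here is a minimal Python sketch of it.  This is not finddupe's actual code.  The function names are my own, and grouping by file size as well as by the 32 KB CRC is an assumption I have added to cut down on false matches.

```python
import os
import zlib
from collections import defaultdict

PREFIX_BYTES = 32 * 1024  # size of the leading chunk checked in the first pass


def crc_of_file(path, limit=None):
    """Return the CRC-32 of a file, or of only its first `limit` bytes."""
    crc = 0
    remaining = limit
    with open(path, "rb") as handle:
        while True:
            size = 65536 if remaining is None else min(65536, remaining)
            if size == 0:
                break  # the requested prefix has been read in full
            chunk = handle.read(size)
            if not chunk:
                break  # end of file
            crc = zlib.crc32(chunk, crc)
            if remaining is not None:
                remaining -= len(chunk)
    return crc


def find_duplicates(root):
    """Return lists of paths under `root` whose contents appear identical."""
    # Pass 1: cheap check on file size (my assumption) and the first 32 KB.
    by_prefix = defaultdict(list)
    for folder, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            key = (os.path.getsize(path), crc_of_file(path, PREFIX_BYTES))
            by_prefix[key].append(path)

    # Pass 2: full-file CRC, but only for files that collided in pass 1.
    groups = []
    for candidates in by_prefix.values():
        if len(candidates) < 2:
            continue
        by_full = defaultdict(list)
        for path in candidates:
            by_full[crc_of_file(path)].append(path)
        groups.extend(sorted(g) for g in by_full.values() if len(g) > 1)
    return groups
```

Calling find_duplicates on a folder returns one list per set of matching files.  A matching CRC is strong evidence rather than proof, so a byte-for-byte comparison is a sensible last step before deleting or linking anything.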

The potential of this utility is amazing.  Just think about all of those MP3 files that kids share on the school network these days.  You could run the utility once a week to free up space, by finding those MP3 files in users' home folders and replacing each extra copy with a hard link.  Of course, if I had my way, MP3 files really wouldn't be in users' home folders at all, but that is a different matter.
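
For anyone curious what swapping a duplicate for a hard link involves, here is a rough Python sketch.  It is my own illustration, not what finddupe does internally.  The function name and the ".linktmp" scratch suffix are made up for the example, and both files are assumed to live on the same volume, since hard links cannot span volumes.

```python
import filecmp
import os


def replace_with_hard_link(keep, duplicate):
    """Replace `duplicate` with a hard link to `keep` if the bytes match.

    Returns True when a link was made, False when the file was left alone.
    """
    if os.path.samefile(keep, duplicate):
        return False  # both names already point at the same data
    if not filecmp.cmp(keep, duplicate, shallow=False):
        return False  # contents differ, so leave both files alone

    temporary = duplicate + ".linktmp"  # hypothetical scratch name
    os.link(keep, temporary)            # raises OSError across volumes
    os.replace(temporary, duplicate)    # swap the link in over the old copy
    return True
```

Creating the link under a scratch name first means the duplicate is only replaced once the link definitely exists, so a failure part way through should not lose the user's file.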

Hard linked files on NTFS work in much the same way as hard links (rather than symbolic links) in the Linux operating system: one set of data on disk with several names pointing at it.  The number of links is large but not infinite; the documented NTFS limit is 1023 per file.  This of course means that when the file is edited through any one of its names, the edited version shows up under all of them.  So maybe this isn't quite so clever to be using on pupils' Word documents.  When the file is deleted from a user area, only that hard link is deleted, rather than the data itself.  There is no real "original" file either: every link is equal, and the data stays on disk until the last link is deleted.  See http://en.wikipedia.org/wiki/Hard_link for more details on what a hard link is.
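
If you want to see this behaviour for yourself, the short Python sketch below creates a file, hard links it, edits it through the second name, and then deletes the first name.  The function and file names are invented for the demonstration.

```python
import os


def demonstrate_hard_link(folder):
    """Show that two hard linked names share one set of data."""
    first = os.path.join(folder, "song.mp3")
    second = os.path.join(folder, "copy-of-song.mp3")

    with open(first, "wb") as handle:
        handle.write(b"original data")

    os.link(first, second)                 # second name, same data on disk
    link_count = os.stat(first).st_nlink   # how many names the data now has

    # Edit through the second name...
    with open(second, "ab") as handle:
        handle.write(b" plus an edit")

    # ...and the change is visible through the first name.
    with open(first, "rb") as handle:
        seen_via_first = handle.read()

    # Deleting one name leaves the data reachable through the other.
    os.remove(first)
    with open(second, "rb") as handle:
        seen_after_delete = handle.read()

    return link_count, seen_via_first, seen_after_delete
```

One caveat: some programs save by writing a brand new file and renaming it over the old one, which quietly breaks the link instead of updating every copy, so the result depends on the application doing the editing.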

In theory you will save some disk space by using hard links, but when it comes to backing up the files, they will still take up the full space on the backup media, unless the backup software understands how to process hard links and junction points.

So where can you get this amazing utility?  Download it from http://www.sentex.net/~mwandel/finddupe/ and enjoy using it.
