Find and deal with Duplicate Files

Within any one system, there may be hundreds of duplicate files that do nothing but take up space. They range from JPG images and MP3s to program files and the occasional folder backup. Finding them, however, can be quite time-consuming.

Finding the right tool for the job is almost as difficult as finding the duplicate files in the first place. The programs available are usually fairly limited and come at a price. What happened to free software?

Whilst out on the net, I came across a really neat bit of utility software that runs from the command line. The program's author wanted to do the same thing as me: find duplicate files and use the hard linking feature of the NTFS file system to keep the files in the same locations while recovering the wasted space.

Like me, he tried all of the shareware programs, but none of them could hard link files, so in his frustration he decided to write his own freeware utility. He called it finddupe.

Finddupe.exe is a powerful utility with options to just find duplicate files, hard link them, or delete the duplicates. It works by analyzing the first 32 KB of each file to produce a CRC value. The CRC values are then used to track down candidate duplicates. When candidates are found, the utility is clever enough to scan the whole file and generate a CRC for its full contents, so files that look alike (the same name, say, or the same opening bytes) but differ slightly in content remain distinct.
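To make the two-stage idea concrete, here is a minimal Python sketch of the same approach. It is not finddupe's actual code: the function names `crc_of_file` and `find_duplicates`, the grouping by file size, and the use of `zlib.crc32` are all my own assumptions for illustration.

```python
import os
import zlib
from collections import defaultdict

PREFIX_BYTES = 32 * 1024  # only this much of each file is read in the first pass


def crc_of_file(path, limit=None, chunk_size=1 << 20):
    """Return the CRC-32 of a file, reading at most `limit` bytes if given."""
    crc = 0
    remaining = limit
    with open(path, "rb") as handle:
        while remaining is None or remaining > 0:
            want = chunk_size if remaining is None else min(chunk_size, remaining)
            block = handle.read(want)
            if not block:
                break
            crc = zlib.crc32(block, crc)
            if remaining is not None:
                remaining -= len(block)
    return crc


def find_duplicates(root):
    """Group files under `root` whose size, prefix CRC and full CRC all match."""
    # Pass 1: cheap check on the file size and the first 32 KB only.
    by_prefix = defaultdict(list)
    for folder, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            key = (os.path.getsize(path), crc_of_file(path, PREFIX_BYTES))
            by_prefix[key].append(path)

    # Pass 2: read whole files, but only for candidates that collided in pass 1.
    groups = []
    for candidates in by_prefix.values():
        if len(candidates) < 2:
            continue
        by_full = defaultdict(list)
        for path in candidates:
            by_full[crc_of_file(path)].append(path)
        groups.extend(sorted(group) for group in by_full.values() if len(group) > 1)
    return sorted(groups)
```

Calling `find_duplicates` on a folder returns a list of groups, each group being the paths that appear to hold identical content. A matching CRC is strong evidence rather than proof, so a cautious script would compare the files byte for byte before deleting or linking anything. Files that are already hard linked to each other will also show up as a group.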

The potential of this utility is amazing. Just think about all of those MP3 files that kids share on the school network these days. You could run the utility once a week to free up space by finding those MP3 files in users' home folders and replacing each duplicate copy with a hard link. Of course, if I had my way MP3 files wouldn't be in users' home folders at all, but that is a different matter.

Hard linked files work in a similar way to hard links on Linux (not symbolic links, which are only pointers to a path): one set of file data with many names. This of course means that when the file is edited in place, the edited version shows up under every linked name. So maybe this isn't quite so clever to use on pupils' Word documents. When the file is deleted from a user area, only that link is removed, not the data. There is no real "original" either: every link is equal, and the data stays on disk until the last link is deleted. See http://en.wikipedia.org/wiki/Hard_link for more details on what a hard link is.
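If you want to see this behaviour for yourself, the following small Python sketch creates a file, adds a second hard link to it, edits through one name, and deletes the other. The function name `hard_link_demo` and the file names are made up for the example, and it assumes the folder sits on a file system that supports hard links (NTFS does).

```python
import os
import tempfile


def hard_link_demo(folder):
    """Create a file plus a second hard link to it and record what happens."""
    first = os.path.join(folder, "report.txt")
    second = os.path.join(folder, "report-link.txt")

    with open(first, "w") as handle:
        handle.write("version 1")
    os.link(first, second)  # a second name for the same data, not a copy

    results = {
        "same_file": os.path.samefile(first, second),
        "link_count": os.stat(first).st_nlink,
    }

    # Overwrite the contents in place through the second name...
    with open(second, "w") as handle:
        handle.write("version 2")
    # ...and the change is visible through the first name too.
    with open(first) as handle:
        results["read_via_first"] = handle.read()

    # Deleting the first name removes one link; the data is still reachable.
    os.remove(first)
    with open(second) as handle:
        results["read_after_delete"] = handle.read()
    results["link_count_after_delete"] = os.stat(second).st_nlink
    return results


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        print(hard_link_demo(folder))
```

One caveat: the edit is shared only because the file is overwritten in place. A program that saves by writing a brand new file and renaming it over the old name will quietly break the link for that one name.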

In theory you will save some disk space by using hard links, but when it comes to backups the files may still take up the full space on the backup media, unless the backup software understands how to process hard links and junction points.
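As a rough illustration of the difference, this Python sketch totals a folder tree twice: once the way a link-unaware tool would (every name counted in full) and once counting each underlying file only once. The function name `tree_sizes` is hypothetical, and identifying a file by its device and file ID from `os.stat` is my assumption about how a link-aware tool might do it, not a description of any particular backup product.

```python
import os
import stat


def tree_sizes(root):
    """Return (naive_bytes, unique_bytes) for the regular files under `root`."""
    naive = 0      # every name counted in full, as a link-unaware copy would
    unique = 0     # each underlying file counted once
    seen = set()
    for folder, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            info = os.stat(path, follow_symlinks=False)
            if not stat.S_ISREG(info.st_mode):
                continue  # skip symbolic links and other special entries
            naive += info.st_size
            identity = (info.st_dev, info.st_ino)  # same pair means same file data
            if identity not in seen:
                seen.add(identity)
                unique += info.st_size
    return naive, unique
```

The gap between the two numbers is roughly the extra space a backup would need if it copied every hard link as a separate file.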

So where can you get this amazing utility? Download it from http://www.sentex.net/~mwandel/finddupe/ and enjoy using it.
