dupper.pl - finds duplicate files, optionally removes them
To get a list of duplicate files in a particular directory:
$ dupper.pl ~/public_html/images
Script to find (and optionally remove) duplicate files in one or more directories. Duplicates are identified by comparing MD5 checksums.
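The checksum comparison described above might look roughly like the following. This is a minimal sketch and not the script's actual code: the helper names md5_of_file and find_duplicates are hypothetical, and it assumes the list of candidate files has already been collected.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5;

# Hypothetical helper: return the hex MD5 digest of a file, or undef
# if the file cannot be opened.
sub md5_of_file {
    my ($path) = @_;
    open my $fh, '<', $path or return;
    binmode $fh;
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    return $digest;
}

# Hypothetical helper: group paths by digest and keep only the groups
# with more than one member, i.e. the duplicates.
sub find_duplicates {
    my (@paths) = @_;
    my %by_digest;
    for my $path (@paths) {
        my $digest = md5_of_file($path);
        next unless defined $digest;
        push @{ $by_digest{$digest} }, $path;
    }
    return grep { @$_ > 1 } values %by_digest;
}

# Usage: print each group of duplicates on one line.
print join(' ', @$_), "\n" for find_duplicates(@ARGV);
```

Files that cannot be read are silently left out of the comparison in this sketch; a real tool would probably want to warn about them.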
$ dupper.pl [options] [dir1 dir2 .. dirN]
See OPTIONS for details on the command line switches supported.
A list of directories to operate on should be specified on the command line. Failing that, the script will attempt to read directories from STDIN.
This script currently supports the following command line switches:
-s 'm/^\.rsrc$/ || -z $pti || -s $pti > 1048576'
Would skip the checksum on files named '.rsrc', files that are empty (the -z test), or files larger than one megabyte (1048576 bytes).
-p 'm/etc/'
Both -s and -p have access to the filename in $_ and to the full file path in the variable $pti (short for "path to item").
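One way such expressions could be evaluated is sketched below. It is an assumption about the mechanism and not taken from the script: the helper names compile_test and run_test are hypothetical, and $pti is declared here as a package variable so that the compiled expression can see it.

```perl
use strict;
use warnings;

our $pti;    # full path to the item, visible to the compiled expression

# Hypothetical helper: compile a user-supplied Perl expression into a
# code reference. Dies if the expression fails to compile.
sub compile_test {
    my ($expr) = @_;
    my $code = eval "sub { $expr }";
    die "cannot compile '$expr': $@" unless $code;
    return $code;
}

# Hypothetical helper: run a compiled test with $_ set to the filename
# and $pti set to the full path.
sub run_test {
    my ($code, $dir, $name) = @_;
    local $pti = "$dir/$name";
    local $_   = $name;
    return $code->() ? 1 : 0;
}

my $skip = compile_test('m/^\.rsrc$/ || -z $pti || -s $pti > 1048576');
print run_test($skip, '/tmp', '.rsrc') ? "skip\n" : "checksum\n";
```

Compiling the expression once and calling the resulting code reference per file avoids re-parsing the same string for every file. Because the expression is run as Perl code, it should only come from a trusted source.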
To remove duplicate files under a web area via cron, skipping HTML documents, matching only in local directories, and being very quiet:
21 6 * * * dupper.pl -uqzs 'm/\.html$/' /www/example/images
Newer versions of this script may be available from:
If the bug is in the latest version, send a report to the author. Patches that fix problems or add new features are welcome.
Large files will stall the checksum generation. The checksum code needs to be rewritten to read files in chunks.
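A chunked checksum of the kind this item asks for could be sketched as follows. This is an illustration only: the helper name md5_in_chunks and the 64 KB default chunk size are assumptions, not part of the script.

```perl
use strict;
use warnings;
use Digest::MD5;

# Hypothetical helper: checksum a file by feeding fixed-size chunks to
# the digest object, so the whole file is never held in memory.
sub md5_in_chunks {
    my ($path, $chunk_size) = @_;
    $chunk_size ||= 64 * 1024;    # assumed default, not from the script

    open my $fh, '<', $path or die "cannot open $path: $!";
    binmode $fh;

    my $ctx = Digest::MD5->new;
    while (1) {
        my $read = read($fh, my $buffer, $chunk_size);
        die "read error on $path: $!" unless defined $read;
        last if $read == 0;       # end of file
        $ctx->add($buffer);
    }
    close $fh;
    return $ctx->hexdigest;
}

# Usage: print the digest and name of each file given as an argument.
print md5_in_chunks($_), "  $_\n" for @ARGV;
```

The digest is the same whatever chunk size is used, so the size can be tuned for memory use without affecting which files are reported as duplicates.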
Replace the hackish manual directory recursion with File::Find.
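A File::Find based traversal might look like the sketch below. The helper name list_files is hypothetical and nothing here is taken from the script. Inside the callback, $_ holds the filename and $File::Find::name the full path, which is similar to the $_ and $pti pair available to -s and -p.

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: return every regular file found under the given
# directories, as full paths.
sub list_files {
    my (@dirs) = @_;
    return unless @dirs;
    my @files;
    find(
        sub {
            # $_ is the filename, $File::Find::name the full path
            push @files, $File::Find::name if -f $_;
        },
        @dirs
    );
    return @files;
}

# Usage: print every file under the directories given as arguments.
print "$_\n" for list_files(@ARGV);
```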
perl(1)
Jeremy Mates, http://sial.org/contact/
The author disclaims all copyrights and releases this script into the public domain.
$Id: dupper.pl,v 2.7 2003/05/26 17:59:29 jmates Exp $