NAME

dupper.pl - finds duplicate files, optionally removes them


SYNOPSIS

To get a list of duplicate files in a particular directory:

  $ dupper.pl ~/public_html/images


DESCRIPTION

Overview

Script to find (and optionally remove) duplicate files in one or more directories. Duplicates are detected by comparing MD5 checksums of the files.
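The checksum comparison described above can be sketched as follows. This is a minimal illustration using the standard Digest::MD5 module, not the script's actual code; the subroutine names md5_of_file and find_duplicates are hypothetical.

```perl
use strict;
use warnings;
use Digest::MD5;

# Return the hex MD5 digest of a file's contents.
sub md5_of_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "cannot open $path: $!";
    binmode $fh;
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    return $digest;
}

# Group paths by digest and return only the groups that have more
# than one member. Each group is an array reference of paths.
sub find_duplicates {
    my (@paths) = @_;
    my %by_digest;
    for my $path (@paths) {
        push @{ $by_digest{ md5_of_file($path) } }, $path;
    }
    return grep { @$_ > 1 } values %by_digest;
}
```

Files whose digests match are treated as duplicates of one another; files with a unique digest never appear in the result.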

Normal Usage

  $ dupper.pl [options] [dir1 dir2 .. dirN]

See OPTIONS for details on the command line switches supported.

A list of directories to operate on should be specified on the command line. Failing that, the script will attempt to read directories from STDIN.
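The fallback from command-line arguments to STDIN could look roughly like this. It is a sketch only: the subroutine name directories_to_scan is hypothetical, and the one-directory-per-line input format is an assumption, not something this document specifies.

```perl
use strict;
use warnings;

# Return the directories to scan: the command-line arguments if any
# were given, otherwise one directory per line from the input handle.
# (One-per-line input is an assumption made for this sketch.)
sub directories_to_scan {
    my ($args, $in) = @_;
    return @$args if @$args;

    my @dirs;
    while (defined(my $line = <$in>)) {
        chomp $line;
        push @dirs, $line if length $line;    # ignore blank lines
    }
    return @dirs;
}

# Typical call: my @dirs = directories_to_scan(\@ARGV, \*STDIN);
```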


OPTIONS

This script currently supports the following command line switches:

-h, -?
Prints a brief usage note about the script.

-v
Verbose mode, a little bit more chatty.

-q
Quiet mode. Overrides verbose mode and turns off reporting.

-u
Attempt to unlink any duplicates past the first. The default is only to report the duplicate files. Files are added least-depth first, in sort() order, and the first file added is the one kept. In other words, run without -u first and be sure the right files would be deleted.

-l
Local-only mode; script will not recurse into subdirectories of any directories passed to the script.

-g
Make checksums apply across all directories on the command line. Default is to treat the various directories supplied to the program as their own separate realms.

-z
Overrides -g. Limits the scope of checksums so that a file is compared only against other files in the same directory. Much tighter than the default scope.

-s expression
Perl expression that causes the current file (stored in $_) to be skipped if it evaluates to ``true.'' Example:
  -s 'm/^\.rsrc$/ || -z $pti || -s $pti > 1048576'

This skips the checksum on files named '.rsrc', on empty files (the -z file test), and on files larger than a megabyte (1048576 bytes).

-p expression
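Perl expression that causes the current directory (stored in $_) to be pruned out of the tree if it evaluates to true. For example, to prune configuration directories (note that this pattern matches any directory name containing "etc"):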
  -p 'm/etc/'

Both -s and -p have access to the filename in $_, and can find the full filepath in the variable $pti. (Short for ``path to item'' in case you were wondering.)
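One way such an expression could be evaluated is with a string eval that has $_ and $pti in scope. The sketch below assumes that approach and is not taken from the script itself; the subroutine name should_skip is hypothetical.

```perl
use strict;
use warnings;

# Decide whether a file should be skipped. $expr is the user-supplied
# expression string, $name the bare filename, $path the full path.
sub should_skip {
    my ($expr, $name, $path) = @_;
    return 0 unless defined $expr && length $expr;

    local $_ = $name;    # the expression sees the filename in $_
    my $pti = $path;     # and the full path in $pti
    my $result = eval $expr;
    die "bad skip expression: $@" if $@;
    return $result ? 1 : 0;
}
```

Because the expression is run as Perl code, it should only ever come from a trusted source such as the person invoking the script.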


EXAMPLES

To remove duplicate files under a web area via cron, skipping HTML documents, matching only within each directory, and being very quiet:

  21 6 * * * dupper.pl -uqzs 'm/\.html$/' /www/example/images


BUGS

Reporting Bugs

Newer versions of this script may be available from:

http://sial.org/code/perl/

If the bug is in the latest version, send a report to the author. Patches that fix problems or add new features are welcome.

Known Issues

Large files will stall the checksum generation. The checksum code needs to be rewritten to process files in chunks.
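A chunked checksum might look something like the sketch below, which feeds a Digest::MD5 object one buffer at a time so that only a bounded amount of the file is held in memory. The subroutine name md5_in_chunks and the 64 KB default chunk size are assumptions made for illustration.

```perl
use strict;
use warnings;
use Digest::MD5;

# Compute an MD5 hex digest while holding at most $chunk_size bytes
# of the file in memory at a time.
sub md5_in_chunks {
    my ($path, $chunk_size) = @_;
    $chunk_size ||= 64 * 1024;    # hypothetical default

    open my $fh, '<', $path or die "cannot open $path: $!";
    binmode $fh;

    my $md5 = Digest::MD5->new;
    my $buffer;
    while (1) {
        my $got = read $fh, $buffer, $chunk_size;
        die "read error on $path: $!" unless defined $got;
        last if $got == 0;        # end of file
        $md5->add($buffer);
    }
    close $fh;
    return $md5->hexdigest;
}
```

The digest is the same regardless of the chunk size chosen; only memory use and the number of read calls change.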


TODO

Replace the hackish manual directory recurser with File::Find.
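A File::Find based traversal could be sketched along these lines. The subroutine name collect_files and its $local_only flag (standing in for the -l switch) are hypothetical; only the File::Find interface itself is real.

```perl
use strict;
use warnings;
use File::Find;

# Collect regular files under the given directories. When $local_only
# is true, subdirectories are not descended into.
sub collect_files {
    my ($local_only, @dirs) = @_;
    my @files;
    for my $top (@dirs) {
        my $seen_top = 0;
        find(
            {
                no_chdir => 1,    # $_ holds the full path
                wanted   => sub {
                    if (-d $_) {
                        # The first directory visited is $top itself;
                        # in local-only mode, prune every other one.
                        $File::Find::prune = 1
                            if $local_only && $seen_top++;
                        return;
                    }
                    push @files, $_ if -f $_;
                },
            },
            $top
        );
    }
    my @sorted = sort @files;
    return @sorted;
}
```

Setting $File::Find::prune inside the wanted callback is also the natural place to apply a user-supplied prune test.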


SEE ALSO

perl(1)


AUTHOR

Jeremy Mates, http://sial.org/contact/


COPYRIGHT

The author disclaims all copyrights and releases this script into the public domain.


VERSION

  $Id: dupper.pl,v 2.7 2003/05/26 17:59:29 jmates Exp $