I have a bunch of MP3 files stored on my box, and since iTunes doesn’t seem to care whether files already exist in the library, and just imports them regardless of their content, I wrote this little Perl thing to find duplicate files in a directory hierarchy. It’s rather simple, less than 200 lines of Perl code, but it seems to work well.
To find duplicate files, the script traverses a directory tree and computes MD5 and SHA256 hashes of every file. These are stored in a SQLite database, and a second run of the script checks the database for duplicates. So first, build the database:
$ ./finddup.pl -I -p /path/to/data
Then look for duplicates:
$ ./finddup.pl -S -p /path/to/data
The SQLite database will be placed under the given data directory. If no path is given, the current directory is used for the search. This thing doesn’t scale too well on multi-terabyte filesystems, due to SQLite and perhaps my code, but the database part should be trivial to port to something else, like MySQL, PostgreSQL, or whatever you might like.
See http://karlsbakk.net/finddup/ for the script – licensed under GPLv2.