[geeks] listing identical files
Scott Howard
scott at doc.net.au
Mon Jan 10 19:39:48 CST 2005
On Mon, Jan 10, 2005 at 01:48:00PM -0600, Doug McLaren wrote:
> You'd need to cmp every file of a given size against every other file
> of a given size. It's an O(N^2 * S) operation, where N is the number
> of files of that exact size and S is the size.
>
> At least with md5sums you can keep each md5sum in memory and quickly
> compare it against any other file, no matter how big the file size in
> question is.
Attached is a script I wrote some time back. It compares all of the
files in a directory, and when it finds files which are the same
it hard links them - it would be trivial to change it to just list
the files.
It does use a fairly "braindead" approach in that it MD5-sums all
files, where it really should only do the ones where it finds two
files with the same size/owner/etc, but it suited what I wanted...
Scott
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;
use Digest::MD5;

my (@Files, %MasterFile, %MasterINode);

sub Usage
{
    print "Usage : $0 dir\n\n";
    print "Finds all files under 'dir' which are the same, and creates hard-links\n";
    print "between them. 'Same' is defined as having the same owner/group/perms/size/\n";
    print "mtime and MD5 checksum.\n\n";
    print "WARNING: This should only be used on a static filesystem. Do not run it\n";
    print "on files which may change in the future, as changes to one file will\n";
    print "then result in all hard-linked files being changed as well, which is\n";
    print "probably not what you want/expect!\n\n";
    exit 1;
}

# Called by find() for each entry; collects regular files into @Files.
sub douniq
{
    -f || return; # We don't do dirs
    -l && return; # Or symlinks
    push @Files, $File::Find::name;
}

(scalar(@ARGV) == 1) || Usage();
my ($DIR) = @ARGV;
find(\&douniq, $DIR);

foreach my $f (@Files) {
    my ($Dev, $Inode, $Mode, $UID, $GID, $Size, $mtime) = (stat($f))[0,1,2,4,5,7,9];
    next if (!$Size); # Skip empty files (and files we could not stat)

    # Fold the metadata into the digest, so two files only match when they
    # agree on device/mode/owner/group/size/mtime as well as on content.
    my $M5 = Digest::MD5->new;
    $M5->add("$Dev", "$Mode", "$UID", "$GID", "$Size", "$mtime");
    open(my $fh, '<', $f) || next;
    binmode($fh);
    $M5->addfile($fh);
    close($fh);
    my $md5sum = $M5->hexdigest;

    if ($MasterFile{$md5sum}) {
        if ($MasterINode{$md5sum} != $Inode) {
            #print "Files same - Linking ($f to $MasterFile{$md5sum})\n";
            # Link under a temporary name, then rename it over $f, so that
            # $f is never lost if link() fails.
            my $tmp = "$f.tmp.$$";
            if (link($MasterFile{$md5sum}, $tmp)) {
                rename($tmp, $f) || unlink($tmp);
            } else {
                warn "Cannot link $f to $MasterFile{$md5sum}: $!\n";
            }
        }
    } else {
        $MasterFile{$md5sum} = $f;
        $MasterINode{$md5sum} = $Inode;
    }
}
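
The refinement mentioned in the message (only checksum files that share a
size with some other file, and list the matches rather than hard-linking
them) might look roughly like the sketch below. It is not part of the
original attachment: find_duplicates() is a hypothetical helper name, and
as a simplifying assumption it compares size and content only, ignoring
owner, permissions and mtime.

```perl
#!/usr/bin/perl
# Sketch only: list duplicate files, hashing only files that share a size.
# find_duplicates() is a hypothetical helper, not part of the original script.
use strict;
use warnings;
use File::Find;
use Digest::MD5;

sub find_duplicates {
    my ($dir) = @_;

    # Pass 1: bucket regular files by size (lstat, so symlinks are skipped).
    my %by_size;
    find({ no_chdir => 1, wanted => sub {
        my @st = lstat($_) or return;
        return unless -f _;      # regular files only
        return unless $st[7];    # skip empty files, as the original does
        push @{ $by_size{$st[7]} }, $_;
    } }, $dir);

    # Pass 2: MD5 only the files whose size is shared with another file.
    my @groups;
    for my $size (sort { $a <=> $b } keys %by_size) {
        my @same_size = @{ $by_size{$size} };
        next if @same_size < 2;  # a unique size cannot have a duplicate

        my %by_md5;
        for my $file (@same_size) {
            open(my $fh, '<', $file) or next;
            binmode($fh);
            my $sum = Digest::MD5->new->addfile($fh)->hexdigest;
            close($fh);
            push @{ $by_md5{$sum} }, $file;
        }

        # Keep only checksums seen more than once; sort for stable output.
        for my $sum (sort keys %by_md5) {
            my @same = sort @{ $by_md5{$sum} };
            push @groups, \@same if @same > 1;
        }
    }
    return @groups;
}

# When given a directory argument, print one line per group of identical files.
if (@ARGV == 1) {
    print join(" ", @$_), "\n" for find_duplicates($ARGV[0]);
}
```

Files that are already hard-linked to each other will also show up as a
group here, since the sketch does not compare inode numbers the way the
script above does.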