[geeks] listing identical files

Scott Howard scott at doc.net.au
Mon Jan 10 19:39:48 CST 2005


On Mon, Jan 10, 2005 at 01:48:00PM -0600, Doug McLaren wrote:
> You'd need to cmp every file of a given size against every other file
> of a given size.  It's an O(N^2 * S) operation, where N is the number
> of files of that exact size and S is the size.
> 
> At least with md5sums you can keep each md5sum in memory and quickly
> compare it against any other file, no matter how big the file size in
> question is.

Attached is a script I wrote some time back. It compares all of the
files in a directory, and when it finds files which are the same
it hard links them - it would be trivial to change it to just list
the files.
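
For anyone who only wants the listing, a minimal sketch of that variant
might look like the following. It is not the attached script: the names
file_md5 and identical_groups are made up for illustration, it compares
file contents only (owner, permissions and mtime are ignored), and it
treats an MD5 match as "identical" without a follow-up cmp.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;
use Digest::MD5;

# Hypothetical helper: hex MD5 of a file's contents, or undef if unreadable.
sub file_md5 {
    my ($path) = @_;
    open(my $fh, '<', $path) or return undef;
    binmode($fh);
    my $sum = Digest::MD5->new->addfile($fh)->hexdigest;
    close($fh);
    return $sum;
}

# Hypothetical helper: returns a list of array refs, one per set of files
# under $dir whose contents share an MD5 (sets of two or more only).
sub identical_groups {
    my ($dir) = @_;
    my %by_sum;
    find(sub {
        return if -l $_;        # skip symlinks
        return unless -f $_;    # skip directories and other non-files
        return unless -s $_;    # skip empty files, as the attached script does
        my $sum = file_md5($_);
        push @{ $by_sum{$sum} }, $File::Find::name if defined $sum;
    }, $dir);
    return grep { @$_ > 1 } map { [ sort @{ $by_sum{$_} } ] } sort keys %by_sum;
}

# Print each set of identical files, one name per line, sets separated
# by a blank line.
if (@ARGV == 1) {
    print join("\n", @$_), "\n\n" for identical_groups($ARGV[0]);
}
```

Nothing is unlinked or linked here, so it should be safe to run on a live
filesystem.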

It does use a fairly "braindead" approach in that it MD5-sums all
files, when it really should only do the ones where it finds two
files with the same size/owner/etc., but it suited what I wanted...
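
A rough sketch of that smarter pre-filter, assuming size alone is used as
the key (adding owner, mode and so on to the key would be a small change).
The name size_candidates is hypothetical and is not part of the attached
script; only the files it returns would need to be checksummed.

```perl
use strict;
use warnings;

# Hypothetical helper: given a list of file names, return (sorted) only
# those whose size is shared with at least one other file in the list.
# A file with a unique size cannot have a duplicate, so it never needs
# to be read at all.
sub size_candidates {
    my @files = @_;
    my %by_size;
    for my $f (@files) {
        my $size = (lstat($f))[7];
        # Skip unreadable entries, non-regular files (including symlinks)
        # and empty files.
        next unless defined $size && -f _ && $size > 0;
        push @{ $by_size{$size} }, $f;
    }
    my @keep = map { @$_ } grep { @$_ > 1 } values %by_size;
    return sort @keep;
}
```

With this in place the checksum loop would run over
size_candidates(@Files) instead of @Files.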

  Scott
#!/usr/bin/perl

use File::Find;
use Digest::MD5;

sub Usage
{
	print "Usage : $0 dir\n\n";
	print "Finds all files under 'dir' which are the same, and creates hard-links\n";
	print "between them.  'Same' is defined as having the same device/perms/owner/group/\n";
	print "size/mtime and MD5 checksum.\n\n";
	print "WARNING: This should only be used on a static filesystem. Do not run it\n";
	print "on files which may change in the future, as changes to one file will\n";
	print "then result in all hard-linked files being changed as well, which is\n";
	print "probably not what you want/expect!\n\n";
	exit 1;
}

sub douniq
{
	-f || return;   # We don't do dirs
	-l && return;	# Or symlinks
	$f = $File::Find::name;
	push @Files, $f;
}


(scalar(@ARGV) == 1) || Usage;

($DIR) = @ARGV;

find \&douniq, $DIR;

foreach $f (@Files) {
	($Dev, $Inode, $Mode, $UID, $GID, $Size, $mtime) = (stat($f))[0,1,2,4,5,7,9];
	next if (!$Size);
	my $M5=Digest::MD5->new;
	# NUL separators keep adjacent fields from running together (e.g. "1","23" vs "12","3")
	$M5->add(join("\0", $Dev, $Mode, $UID, $GID, $Size, $mtime), "\0");
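	# Three-argument open, so odd characters in a file name are not treated as a mode or pipe
	open(my $fh, '<', $f) || next;
	binmode($fh);
	$M5->addfile($fh);
	close($fh);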
	$md5sum=$M5->hexdigest;

	if ($MasterFile{$md5sum}) {
		if ($MasterINode{$md5sum} != $Inode) {
			#print "Files same - Linking ($f to $MasterFile{$md5sum})\n";
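			# Link under a temporary name first, so $f is not lost if link() fails
			my $tmp = "$f.tmp." . $$;
			if (link($MasterFile{$md5sum}, $tmp)) {
				rename($tmp, $f) || unlink($tmp);
			} else {
				warn "Cannot link $f to $MasterFile{$md5sum}: $!\n";
			}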
		}
	} else {
		$MasterFile{$md5sum}=$f;
		$MasterINode{$md5sum}=$Inode;
	}

}


