[geeks] listing identical files

Doug McLaren dougmc+sunhelp at frenzy.com
Mon Jan 10 13:48:00 CST 2005


On Fri, Nov 19, 2004 at 10:29:59PM +0100, Jochen Kunz wrote:

| On Fri, 19 Nov 2004 12:56:15 -0500 (EST)
| Kevin <kevin at mpcf.com> wrote:
| 
| > Anyone know of a way of listing all the identical files in a
| > directory?
| Look for files of identical size and then use cmp(1) on them? Uses less
| CPU than md5(1).

Only if the number of identical files is very small.

You'd need to cmp every file of a given size against every other file
of that same size.  It's an O(N^2 * S) operation, where N is the number
of files of that exact size and S is the size.
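
To put rough numbers on that, here is a small back-of-the-envelope
sketch.  The function names are made up for illustration, and it assumes
the worst case where every pair turns out identical, so cmp has to read
both files to the end (cmp stops at the first differing byte, so real
runs on mostly-different files would read far less).

```python
def pairwise_comparisons(n: int) -> int:
    """Number of distinct pairs among n files of the same size."""
    return n * (n - 1) // 2


def worst_case_bytes_read(n: int, size: int) -> int:
    """Bytes read if every pair is compared to the end.

    Each comparison reads both files in full, hence the factor of 2.
    """
    return pairwise_comparisons(n) * 2 * size


if __name__ == "__main__":
    # 100 files of 1 MB each, all the same size.
    print(pairwise_comparisons(100))
    print(worst_case_bytes_read(100, 1_000_000))
```

Hashing the same 100 files reads each one exactly once, which is where
the md5sum approach wins as N grows.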

At least with md5sums you can keep each md5sum in memory and quickly
compare it against any other file, no matter how big the file size in
question is.

There are several programs out there that will find identical files.
Usually they first look at file sizes, then get md5sums if needed, and
then maybe do cmp calls to make *sure* that the files they report as
identical are identical -- but that final step probably isn't needed
in most cases.  (Yes, I'm aware that people can allegedly cause md5
collisions.  It's still pretty hard to find a collision when you're
not trying to force one.  And if you don't like md5sum, pick something
else.)
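
The size, then md5sum, then optional cmp sequence could be sketched in
Python roughly as below.  This is only an illustration, not how any
particular existing program does it: the names find_identical and
md5_of are invented here, it looks at a single directory without
recursing, and it skips symlinks.

```python
import filecmp
import hashlib
import os
from collections import defaultdict


def md5_of(path, chunk_size=1 << 20):
    """Return the hex md5 of a file, reading it in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()


def find_identical(directory, verify=False):
    """Return a list of groups (sorted lists of paths) of identical files."""
    # Step 1: bucket by size; a file with a unique size cannot have a twin.
    by_size = defaultdict(list)
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and not os.path.islink(path):
            by_size[os.path.getsize(path)].append(path)

    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        # Step 2: hash only the files that share a size with another file.
        by_digest = defaultdict(list)
        for path in paths:
            by_digest[md5_of(path)].append(path)
        for same in by_digest.values():
            if len(same) < 2:
                continue
            if verify:
                # Step 3 (optional): byte-for-byte check against the first
                # file.  Files that fail the check are simply dropped here.
                first = same[0]
                same = [first] + [
                    p for p in same[1:]
                    if filecmp.cmp(first, p, shallow=False)
                ]
            if len(same) > 1:
                groups.append(sorted(same))
    return groups
```

Calling find_identical(".", verify=True) would return something like a
list of lists of paths, one inner list per set of identical files.
Swapping hashlib.md5 for hashlib.sha256 is a one-line change if md5
bothers you.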

It could also be sped up somewhat by including an intermediate step,
where you only get an md5sum for the first 16k or so of each file, and
if that matches then you get full md5sums if needed ...
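
A rough sketch of that intermediate step, assuming you already have a
list of paths that are all the same size.  The helper names and the
exact 16k constant are just placeholders for illustration:

```python
import hashlib
from collections import defaultdict

PREFIX_BYTES = 16 * 1024  # "the first 16k or so"


def md5_prefix(path, nbytes=PREFIX_BYTES):
    """md5 of only the first nbytes of the file."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read(nbytes)).hexdigest()


def md5_full(path, chunk_size=1 << 20):
    """md5 of the whole file, read in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()


def split_by(paths, key):
    """Bucket paths by key(path); keep only buckets with 2+ members."""
    buckets = defaultdict(list)
    for p in paths:
        buckets[key(p)].append(p)
    return [b for b in buckets.values() if len(b) > 1]


def identical_among_same_size(paths):
    """Given same-sized files, return groups that have equal full md5sums."""
    result = []
    # Cheap pass: files whose first 16k differ are ruled out right away.
    for candidates in split_by(paths, md5_prefix):
        # Expensive pass: full hash only for files that survived the cheap one.
        result.extend(split_by(candidates, md5_full))
    return result
```

The payoff depends on the data: for big files that differ early on, the
full read is skipped entirely, while for files no larger than 16k the
prefix pass already reads everything and the second pass is redundant.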

-- 
Doug McLaren, dougmc at frenzy.com
I'm intrigued by your ideas, and would like to subscribe to your newsletter.


