[geeks] script language advice

Jonathan C. Patschke jp at celestrion.net
Sat Feb 2 01:35:44 CST 2008


On Fri, 1 Feb 2008, Nadine Miller wrote:

> What language would the collective brain recommend for a script to
> parse lines of up to 7500 chars in length?  I'm leaning towards shell
> or php since I've been doing a lot of tinkering with those of late,
> and my perl is very weak.
>
> I'm trying to sort out duplicate files from 3 computers that I've
> consolidated on one.

I'd use Perl, along with Digest::SHA and Digest::MD5.  Troll the entire
set of files and store the size, MD5 hash, and SHA-1 hash for each file.
Files that have all three fields the same are probably identical.

Since Digest::SHA's addfile can take a filename outright (Digest::MD5's
version wants an open handle) and stat works directly on filenames,
it'd be almost no programming at all beyond list manipulation.
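As a sketch of that fingerprinting step -- shown in Python for
illustration, since hashlib mirrors what the Digest modules do, and
fingerprint() is just a name I've picked:

```python
import hashlib
import os

def fingerprint(path):
    """Return the (size, MD5, SHA-1) triple used to spot duplicates."""
    size = os.stat(path).st_size
    md5 = hashlib.md5()
    sha1 = hashlib.sha1()
    # Read in chunks so a big file never has to fit in memory.
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            md5.update(chunk)
            sha1.update(chunk)
    return (size, md5.hexdigest(), sha1.hexdigest())
```

Two files that agree on all three fields are, for practical purposes,
the same file.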

> Aside from the line lengths, the biggest bear is that the filesystems
> are fat32, so there's a lot of unusual characters (rsync choked on "?"
> for example) and spaces in the file paths.

I would start with a Unix-like system that can mount SMB shares, create
one directory with three subdirectories, and mount the interesting bits
from each computer onto its own subdirectory.  Then point your script
at the common directory and recurse.

The relevant documentation is here:
     http://perldoc.perl.org/functions/stat.html
     http://perldoc.perl.org/Digest/MD5.html
     http://perldoc.perl.org/Digest/SHA.html
     http://perldoc.perl.org/functions/readdir.html

You'd want to opendir the common directory, grab the list of files
within, recurse on anything that's a directory, and hash anything that
isn't.  Then, sort the list by file size and compare the hashes of
entries with the same size.
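The whole pass could look like this -- again Python for illustration;
the Perl version would use opendir/readdir and the Digest modules the
same way, and walk() and duplicates() are names I've invented:

```python
import hashlib
import os
from collections import defaultdict

def walk(directory):
    """Recurse as described: list the directory, descend into
    subdirectories, and yield every plain file."""
    for entry in os.listdir(directory):
        path = os.path.join(directory, entry)
        if os.path.isdir(path):
            yield from walk(path)
        elif os.path.isfile(path):
            yield path

def duplicates(root):
    """Group files by (size, MD5, SHA-1); any group with more than one
    member is a probable set of duplicates."""
    groups = defaultdict(list)
    for path in walk(root):
        # Whole-file reads keep the sketch short; a real script would
        # hash in chunks.
        data = open(path, "rb").read()
        key = (len(data), hashlib.md5(data).hexdigest(),
               hashlib.sha1(data).hexdigest())
        groups[key].append(path)
    return [g for g in groups.values() if len(g) > 1]
```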

If you wanted to optimize for speed, you could grab the filenames and
sizes in one pass, and then hash only the files with non-unique sizes.
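That optimization is worth it because stat is cheap and hashing is not.
A sketch (Python for illustration; hash_candidates() is a name I've
made up):

```python
import os
from collections import defaultdict

def hash_candidates(paths):
    """Return only the files whose size collides with another file's --
    the only possible duplicates, and the only ones worth hashing."""
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.stat(path).st_size].append(path)
    return [p for group in by_size.values()
            if len(group) > 1
            for p in group]
```

Everything this filter drops has a unique size and so cannot be a
duplicate of anything.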

You could write a similar program in PHP, but its md5() and sha1()
functions only operate on strings, so you'd either have to read and
buffer each file yourself or use md5_file() and sha1_file(), which
read a file directly.

This all assumes, of course, that your duplicate files are identical,
and not just similar in some way.

-- 
Jonathan Patschke
Elgin, TX
USA


