[geeks] script language advice

Nadine Miller velociraptor at gmail.com
Sat Feb 2 12:12:07 CST 2008


Shannon Hendrix wrote:
> On Feb 1, 2008, at 10:53 PM, Nadine Miller wrote:
> 
>> What language would the collective brain recommend for a script to  
>> parse
>> lines of up to 7500 chars in length?  I'm leaning towards shell or php
>> since I've been doing a lot of tinkering with those of late, and my  
>> perl
>> is very weak.
> 
> PHP is the Microsoft of programming languages, but it's your brain... :)

Well, I'm no genius at any scripting language, but PHP is pretty much a 
necessity if you work with OSS web apps, which I've been doing lately 
for my own sites.  So its syntax is "fresh" in my mind.  I haven't 
touched any perl in a long time.

> Shell is very powerful, but also very slow and looks like line noise.
> 
> Do you care what the names of the duplicate file listings are?
> 
> The basic algorithm:
> 
[snip pseudo code]

>> Aside from the line lengths, the biggest bear is that the filesystems
>> are fat32, so there's a lot of unusual characters (rsync choked on "?"
>> for example) and spaces in the file paths.
> 
> How did you get such long lengths from fat32?
> 
> I thought it had a 256 character total limit?

Based on your and Jonathan's responses, apparently I didn't make it 
clear that I have the list of duplicates--fslint generated that already. 
Now I need to parse the output of fslint to get down to one copy per file.

Each set of dupes is listed on a single line, thus the need to parse 
such long lines.  A pseudo-example:

num of dupes * filesize  /path/to/file/filename /path/to/file/filename2 /path/to/filename3 [...]

So, for a file that has 1632 copies (yes, even I did a double-take at 
that, and I'm a pack rat :) ), even if the file's path is short, the 
line length is humongous.
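
Whatever ends up doing the parsing, the split itself is simple.  A 
minimal awk sketch, assuming the list lives in a file called 
duplist.txt, the fields are whitespace-separated as shown above, and 
-- the big caveat -- no path contains an embedded space (untrue on 
these FAT32 volumes, so a one-path-per-line listing from fslint would 
be safer):

    # $1 = dupe count, $2 = "*", $3 = file size, $4..NF = paths.
    # Keep the first copy ($4) and print the rest for later deletion.
    awk '{ for (i = 5; i <= NF; i++) print $i }' duplist.txt > to-delete.txt

Deleting with 'while IFS= read -r f; do rm -- "$f"; done < to-delete.txt' 
then takes exactly one path per line, spaces and all.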

I also have to be very careful, as this is me cleaning up my dad's 
computer data--he passed away recently, and I'm the only one equipped to 
handle this task.  Unfortunately, his backup plan was "make copies 
everywhere".

I guess I should have explicitly asked: what language and/or 
command-line utilities can parse lines of ~7500 chars without choking? 
Can (g)awk handle a line 7500 chars long?
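
For what it's worth: gawk allocates records dynamically and has no 
intrinsic limit on record length (it's the old fixed-buffer awks one 
had to worry about), so 7500 chars is nowhere near trouble.  It's easy 
to verify on the box itself:

    # Build one 7500-character line and have gawk report its length;
    # if this prints 7500, long lines are a non-issue.
    printf '%7500s\n' ' ' | tr ' ' 'x' | gawk '{ print length($0) }'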

Now that I think about the process though, I realize I'm trying to 
reinvent the wheel.  I could run fslint with the delete option on 
smaller, related subsets of directories to remove duplicates before 
running the tool over the entire fileset.
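
Something like this loop is what I'm picturing (the directory names 
are made up, and I haven't verified findup's delete switch, so treat 
-d as an assumption to check against findup --help first):

    # Hypothetical sketch: run fslint's findup over one subtree at a
    # time.  The -d "delete the extra copies" flag is an assumption --
    # verify it in findup --help before pointing this at real data!
    for d in /mnt/dad/photos /mnt/dad/music /mnt/dad/docs; do
        findup -d "$d"
    done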

I had never used fslint before, so I wanted to make sure of the method 
it uses to find dupes, since the documentation is lacking.  I just 
generated the list of dupes on the first run.  I monitored it while it 
was running, and it generates both MD5 and SHA-1 sums to determine 
duplicate files.  I don't think running the GUI over the entire fileset 
is wise, since the box it's on is somewhat memory-restricted, and there 
are in excess of 135K duplicate files.
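
For the curious, the general hash-and-sort technique is cheap to 
reproduce in shell, and it stays memory-friendly because sort(1) 
spills to disk rather than holding everything in RAM.  This isn't 
fslint's exact pipeline, just the same idea, with NUL separators to 
survive the odd FAT32 names (the mount point is made up):

    # Checksum every file, sort by hash, and print the groups that
    # share a sum.  -w32 compares only the 32-char MD5 digest;
    # --all-repeated=separate prints each duplicate group separated
    # by a blank line (both are GNU uniq options).
    find /mnt/dad -type f -print0 \
        | xargs -0 md5sum \
        | sort \
        | uniq -w32 --all-repeated=separate

A second pass with sha1sum over just those candidate groups would 
mimic fslint's double-check.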

Sorry for the noise.

=Nadine=


