[geeks] PDF: How to tell if a PDF file is text or image only

Charles Shannon Hendrix shannon at widomaker.com
Tue Dec 12 15:12:45 CST 2006


I need to be able to tell in a shell script if a PDF file is text based
or image based.

What I mean is that some documents are actual PDF text documents, while
others are image-only scans with no text.

I have software that processes text files but fails on an image-only
PDF file, so I want to detect those early and set them aside for
processing by hand.

Any ideas?

I've thought of scanning the header for "/Subtype /Image", but I'm not
sure that's used only to describe an image-only document: it might
appear for any image section or image in a PDF file.
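
A minimal sketch of one possible alternative, offered as a starting point
rather than a tested solution: extract the text and see whether anything
comes out. It assumes the pdftotext utility (from xpdf or poppler-utils) is
installed. The function names and the default threshold of 20 bytes are made
up for illustration and would need tuning against real documents.

```sh
#!/bin/sh
# Sketch: classify a PDF as "text" or "image" by how much text pdftotext extracts.
# Assumes pdftotext (xpdf or poppler-utils) is on PATH.
# Function names and the default threshold of 20 are illustrative only.

# Count the non-whitespace bytes arriving on stdin.
count_nonspace_bytes() {
    tr -d '[:space:]' | wc -c | tr -d ' '
}

# Print "text" if the count in $1 reaches the threshold in $2 (default 20),
# otherwise print "image".
classify_count() {
    if [ "$1" -ge "${2:-20}" ]; then
        echo text
    else
        echo image
    fi
}

# Classify the PDF named in $1, with an optional threshold in $2.
# Returns 2 and prints nothing if pdftotext is missing or fails.
classify_pdf() {
    extracted=$(pdftotext "$1" - 2>/dev/null) || return 2
    classify_count "$(printf '%s' "$extracted" | count_nonspace_bytes)" "$2"
}
```

In a batch script, each file could be passed to classify_pdf and moved to a
holding directory when the result is "image". A scan that carries an OCR text
layer would be reported as "text" by this check, so it is only a heuristic.
Running pdffonts on the file and checking for an empty font list may be
another signal worth trying.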

Anyway... ideas and even useful rants are welcome.


-- 
shannon "AT" widomaker.com -- ["Meddle not in the affairs of Wizards, for
thou art crunchy, and taste good with ketchup." -- unknown]


