Analyse Image-Based PDF Files for Text Content (Great for Fax Services)

If you have to analyse image-based text files (for example from a fax or a scanner), you could try OCR software such as tesseract. It works quite well and turns these files into a completely new source of information.

For myself, I wrote a script that takes incoming fax (PDF) files, converts them to text, and stores them in my wiki (as a file annotated with its textual content). That makes them searchable and semantically annotatable.

I simply use the following commands (see attached file):


if [ $# -eq 1 ]; then
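
For readers without the attached file, here is a minimal sketch of how such a conversion could look. It is not the attached script, only an illustration under assumptions: it assumes pdftoppm (from poppler-utils) and tesseract are installed, that the faxes are in English, and the function names text_name_for and ocr_pdf are made up for this example. The wiki upload step is left out because it depends on the wiki in use.

```shell
#!/bin/sh
# Sketch only: OCR an image-based PDF into a plain-text file.
# Assumes pdftoppm (poppler-utils) and tesseract are on the PATH.

# Hypothetical helper: derive the text file name, e.g. fax.pdf -> fax.txt
text_name_for() {
    printf '%s.txt\n' "${1%.pdf}"
}

# Hypothetical helper: render each page as a 300 dpi PNG, OCR it,
# and append the recognised text to one output file.
ocr_pdf() {
    pdf=$1
    out=$(text_name_for "$pdf")
    tmpdir=$(mktemp -d) || return 1
    pdftoppm -r 300 -png "$pdf" "$tmpdir/page" || { rm -rf "$tmpdir"; return 1; }
    : > "$out"
    for img in "$tmpdir"/page-*.png; do
        # "stdout" as the output base makes tesseract print the text;
        # "-l eng" is an assumption about the fax language.
        tesseract "$img" stdout -l eng >> "$out"
    done
    rm -rf "$tmpdir"
    printf '%s\n' "$out"
}

# Same guard as above: only run when exactly one argument is given.
if [ "$#" -eq 1 ]; then
    ocr_pdf "$1"
fi
```

Saved under a name of your choice and called with a single PDF path, the sketch writes the recognised text next to the PDF and prints the name of the text file, which can then be attached to the wiki page. Rendering at 300 dpi is a common starting point for OCR; low-quality faxes may need a different resolution.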

