Analyse image-based PDF files for text content (great for fax services)

If you have to analyse image-based text files (for example from a fax or a scanner), you could try OCR software such as tesseract. It works quite well and gives you a completely new source of information.

For myself, I made a script that takes incoming fax (PDF) files, converts them and stores them in my wiki (as a file annotated with the textual content), so they are searchable and can be annotated semantically.

I simply use the following commands (see attached file):


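#!/bin/sh
# Assumption: WDIR is not defined in the snippet as posted; any writable
# working directory will do.
WDIR="${WDIR:-/tmp/fax-ocr}"

if [ $# -eq 1 ]; then
    PDF="${1##*/}"    # extract filename from full path
    echo "$PDF"
    mkdir -p "$WDIR"
    cp "$1" "$WDIR"
    cd "$WDIR" || exit 1

    if [ -f "$PDF" ]; then
        echo 'converting'
        # render the PDF at 300 dpi; a multi-page PDF yields output-0.png, output-1.png, ...
        convert -density 300x300 "$PDF" output.png
        for f in *.png; do
            # tesseract appends .txt to the output base name itself
            tesseract "$f" "${f%.png}" -l deu
        done

        # collect the recognised text of all pages in one file
        for f in *.txt; do
            cat "$f" >> content.file
        done
        exit 0
    else
        echo 'aborted. could not determine file.'
        exit 1
    fi
else
    echo 'Usage: FILENAME'
    exit 1
fi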
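
One thing to watch with the loops above: convert names the pages of a multi-page PDF output-0.png, output-1.png and so on, so a plain glob lists page 10 before page 2. The following is only a sketch of one way to merge the text files in page order. The function name merge_pages and its two arguments are made up for illustration and are not part of the attached script.

```shell
#!/bin/sh
# Hypothetical helper: concatenate per-page text files in numeric page order.
# $1 = directory holding the output*.txt files, $2 = file receiving the merged text
merge_pages() {
    dir="$1"
    target="$2"
    (
        cd "$dir" || exit 1
        # sort on the number after the "-" so that page 10 follows page 9
        printf '%s\n' output*.txt | sort -t- -k2,2n |
        while IFS= read -r page; do
            cat "$page"
        done
    ) > "$target"
}
```

It could be called as merge_pages "$WDIR" "$WDIR/content.file" in place of the second loop. The target name must not itself match output*.txt, otherwise it would be read as one of the pages.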

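If you want to use this, pay attention to the "-l deu" parameter, which tells tesseract to use its German language data when recognising the text in the image files. Change it if your documents are in another language (for example "-l eng" for English).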
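
If you do not want the language hard-coded, a small wrapper can take it as an argument and first check that the language data is installed. This is a rough sketch, not what my script does: the function names lang_in_list and ocr_page are invented here, and the check only handles a single language code. It relies on tesseract --list-langs printing one code per line.

```shell
#!/bin/sh
# Hypothetical helper: read a language listing on stdin (one code per line,
# as printed by `tesseract --list-langs`) and succeed if code $1 is in it.
lang_in_list() {
    grep -qx -- "$1"
}

# Hypothetical wrapper: OCR one image with a configurable language.
# $1 = image file, $2 = language code (defaults to deu)
ocr_page() {
    img="$1"
    lang="${2:-deu}"
    # some tesseract releases print the listing on stderr, hence 2>&1
    if ! tesseract --list-langs 2>&1 | lang_in_list "$lang"; then
        echo "language data for '$lang' is not installed" >&2
        return 1
    fi
    # tesseract appends .txt to the output base name itself
    tesseract "$img" "${img%.*}" -l "$lang"
}
```

With this, ocr_page output-0.png eng would write output-0.txt using the English data, and ocr_page output-0.png alone keeps the German default.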