On Thursday 13 December 2007, StephenW wrote:
--- Roger Oberholtzer wrote:
Hello,
We have a network printer that scans documents and sends them as PDF files to an e-mail address in the company. Is there any software for openSUSE 10.3 that can do OCR on a PDF document? I am guessing that the PDF contains TIFF images of the scanned pages. Any and all pointers are welcome.
I had to do much the same in the past. A quick bash script seemed like the best way to solve it:

1. Use pdftoppm to convert the page images in the PDF to PPM files in a new directory.
2. Use ppm2tiff on all the extracted PPM files.
3. Use tesseract (or whatever it's called these days) on the TIFF files.
4. Append the text files to a single text file (or leave them separate, whatever).

There's probably a much more sensible way of doing this :-) but this worked consistently for me for quite a number of documents scanned and sent as PDF.

Ciaran

-- 
SUSE LINUX Products GmbH
GF: Markus Rex
HRB 16746 (AG Nuremberg)
Maxfeldstrasse 5
90409, Nuremberg
Tel: +49 911 74053 262
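For what it's worth, the four steps above could be sketched roughly like this. It is not the original script, only a minimal reconstruction under a few assumptions: that the PDF tool meant is pdftoppm (from xpdf/poppler), that ppm2tiff is the one shipped with the libtiff tools, and that tesseract accepts the resulting TIFF files. The function name ocr_pdf, the 300 dpi setting and the "page" file prefix are my own choices for illustration.

```shell
#!/bin/sh
# Sketch: OCR a scanned PDF via PPM and TIFF intermediates.
# Assumes pdftoppm, ppm2tiff and tesseract are on the PATH.

# ocr_pdf INPUT.pdf WORKDIR OUTPUT.txt   (hypothetical helper name)
ocr_pdf() {
    pdf="$1"; workdir="$2"; out="$3"

    mkdir -p "$workdir" || return 1

    # 1. Render each page to WORKDIR/page-<n>.ppm (300 dpi is an assumption).
    pdftoppm -r 300 "$pdf" "$workdir/page" || return 1

    # Start with an empty output file.
    : > "$out"

    for ppm in "$workdir"/page-*.ppm; do
        [ -e "$ppm" ] || continue       # no pages were produced
        base="${ppm%.ppm}"

        # 2. Convert the PPM to TIFF.
        ppm2tiff "$ppm" "$base.tif" || return 1

        # 3. OCR the TIFF; tesseract writes its text to "$base.txt".
        tesseract "$base.tif" "$base" || return 1

        # 4. Append this page's text to the combined file.
        cat "$base.txt" >> "$out"
    done
}

# Run only when called with exactly three arguments.
if [ "$#" -eq 3 ]; then
    ocr_pdf "$@"
fi
```

Saved as, say, ocr_pdf.sh, it would be run as "sh ocr_pdf.sh scan.pdf work scan.txt". The per-page .txt files stay in the work directory, so you can skip the combined file if you prefer them separate. Newer tesseract builds can usually read the PPM files directly, in which case step 2 may not be needed at all.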