On Dec 13, 07 09:19:11 +0100, Ciaran Farrell wrote:
On Thursday 13 December 2007, StephenW wrote:
--- Roger Oberholtzer wrote:
Hello,
We have a network printer that scans documents and emails them as PDF files to an address in the company. Is there any software in openSUSE 10.3 that can do OCR on a PDF? I am guessing that the PDF contains TIFF images of the scanned pages. Any and all pointers are welcome.
I had to do much the same in the past - a quick bash script seemed like the best way to solve it:
1. use pdftoppm to render the PDF pages as PPM images in a new directory
2. use ppm2tiff on all the extracted PPM files
3. use tesseract (or whatever it's called these days) on the TIFF files
4. append the text files to a single text file (or leave them separate, whatever)
There's probably a much more sensible way of doing this :-) but this worked consistently for me for quite a number of documents scanned and sent as pdf.
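The four steps above could be sketched as a small bash function. This is only a sketch under assumptions, not the script that was originally used: the function name ocr_pdf, the 300 dpi resolution and the "page" file prefix are made up for illustration, and it assumes pdftoppm, ppm2tiff and tesseract are installed and on PATH.

```shell
# Hypothetical helper: ocr_pdf INPUT.pdf OUTPUT.txt
# Assumes pdftoppm (poppler/xpdf), ppm2tiff (libtiff tools) and tesseract exist.
ocr_pdf() {
    local pdf="$1" out="$2"
    local work ppm base

    # Scratch directory for the intermediate images and text files
    work=$(mktemp -d) || return 1

    # 1. Render every page to a PPM file (page-1.ppm, page-2.ppm, ...).
    #    300 dpi is an assumed value; page numbers are zero-padded by
    #    pdftoppm when needed, so the glob below stays in page order.
    pdftoppm -r 300 "$pdf" "$work/page" || return 1

    : > "$out"    # start with an empty result file

    for ppm in "$work"/page-*.ppm; do
        [ -e "$ppm" ] || continue    # no pages were produced
        base="${ppm%.ppm}"

        # 2. Convert the PPM image to TIFF
        ppm2tiff "$ppm" "$base.tif" || return 1

        # 3. Run OCR; tesseract writes its result to "$base.txt"
        tesseract "$base.tif" "$base" || return 1

        # 4. Append this page's text to the combined file
        cat "$base.txt" >> "$out"
    done

    # Clean up (on an earlier failure the directory is left for inspection)
    rm -rf "$work"
}
```

After sourcing the function, something like `ocr_pdf scan.pdf scan.txt` would leave the recognized text of all pages in scan.txt. Drop step 4 and copy the per-page .txt files out of the scratch directory instead if separate files are preferred.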
This is already the best approach, afaik. I assume ocropus helps with layout issues such as multi-column text. Any volunteers who want to try out ocropus? I see rpm packages in http://download.opensuse.org/repositories/home:/StefanBruens

cheers,
Jw.
--
Juergen Weigert <jw@suse.de>, SUSE LINUX Products GmbH