Re: [opensuse] PDF OCR

13 Dec 2007

      On Dec 13, 07 09:19:11 +0100, Ciaran Farrell wrote:
...
On Thursday 13 December 2007, StephenW wrote:
...
--- Roger Oberholtzer  wrote:
...
Hello
We have a network printer that will scan docs and send them as pdf docs
to an e-mail address in the company. Is there any software with OpenSUSE
10.3 that can do OCR from a PDF doc? I am guessing that the doc contains
tiff images of the scanned documents. Any and all pointers are welcome.
I had to do much the same in the past - a quick bash script seemed like the 
best way to solve it:
1. use pdftoppm to convert the pages of the pdf to ppm images in a new directory
2. use ppm2tiff on all the resulting ppm files
3. use tesseract or whatever it's called these days on the tiff files
4. append the text files to a single text file (or leave them separate, 
whatever)
There's probably a much more sensible way of doing this :-) but this worked 
consistently for me for quite a number of documents scanned and sent as pdf.
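The four steps above can be sketched roughly as follows. This is not the
original script, only a minimal illustration. It assumes that the first step
refers to pdftoppm (from xpdf/poppler), that ppm2tiff comes from the libtiff
tools, and that tesseract is on the PATH. The function name ocr_pdf, the 300 dpi
resolution, and the file names are made up for the example.

```shell
#!/bin/bash
# Sketch: OCR a scanned PDF into one text file.
# Assumed tools: pdftoppm (xpdf/poppler), ppm2tiff (libtiff tools), tesseract.
# ocr_pdf is a hypothetical helper name, not part of any package.

ocr_pdf() {
    local pdf="$1" out="$2"
    local workdir ppm base
    workdir=$(mktemp -d) || return 1

    # 1. Render every page of the PDF to a PPM image (300 dpi is a guess
    #    that usually suits OCR). Files are named page-<n>.ppm.
    pdftoppm -r 300 "$pdf" "$workdir/page" || return 1

    : > "$out"   # start with an empty result file
    for ppm in "$workdir"/page-*.ppm; do
        [ -e "$ppm" ] || continue
        base="${ppm%.ppm}"

        # 2. Convert the PPM image to TIFF.
        ppm2tiff "$ppm" "$base.tif" || return 1

        # 3. Run OCR; tesseract writes its result to "$base.txt".
        tesseract "$base.tif" "$base" || return 1

        # 4. Append this page's text to the single output file.
        cat "$base.txt" >> "$out"
    done

    rm -rf "$workdir"
}

# Usage: ./ocr_pdf.sh scan.pdf scan.txt
if [ "$#" -eq 2 ]; then
    ocr_pdf "$1" "$2"
fi
```

pdftoppm zero-pads the page numbers, so the shell glob should process the pages
in order. Newer tesseract builds can usually read PPM or PNG directly, in which
case step 2 could be dropped.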
This is already the best approach, afaik.
I assume ocropus helps with layout issues such as multi-column text.

Any volunteers who want to try out ocropus?
I see rpm packages in
http://download.opensuse.org/repositories/home:/StefanBruens

        cheers,
                Jw.

-- 
 o \  Juergen Weigert  paint it green! __/ _=======.=======_
<V> | jw@suse.de       wide open suse_/        _---|____________\/
 \  | 0911 74053-508         (tm)__/          (____/            /\
(/) | __________________________/             _/ \_ vim:set sw=2 wm=8
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nuernberg)
"Novell is committed to creating a work environment that embraces clarity."
