[opensuse] OCR and batches

Anders Norrbring

18 Mar 2007

09:57

Does anybody know of a way to scan several thousand pictures on disk with an OCR application to look for a specific text, and then list the images where that text was found?

-- 
Anders Norrbring
Norrbring Consulting

-- 
To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org
For additional commands, e-mail: opensuse+help@opensuse.org


Kai Ponte

18 Mar

14:08

On Sunday 18 March 2007 02:57:49 am Anders Norrbring wrote:


Does anybody know of a way to scan several thousands pictures on disk with an OCR application to look for a specific text, and then list the images where that text was found?

I've been doing apps like that for the better part of twelve years. However, I've yet to see an OCR app in Linux. That doesn't mean they don't exist, however, because I'm stuck in a predominantly Windows world at work. I know we currently use either OCR For Anydocs or Kofax Ascent. In fact, we're looking at replacing our current systems in the next few years.

What you will need to do is probably write some program to take the imaged documents - done so with whatever scanner you've got - and then process the documents through the OCR engine supplied by the manufacturer. Typically this is a library like the AVI or MP3 libraries used by your most commonly requested SUSE applications.

Keep in mind that you'll also need a retrieval program of some sort, to actually get the documents and view them - along with the OCR data - in some manner. This is one I wrote in 2003, which combined OCR from barcode and an imaging application based on FileNet:

http://www.filesite.org/viewtopic.php?t=173

You didn't mention whether you're doing spot or forms recognition or full-text OCR. You might also look at barcode recognition, because those are VERY reliable, even over fax.

Try these links for the OCR software:

Google apparently has an OCR engine that is now OSS..

http://google-code-updates.blogspot.com/2006/08/announcing-tesseract-ocr.htm...

http://sourceforge.net/projects/tesseract-ocr

..never heard of it before this morning. Should be interesting to look at though. Apparently it is an old HP-based software that had been shelved for twelve years and now resurrected.

ABBYY is a well-known industrial-strength app for OCR. I've never personally used them (mostly stick with Caere) but have heard great things....AND...they have an SDK for *nix and/or TheCultOfMac.

http://www.abbyy.com/sdk/?param=59956

I also saw this one...

http://www.linux-ocr.ekitap.gen.tr/

Keep in mind that we process over 3M documents/year - that comes out to roughly 15,000 every day, including weekends. We currently have eight high-speed scanners, and are evaluating whether to purchase some new Kodak i860 models at $75,000 each. I just state this so you know our volume.

-- 
kai
www.perfectreign.com || www.4thedadz.com
www.filesite.org || www.donutmonster.com

closing the doors that surround me so no one will ever penetrate complete my retreat just to wait for the day that never comes so i will laugh alone

Adam Tauno Williams

14:33

Does anybody know of a way to scan several thousands pictures on disk with an OCR application to look for a specific text, and then list the images where that text was found? I've been doing apps like that for the better part of twelve years. However, I've yet to see an OCR app in Linux. That doesn't mean they don't exist, however, because I'm stuck in a predominantly windows world at work. I know we currently either use OCR For Anydocs or Kofax Ascent. In fact, we're looking at replacing our current systems in the next few years.

There are multiple OCR solutions for Linux:

http://jocr.sourceforge.net/
http://sourceforge.net/projects/tesseract-ocr/
http://www.gnu.org/software/ocrad/ocrad.html
http://www.geocities.com/claraocr/

I've only ever used gOCR [http://jocr.sourceforge.net/]; it worked well enough to develop keyword indexes for images. It does not work well enough (and I've never seen an OCR package that does, commercial or otherwise) to turn images into readable text.
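A minimal sketch of the batch search asked about at the top of the thread: run an OCR tool over each image and list the files whose recognized text contains a given phrase. Nothing here comes from the posters. The function names are hypothetical, and the default command assumes a Tesseract build that accepts `tesseract IMAGE stdout`; any other command-line OCR tool that prints text can be substituted through the `command` tuple.

```python
import subprocess
from pathlib import Path
from typing import Callable, Iterable, List, Sequence

# Assumption: a command-line OCR tool is installed. The default is the
# Tesseract form "tesseract IMAGE stdout"; replace it with your own tool.
DEFAULT_COMMAND = ("tesseract", "{image}", "stdout")


def ocr_with_command(image: Path, command: Sequence[str] = DEFAULT_COMMAND) -> str:
    """Run an external OCR command on one image and return what it printed."""
    argv = [part.replace("{image}", str(image)) for part in command]
    result = subprocess.run(argv, capture_output=True, text=True, check=False)
    return result.stdout


def normalize(text: str) -> str:
    """Lower-case and collapse whitespace so line breaks do not hide a match."""
    return " ".join(text.lower().split())


def find_images_containing(
    images: Iterable[Path],
    needle: str,
    ocr: Callable[[Path], str] = ocr_with_command,
) -> List[Path]:
    """Return the images whose OCR text contains `needle` (case-insensitive)."""
    wanted = normalize(needle)
    return [image for image in images if wanted in normalize(ocr(image))]
```

A call might look like `find_images_containing(sorted(Path("photos").rglob("*.png")), "Copyright")`, where `photos` is a placeholder directory. Passing the OCR step in as a function keeps the search logic testable without any OCR engine installed.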

Anders Norrbring

15:08

Adam Tauno Williams wrote:

Does anybody know of a way to scan several thousands pictures on disk with an OCR application to look for a specific text, and then list the images where that text was found? I've been doing apps like that for the better part of twelve years. However, I've yet to see an OCR app in Linux. That doesn't mean they don't exist, however, because I'm stuck in a predominantly windows world at work. I know we currently either use OCR For Anydocs or Kofax Ascent. In fact, we're looking at replacing our current systems in the next few years.

There are multiple OCR solutions for Linux: http://jocr.sourceforge.net/ http://sourceforge.net/projects/tesseract-ocr/ http://www.gnu.org/software/ocrad/ocrad.html http://www.geocities.com/claraocr/

I've only ever used gOCR [http://jocr.sourceforge.net/] it worked well enough to develop keyword indexes for images. It does not work well enough (and I've never seen an OCR package that does, commercial or otherwise) to turn images into readable text.

That's just about what I'm setting out to do: I'm looking to read the photographer's name from the images. I'll take a look at gOCR. Thanks!

-- 
Anders Norrbring
Norrbring Consulting

David Brodbeck

19 Mar

05:20

Adam Tauno Williams wrote:


I've only ever used gOCR [http://jocr.sourceforge.net/] it worked well enough to develop keyword indexes for images. It does not work well enough (and I've never seen an OCR package that does, commercial or otherwise) to turn images into readable text.

The only OCR I've ever used that was worth the trouble was Xerox Textbridge. This was a few years ago. I'd dealt with a lot of really crummy OCR apps, so I was duly impressed when I gave it a page out of a magazine, in two-column format with pictures, and it produced nearly flawless text. It even identified the columns properly instead of mixing them together. I think there were two typos (OCRos?) on the whole page.

Anders Norrbring

18 Mar

15:07

Kai Ponte wrote:


On Sunday 18 March 2007 02:57:49 am Anders Norrbring wrote:

Does anybody know of a way to scan several thousands pictures on disk with an OCR application to look for a specific text, and then list the images where that text was found?

I've been doing apps like that for the better part of twelve years. However, I've yet to see an OCR app in Linux. That doesn't mean they don't exist, however, because I'm stuck in a predominantly windows world at work. I know we currently either use OCR For Anydocs or Kofax Ascent. In fact, we're looking at replacing our current systems in the next few years.

What you will need to do is probably write some program to take the imaged documents - done so with whatever scanner you've got - and then process the documents through the OCR engine supplied by the manufacturer. Typically this is a library like the AVI or MP3 libraries used by your most commonly requested SUSE applications.

Keep in mind, that you'll need to also have a retrieval program of some sort, to actually get the documents and view them - along with the OCR data - in some manner. This is one I wrote in 2003, which combined OCR from barcode and an imaging application based on FileNet: http://www.filesite.org/viewtopic.php?t=173

You didn't mention whether you're doing spot or forms recognition or full-text OCR. You might also look at barcode recognition, because those are VERY reliable, even over fax.

Try these links for the OCR software:

Google apparently has an OCR engine that is now OSS..

http://google-code-updates.blogspot.com/2006/08/announcing-tesseract-ocr.htm...

http://sourceforge.net/projects/tesseract-ocr

..never heard of it before this morning. Should be interesting to look at though. Apparently it is an old HP-based software that had been shelved for twelve years and now resurrected.

ABBYY is a well-known industrial-strength app for OCR. I've never personally used them (mostly stick with Caere) but have heard great things....AND...they have an SDK for *nix and/or TheCultOfMac.

http://www.abbyy.com/sdk/?param=59956

I also saw this one...

http://www.linux-ocr.ekitap.gen.tr/

Keep in mind that we process over 3M documents/year - that comes out to roughly 15,000 every day, including weekends. We currently have eight high-speed scanners, and are evaluating whether to purchase some new Kodak i860 models at $75,000 each. I just state this so you know our volume.

Thanks. I'm not locked to Linux for this adventure: the images are stored on a Linux system but can be accessed from Windows.

Maybe I should explain more about what I'm trying to do; it's not really a full-text scan. The images are from a dozen or so different photographers, who all put their copyright notice in text on every image. What I want to accomplish is to categorize them all according to who took the picture, in other words, sort them by photographer name. So the OCR should only read one or two words out of a maximum of 4-6 words somewhere in the image. It's also a one-time thing, so I cannot justify the license cost of a fully fledged OCR suite.

I'll take a look at the links you provided, thanks!

-- 
Anders Norrbring
Norrbring Consulting
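Since the sorting task above needs only one or two words matched against a dozen known names, imperfect OCR output may still be usable if it is compared fuzzily against that list. The sketch below illustrates the idea with Python's standard `difflib`. The function names, the example names and the 0.8 cutoff are all assumptions chosen for illustration, not anything proposed in the thread.

```python
import difflib
import re
from collections import defaultdict
from typing import Dict, List, Optional, Sequence


def _words(text: str) -> List[str]:
    """Split text into lower-case alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def best_photographer(
    ocr_text: str, photographers: Sequence[str], cutoff: float = 0.8
) -> Optional[str]:
    """Return the known name most similar to any word window in the OCR text."""
    words = _words(ocr_text)
    best_name, best_score = None, 0.0
    for name in photographers:
        target = _words(name)
        if not target:
            continue
        size = len(target)
        wanted = " ".join(target)
        # Slide a window as wide as the name over the recognized words.
        for start in range(len(words) - size + 1):
            candidate = " ".join(words[start:start + size])
            score = difflib.SequenceMatcher(None, candidate, wanted).ratio()
            if score >= cutoff and score > best_score:
                best_name, best_score = name, score
    return best_name


def group_by_photographer(
    ocr_texts: Dict[str, str], photographers: Sequence[str]
) -> Dict[str, List[str]]:
    """Map each photographer (or "unknown") to the images credited to them."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for image, text in ocr_texts.items():
        groups[best_photographer(text, photographers) or "unknown"].append(image)
    return dict(groups)
```

Images that land in the `"unknown"` group would still need a manual look, and the cutoff is a trade-off: lower values tolerate more OCR noise but risk confusing similar names.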

Kai Ponte

20:15

On Sunday 18 March 2007 08:07:40 am Anders Norrbring wrote:


Thanks, I'm not locked to Linux for this adventure, but the images are stored on a Linux system but can be accessed from Windows.

Maybe I should explain more what I'm about to do, it's not really a full text scan.. The images are from a dozen or so different photographers, who all put their copyright notice in text on every image. What I want to accomplish is to categorize them all according to who took the picture, in other words, sort them by photographer name. So, the OCR should only read one or two words out of a maximum of 4-6 words somewhere in the image. It's also a one-time thing to do, so I cannot motivate a license cost for a fully fledged OCR suite.

I'll take a look at the links you provided, thanks!

Well, I'm only half-joking when I say you might be better off just hiring a few illegal aliens and having them do data entry. OCR really only lends itself to higher-volume work that is repeated often. It has high setup costs and takes a huge amount of time to get the confidence levels to anything above 80%.

I have staff members who spend a majority of their time simply tweaking and refining OCR templates to ensure the scanned forms get above 95% accuracy. (When looking at 3M documents times 4+ fields per document, you have roughly 12,000,000 OCR fields per year. 95% accuracy means that you still have 600,000 fields incorrect, which translates to a huge labor cost.)

In any case, I wish you luck.

That all said, I just checked with SMART and found two OCR items we have available.

1. gOCR - GOCR is a free OCR (Optical Character Recognition) project that provides a library, a command line version, and an X interface. Although the program is in an early development state, the results are very impressive.

2. GNU Ocrad is an OCR (Optical Character Recognition) program implemented as a filter and based on a feature extraction method. It reads a bitmap image in PBM format and outputs text in the ISO-8859-1 (Latin-1) charset. It can be used as a stand-alone console application or as a back-end to other programs. gocr is another interesting command line OCR tool.

Both can be plugged into Kooka, the KDE scan and OCR program.

-- 
k
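The field-count estimate in the message above can be checked with a few lines of arithmetic. This is only a worked restatement of the figures quoted there (3M documents, 4 fields each, 95% accuracy); the function name is hypothetical.

```python
from typing import Tuple


def expected_field_errors(
    documents_per_year: int, fields_per_document: int, accuracy: float
) -> Tuple[int, int]:
    """Return (total OCR fields per year, fields expected to be wrong)."""
    total_fields = documents_per_year * fields_per_document
    wrong_fields = round(total_fields * (1 - accuracy))
    return total_fields, wrong_fields


# Figures from the message: 3M documents, 4 fields each, 95% accuracy.
total, wrong = expected_field_errors(3_000_000, 4, 0.95)
print(f"{total:,} fields per year, about {wrong:,} needing correction")
```

The same function shows why small accuracy gains matter at that volume: each additional percentage point removes roughly 120,000 fields from the manual correction queue.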
