Mailinglist Archive: opensuse (3351 mails)
Re: [opensuse] OCR and batches
- From: Kai Ponte <kai@xxxxxxxxxxxxxxxx>
- Date: Sun, 18 Mar 2007 13:15:00 -0700
- Message-id: <200703181315.00690.kai@xxxxxxxxxxxxxxxx>
On Sunday 18 March 2007 08:07:40 am Anders Norrbring wrote:
>
> Thanks, I'm not locked to Linux for this adventure, but the images are
> stored on a Linux system but can be accessed from Windows.
>
> Maybe I should explain more what I'm about to do, it's not really a full
> text scan..
> The images are from a dozen or so different photographers, who all put
> their copyright notice in text on every image. What I want to accomplish
> is to categorize them all according to who took the picture, in other
> words, sort them by photographer name. So, the OCR should only read one
> or two words out of a maximum of 4-6 words somewhere in the image.
> It's also a one-time thing to do, so I cannot motivate a license cost
> for a fully fledged OCR suite.
>
> I'll take a look at the links you provided, thanks!
Well, I'm only half-joking when I say you might be better off just hiring a
few illegal aliens and having them do data entry.
OCR really only lends itself to higher volume work that is repeated often. It
has high setup costs and takes a huge amount of time to get the confidence
levels to anything above 80%. I have staff members who spend the majority of
their time simply tweaking and refining OCR templates to ensure the scanned
forms get above 95% accuracy. (When looking at 3M documents times 4+ fields
per document, you have roughly 12,000,000 OCR fields per year. 95% accuracy
means that you still have 600,000 fields incorrect, which translates to a
huge labor cost.)
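To make that arithmetic concrete, here is a small Python sketch of the same calculation. The function name and parameters are only for illustration; the figures are the ones quoted above (3M documents, 4 fields each, 95% accuracy).

```python
def incorrect_fields(documents, fields_per_document, accuracy):
    """Return (total OCR fields, fields expected to be read incorrectly)."""
    total = documents * fields_per_document
    # Whatever is not read correctly has to be fixed by hand.
    wrong = round(total * (1 - accuracy))
    return total, wrong


total, wrong = incorrect_fields(3_000_000, 4, 0.95)
print(f"{total:,} fields per year, {wrong:,} of them incorrect")
```

Raising the accuracy argument shows how slowly the manual workload shrinks: even at 99% there would still be 120,000 fields to correct.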
In any case, I wish you luck. That all said, I just checked with SMART and
found two OCR items we have available.
1. gOCR - GOCR is a free OCR (Optical Character Recognition) project that
provides a library, a command line version, and an X interface. Although the
program is in an early development state, the results are very impressive.
2. GNU Ocrad is an OCR (Optical Character Recognition) program implemented
as a filter and based on a feature extraction method. It reads a bitmap
image in PBM format and outputs text in the ISO-8859-1 (Latin-1) charset.
It can be used as a stand-alone console application or as a back-end to
other programs.
gocr is another interesting command line OCR tool. Both can be plugged into
Kooka, the KDE scan and OCR program.
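For the sorting job described in the quoted message, one possible approach is to run a command line OCR tool over each image and fuzzy-match whatever comes back against the short list of known photographers, so that a few misread characters do not matter. The Python below is only a rough sketch under several assumptions: the images have already been converted to PBM/PNM, ocrad is on the PATH and is called with just the file name, and the photographer names and the 0.7 cutoff are made-up placeholders.

```python
import difflib
import subprocess

# Hypothetical names; replace with the real dozen or so photographers.
PHOTOGRAPHERS = ["Anna Berg", "Lars Holm", "Maria Lind"]


def run_ocrad(pnm_path):
    """Return the text ocrad prints for one PBM/PNM image (assumes ocrad is installed)."""
    result = subprocess.run(["ocrad", pnm_path],
                            capture_output=True, text=True, check=True)
    return result.stdout


def guess_photographer(ocr_text, names=PHOTOGRAPHERS, cutoff=0.7):
    """Fuzzy-match OCR output against known names; None means 'check by hand'."""
    words = ocr_text.lower().split()
    best_name, best_score = None, cutoff
    for name in names:
        n = len(name.split())
        # Slide a window the size of the name over the recognized words.
        for i in range(max(len(words) - n + 1, 1)):
            window = " ".join(words[i:i + n])
            score = difflib.SequenceMatcher(None, name.lower(), window).ratio()
            if score >= best_score:
                best_name, best_score = name, score
    return best_name


# OCR commonly confuses 'o' and '0'; the fuzzy match tolerates that.
print(guess_photographer("(c) 2006 Lars H0lm all rights reserved"))
```

Anything that comes back as None would go into a pile for manual sorting, which for a one-time job may still be less work than tuning the OCR itself.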
--
k
--
To unsubscribe, e-mail: opensuse+unsubscribe@xxxxxxxxxxxx
For additional commands, e-mail: opensuse+help@xxxxxxxxxxxx