What can I use for OCR? I have tried gocr, which seems usable. Also clara, which seems horrible to me: it only accepts files in pbm format, and so far it has produced nothing. It's been working in another window for 15 minutes or more, at 90% CPU on a Pentium IV, with no text output so far. Apparently, I have got to "train it", but it took 5 minutes to learn the letter "A", and I had to wait minutes for it to redraw after each click. There is something very wrong there... I hate to say this, but a friend demonstrated to me his MS Word scanning a page, and it almost even got the format (typesetting) correct! I can't believe clara or gocr is the best I can find in Linux, there must be something better :-? -- Cheers, Carlos Robinson
On Monday 18 November 2002 07:14 pm, Carlos E. R. wrote:
What can I use for OCR?
I have tried gocr, which seems usable.
Also clara, which seems horrible to me: it only accepts files in pbm format, and so far it has produced nothing. It's been working in another window for 15 minutes or more, at 90% CPU on a Pentium IV, with no text output so far. Apparently, I have got to "train it", but it took 5 minutes to learn the letter "A", and I had to wait minutes for it to redraw after each click. There is something very wrong there...
I hate to say this, but a friend demonstrated to me his Ms-word scanning a page, and it almost even got the format (typesetting) correct!
I can't believe clara or gocr is the best I can find in linux, there must be something better :-?
------------------------------ Hmmm Carlos, Last time I tried OCR scanning, I was quite surprised at how well it did and how quickly it did the job! Don't know if the difference is the scanner (Epson) or the hardware it's connected to, but I would tend to think your experience is not the norm. I would go so far as to say that my experience equals or exceeds that of your friend's M$ scan. Don't know what to tell you here. Guess you could dual boot if Linux seems unable to satisfy your OCR scanning needs. Patrick --- KMail v1.4.3 --- SuSE Linux Pro v8.1 --- Registered Linux User #225206
On Monday 18 November 2002 20:45, Patrick wrote:
Last time I tried OCR scanning, I was quite surprised at how well it did and how quickly it did the job!
What Linux application did you use for the scanning? Thank you. *************************************************** Powered by SuSE Linux 8.0 Professional KDE 3.0.0 KMail 1.4 This is a Microsoft-free computer Bryan S. Tyson bryantyson@earthlink.net ***************************************************
The 02.11.18 at 20:45, Patrick wrote:
I can't believe clara or gocr is the best I can find in linux, there must be something better :-?
------------------------------
Hmmm Carlos, Last time I tried OCR scanning, I was quite surprised at how well it did and how quickly it did the job! Don't know if the difference is the scanner (Epson) or the hardware it's connected to, but I would tend to think your experience is not the norm. I would go so far as to say that my experience equals or exceeds that of your friend's M$ scan. Don't know what to tell you here. Guess you could dual boot if Linux seems unable to satisfy your OCR scanning needs.
My scanner is an Epson 1650, and my hardware is a Pentium IV at 1.8 GHz - that's reasonably fast, I should say - and I haven't bothered to install it in Windows. What do you use for OCR in Linux, that you are so happy?
What I saw in Windows is that Word got the page almost fully correct, including font types, sizes, and placement. I mean, not only were the words correct, but the format was also correct, or almost. It could even scan a page with photos and place them correctly as well. I think it was a plugin that came with an HP scanner - for Windows only.
Frankly, that's far from gocr.tcl, which only outputs plain text, with errors. It doesn't even allow for corrections (training), or I didn't see it. In the options it says something about a database, but if enabled it complains that it does not exist!
Clara does accept training, but so much that it outputs nothing after half an hour thinking! Even browsing the image made the program fully unresponsive, using 99% CPU, not responding to mouse clicks. Clicking on some places crashed it with segfault. Horrible. Maybe it is broken in SuSE 8.1.
And I didn't even see OCR in the OOo menu. I don't know if StarOffice has it. -- Cheers, Carlos Robinson
On Wed, 20 Nov 2002 01:16 am, Carlos E. R. wrote:
SNIPPED Clara does accept training, but so much that it outputs nothing after half an hour thinking! Even browsing the image made the program fully unresponsive, using 99% CPU, not responding to mouse clicks. Clicking on some places crashed it with segfault. Horrible. Maybe it is broken in suse 8.1
Clara is designed to scan manuscripts & books where you have mainly the one font type. It really was not designed for doing normal word processing documents or forms where you have a lot of different typefaces on the one page. It is a work in progress from what I have seen but the development has gone cold in the past 6 months. Regards, Graham Smith ---------------------------------------------------------
The 02.11.20 at 13:15, Graham Smith wrote:
SNIPPED Clara does accept training, but so much that it outputs nothing after half an hour thinking! Even browsing the image made the program fully unresponsive, using 99% CPU, not responding to mouse clicks. Clicking on some places crashed it with segfault. Horrible. Maybe it is broken in suse 8.1
Clara is designed to scan manuscripts & books where you have mainly the one font type. It really was not designed for doing normal word processing documents or forms where you have a lot of different typefaces on the one page. It is a work in progress from what I have seen but the development has gone cold in the past 6 months.
I can not even convert a small page... it almost freezes, is totally unresponsive to mouse clicks. -- Cheers, Carlos Robinson
On 19-Nov-02 Carlos E. R. wrote:
What can I use for OCR?
I have tried gocr, which seems usable.
Also clara, which seems horrible to me: it only accepts files in pbm format, and so far it has produced nothing. It's been working in another window for 15 minutes or more, at 90% CPU on a Pentium IV, with no text output so far. Apparently, I have got to "train it", but it took 5 minutes to learn the letter "A", and I had to wait minutes for it to redraw after each click. There is something very wrong there...
I hate to say this, but a friend demonstrated to me his Ms-word scanning a page, and it almost even got the format (typesetting) correct!
I can't believe clara or gocr is the best I can find in linux, there must be something better :-?
-- Cheers, Carlos Robinson
OCR has been the Cinderella application of the Linux world. Until
relatively recently there was nothing at all which worked well
enough for use in the real world. Xocr was hopeless. Gocr, as
Carlos says, is (sort of) usable, but lacks (or lacked, when I
tried it) the sort of sophistication needed for real use. I never
tried clara.
And, as Carlos says, MS Windows OCR applications really do work.
Even the OCR software given away on the CD that comes with a
scanner works well -- for some time I ran one of these on Linux
using WABI, and got very good results (the software dated from
about 1995, and came on a floppy ... ).
Now, however, there is at least one good commercial OCR package
available in a Linux version: OCRshop. See
http://www.vividata.com
In some respects this has (or had, in the oldish version I have
used) a few rough edges. Nevertheless, it is powerful, capable and
reliable, and can produce output in a variety of formats.
By the way, a file format which lends itself very well to OCR is
the TIFF format used in faxes. For some reason, character recognition
in this format is particularly reliable.
Best wishes to all,
Ted.
--------------------------------------------------------------------
E-Mail: (Ted Harding)
The 02.11.19 at 08:06, Ted.Harding@nessie.mcc.ac.uk wrote:
OCR has been the Cinderella application of the Linux world.
That's what I'm starting to think.
Until relatively recently there was nothing at all which worked well enough for use in the real world. Xocr was hopeless. Gocr, as Carlos says, is (sort of) usable, but lacks (or lacked, when I tried it) the sort of sophistication needed for real use. I never tried clara.
Either it is broken in suse 8.1, or it is that way. It used 99% CPU, and was unresponsive to the mouse. No output whatsoever.
Now, however, there is at least one good commercial OCR package available in a Linux version: OCRshop. See
Thanks, I'll take a note, but it is not justified in my case.
By the way, a file format which lends itself very well to OCR is the TIFF format used in faxes. For some reason, character recognition in this format is particularly reliable.
It seems that gocr prefers .pnm, and clara only accepts .pbm (monochrome, 1 bit, no grays .pnm file). -- Cheers, Carlos Robinson
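For readers unfamiliar with the format Carlos mentions: a .pbm file is the 1-bit member of the PNM family, where each pixel is either 1 (black) or 0 (white). The following is a minimal Python sketch that writes the plain-text ("P1") variant from rows of 0/1 values. The function name `to_plain_pbm` is made up for illustration, and nothing here is taken from clara or gocr; it only shows what such a file contains.

```python
def to_plain_pbm(rows):
    """Serialize rows of 0/1 pixels as a plain (ASCII, 'P1') PBM image.

    In PBM, 1 means black (ink) and 0 means white (paper).
    """
    height = len(rows)
    width = len(rows[0]) if rows else 0
    lines = ["P1", f"{width} {height}"]
    for row in rows:
        if len(row) != width:
            raise ValueError("all rows must have the same width")
        bits = "".join("1" if bit else "0" for bit in row)
        # The plain PBM format asks for lines of at most 70 characters;
        # whitespace between bits is optional, so long rows are simply wrapped.
        lines.extend(bits[i:i + 70] for i in range(0, len(bits), 70))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # A tiny 4x3 image: one black horizontal bar across the middle row.
    print(to_plain_pbm([[0, 0, 0, 0], [1, 1, 1, 1], [0, 0, 0, 0]]), end="")
```

Saving the returned string to a file with a .pbm extension should give an image that PNM-aware tools can read; whether a given OCR program accepts the plain variant as well as the binary ("P4") one is something to check in its documentation.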
On Tue, 19 Nov 2002 01:14:06 +0100 (CET)
"Carlos E. R." wrote:
What can I use for OCR?
I have tried gocr, which seems usable.
I hate to say this, but a friend demonstrated to me his Ms-word scanning a page, and it almost even got the format (typesetting) correct!
I can't believe clara or gocr is the best I can find in linux, there must be something better :-?
(some ramblings) I think the luck your friend has with Windows OCR is due to fonts. OCR software works well if it can match the fonts in its libraries with the document fonts. It was probably a Windows-made document, so the Windows OCR worked well. You try to scan it on Linux, and do OCR, and get bad results because Linux doesn't know about the Windows font that was used.
Matching the document font to the fonts available to the OCR program is the key. So it's no surprise that Windows OCR works better than Linux OCR, since most of the documents out there were made with Windows. Try making a document with a Linux font, and give it to your friend to OCR.
As a related aside, I was listening to a report about Microsoft's new "tablet" computer (I forget the name). It is a laptop with a screen that also acts as a "writing tablet". You use a stylus to write messages in longhand instead of typing. The drawback is that these "messages" are not "editable" because they are a graphic. The reporter said that Microsoft was working on an OCR program for handwriting, but I wouldn't count on them getting it working anytime soon. -- use Perl; #powerful programmable prestidigitation
In a previous message, zentara wrote:
As a related aside, I was listening to a report about Microsoft's new "tablet" computer (I forget the name). [snip] The drawback is that these "messages" are not "editable" because they are a graphic.
To be fair - from what I've read about it, the TabletPC does, in fact, do handwriting recognition, and pretty well, too. As you said, it would be pretty useless otherwise! It has a new version of XP that does the handwriting recognition - although, apparently, apps need to be updated to work with it fully (otherwise, you have to use a special "handwriting" area of the screen). Getting pretty OT, though! John -- John Pettigrew Headstrong Games john@headstrong-games.co.uk Fun : Strategy : Price http://www.headstrong-games.co.uk/ Board games that won't break the bank Knossos: escape the ever-changing labyrinth before the Minotaur catches you!
The 02.11.19 at 14:22, John Pettigrew wrote:
To be fair - from what I've read about it, the TabletPC does, in fact, do handwriting recognition, and pretty well, too. As you said, it would be pretty useless otherwise! It has a new version of XP that does the handwriting recognition - although, apparently, apps need to be updated to work with it fully (otherwise, you have to use a special "handwriting" area of the screen).
As a matter of fact (unless my memory misleads me), the windows 3.10 API did have support for this. It's old stuff... I never tried it, but it was there in the docs. -- Cheers, Carlos Robinson
In a previous message, Carlos E. R. wrote:
The 02.11.19 at 14:22, John Pettigrew wrote:
It has a new version of XP that does the handwriting recognition - although, apparently, apps need to be updated to work with it fully (otherwise, you have to use a special "handwriting" area of the screen).
As a matter of fact (unless my memory misleads me), the windows 3.10 API did have support for this. It's old stuff...
Hmmm. Perhaps they've changed it... I'm only quoting what I've read in the press - no inside information. John
The 02.11.19 at 07:53, zentara wrote:
I can't believe clara or gocr is the best I can find in linux, there must be something better :-?
(some ramblings)
O:-)
I think the luck your friend has with Windows OCR is due to fonts. OCR software works well if it can match the fonts in its libraries with the document fonts.
True.
It was probably a windows-made document so the windows-OCR worked well.
No, I used a magazine (an article from IEEE Computer, in fact) - I didn't intend to make it easy: it had drop caps and figures. I keep one of the tests, and not the best one.
This is part of what I got in Windows - I have used AbiWord to convert from rtf to ascii, and pine for line wrap; otherwise, untouched. The original is a color tiff file; I don't remember the name of the software, but it came with an HP scanner, and plugs into M$ Word:
+++
Last July, for my first-anniver- i! sary column, I urged comput- Iing professionals to temper pride with humility ("Vanity - and Guilt, Humility and Pride," Computer, July 2001, pp. 104, 102-103). To justify the humility, I wrote that "the computing industry's blunder rate is far higher than it should be, and we must take professional responsibility for it." No one reacted! to this assertion, leaving me unsure if the silence sprang from collegial agreement or dismissive contempt. I But we must
++-
(note that the first "L" is a drop cap)
And now, gocr in Linux (I have only removed some empty lines to save space):
+++
(PICTURE) d_T j Uly, _' (lr Illy _l C_t-d Ilnl Vtr- sary c_nlumn, l urEURed comput- 1' ng prc)fessl' ()nals Co temper prl' de wl' th huml' 11' ty ( " Vanl' ty and GJUl' It, HUnll' Il' ty and Prlde, " Cr)m_Mtey, July 200 1, pp. 1 04, l 02- 1 U.3). T(_ _' us_tl' fv, the huml' Il' ty, I wrote that " the c_umputt' ng t'nduStry'S blunder rate l' s t_ar hl'gher Chan 1_t shuuld be, an d we must take professl' onal responsl' bl' 11_ty fur 1' C. " Nu one reacCed Co thl' s assertl' nn, leavl' ng me unsure 1' f Che Sl' len__C Sprang ffom Collegl'al agreC- ment or dl' sml' ssJ've contempt. BUt We mUSC relnel_ber the bl underS
++-
Well... that is very disappointing. I must say, however, that the other day I tried the first page of The Silmarillion, and gocr got it much better than the above - I had to set the scanner for gray, though.
You try to scan it on Linux, and do OCR, and get bad results because Linux doesn't know about the Windows font that was used.
It should be able to learn. Clara claims to do it... And gocr complains about some database not found.
Matching the document font to the fonts available to the OCR program is the key. So it's no surprise that windows OCR works better than linux OCR since most of the documents out there were made with windows.
Possibly. But I understand that some IEEE articles are or were made with TeX (transactions?). I don't suppose a printing press uses Windows; I have not seen the "abort retry fail" after a blank page yet X-) But what Linux OCR program are you talking about? Gocr is not even close. It produces plain text, while the Windows program produced an rtf even with font and format information - not perfect, but acceptable. -- Cheers, Carlos Robinson
In a previous message, Carlos E. R. wrote:
And now, gocr in linux (I have only removed some empty lines to save space):
+++ (PICTURE) d_T j Uly, _' (lr Illy _l C_t-d Ilnl Vtr- sary c_nlumn, l urEURed comput- [snip]
Hmmm. I've not tried OCR in Linux, but from my experience of programs on other platforms (no, not Windows :-) that looks like it's caused by the input bitmap being wrong in some way. Does gocr require a specific bit depth, or resolution/font size? If it was a greyscale image, was the contrast between the letters and background high enough? If 1-bit, was there any background noise? The thing I've found is that the bitmap that you feed the OCR program needs to be as high quality as possible, and that it matches the specified resolution/font size of the program. I never auto-OCR because I often get better results by checking the bitmap before feeding it to the OCR engine, and it saves wasted time when there's something wrong. John
On Wednesday 20 November 2002 04:51 am, John Pettigrew wrote:
In a previous message, Carlos E. R. wrote:
And now, gocr in linux (I have only removed some empty lines to save space):
+++ (PICTURE) d_T j Uly, _' (lr Illy _l C_t-d Ilnl Vtr- sary c_nlumn, l urEURed comput-
[snip]
Hmmm. I've not tried OCR in Linux, but from my experience of programs on other platforms (no, not Windows :-) that looks like it's caused by the input bitmap being wrong in some way. Does gocr require a specific bit depth, or resolution/font size? If it was a greyscale image, was the contrast between the letters and background high enough? If 1-bit, was there any background noise?
The thing I've found is that the bitmap that you feed the OCR program needs to be as high quality as possible, and that it matches the specified resolution/font size of the program. I never auto-OCR because I often get better results by checking the bitmap before feeding it to the OCR engine, and it saves wasted time when there's something wrong.
John
-------------------------- Another good point is that when using OCR, you should scan in Binary mode, not greyscale. I have used OCR in Kooka with very good results, scanning in Binary mode; the higher the resolution, the better the character recognition seems to be. I know xsane uses gocr and I did some scans there last evening. The page looked good, although I am not experienced in xsane or gocr, so I am unsure whether I was even doing OCR scans at the time. Kooka OCR has worked well for me. It may also be using gocr, but I couldn't find anything that indicated that. Patrick
The 02.11.20 at 09:51, John Pettigrew wrote:
(PICTURE) d_T j Uly, _' (lr Illy _l C_t-d Ilnl Vtr- sary c_nlumn, l urEURed comput- [snip]
Hmmm. I've not tried OCR in Linux, but from my experience of programs on other platforms (no, not Windows :-) that looks like it's caused by the input bitmap being wrong in some way. Does gocr require a specific bit depth, or resolution/font size?
I don't know.
If it was a greyscale image, was the contrast between the letters and background high enough? If 1-bit, was there any background noise?
It was color; but I thought the software was clever enough to sort that out. After all, it's just a question of finding the appropriate level to say "this is ink" or "this is paper". Maybe it's not that easy... but I tried to scan as B/W only and the result was disappointing, too many "dots". It seems I would have to play with the contrast level by trial and error; if I scan a page at high quality, the software should be able to find that threshold on its own.
The thing I've found is that the bitmap that you feed the OCR program needs to be as high quality as possible, and that it matches the specified resolution/font size of the program.
That resolution should be clearly stated by the program! I think image files specify the resolution used (or the real size, from which resolution can be calculated), so the program can give a warning if it is not appropriate. Clara says it wants 600 dpi (that's a lot).
I never auto-OCR because I often get better results by checking the bitmap before feeding it to the OCR engine, and it saves wasted time when there's something wrong.
Mmmm... -- Cheers, Carlos Robinson
In a previous message, Carlos E. R. wrote:
The 02.11.20 at 09:51, John Pettigrew wrote:
Does gocr require a specific bit depth, or resolution/font size?
I don't know.
It's *really* worth finding this out! Otherwise, you suffer badly from GIGO.
I tried to scan as B/W only and the result was disappointing, too many "dots".
That's just a problem of the contrast setting when you scanned. You need to tweak the scanning controls until you get a clean scan. This will usually not change much for a given scanner (unless you have very yellow paper or a strong background colour) so saving the setting is worth it.
if I scan a page at high quality, the software should be able to find that threshold on its own.
That's it - the higher quality the source image, the better the OCR software will do at character recognition. This is one area that consumer OCR applications have the advantage - someone's spent time putting serious image manipulation algorithms in there. The problem with finding the threshold is that it can be hard - as you saw, you can end up with many extraneous noise dots, or (conversely) with missing parts of letters. This is why paying attention to the image quality is crucial. If the required information isn't clear in the bitmap, you are crippling the OCR engine. This is where consumer OCR apps have another advantage, though - the investment in more sophisticated character recognition in poor conditions (usually coupled with huge dictionaries for context checking).
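As an illustration of what "finding the threshold" can mean in practice, here is a minimal Python sketch of Otsu's method, a standard technique that picks the grey level which best separates a dark group of pixels from a light one. This is an assumption chosen for illustration: nothing in this thread says that gocr, clara, or any scanner plug-in uses it, and the function names are made up. It also shows why noisy or low-contrast scans are hard, since the method only works well when ink and paper form two distinct groups of grey values.

```python
def otsu_threshold(pixels):
    """Return the grey level t (0-255) that best separates dark from light.

    Pixels with value <= t are treated as ink, the rest as paper.
    `pixels` is any iterable of integers in the range 0..255.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_score = 0, -1.0
    n_dark = 0    # number of pixels at or below the candidate level
    sum_dark = 0  # sum of their grey values
    for t in range(256):
        n_dark += hist[t]
        sum_dark += t * hist[t]
        n_light = total - n_dark
        if n_dark == 0 or n_light == 0:
            continue  # one class is empty, so there is nothing to separate
        mean_dark = sum_dark / n_dark
        mean_light = (sum_all - sum_dark) / n_light
        # Between-class variance (up to a constant factor): large when the
        # two groups are both well populated and far apart in brightness.
        score = n_dark * n_light * (mean_dark - mean_light) ** 2
        if score > best_score:
            best_score, best_t = score, t
    return best_t


def binarize(pixels, threshold):
    """Map grey values to 1 (ink) or 0 (paper), following the PBM convention."""
    return [1 if p <= threshold else 0 for p in pixels]


if __name__ == "__main__":
    # Synthetic "scan": dark ink around 20-30, light paper around 220-230.
    scan = [20] * 30 + [30] * 20 + [220] * 40 + [230] * 10
    t = otsu_threshold(scan)
    print("threshold:", t, "ink pixels:", sum(binarize(scan, t)))
```

On a clean scan the chosen level falls in the gap between ink and paper. On a scan with a grey or yellowed background the two groups overlap, and any single global threshold will either leave noise dots or eat parts of letters, which matches what is described above.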
The thing I've found is that the bitmap that you feed the OCR program needs to be as high quality as possible, and that it matches the specified resolution/font size of the program.
That resolution should be clearly stated by the program!
Absolutely. I've not got gocr installed, so can't easily check, but if there is such a requirement, it should (as you say) be in the man page or other documentation. FWIW, most general OCR apps seem to consider 12 pt text at 300dpi to be a good starting point.
I think image files specify the resolution used (or the real size, from which resolution can be calculated), so the program can give a warning if it is not appropriate.
Some image formats do contain resolution information, but for OCR it's not actually crucial. The important point is the relationship between physical size of the text (e.g. 12pt) and the resolution (e.g. 300dpi). That is, for a smaller text, you need to increase the resolution, and vice versa. HTH, John
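The relationship between text size and resolution can be made concrete with a little arithmetic. A typographic point is 1/72 inch, so the nominal height of text in pixels is point size times dpi divided by 72. The Python sketch below uses the "12 pt at 300 dpi" starting point mentioned above as its reference; the function names are made up, and the point size is the nominal font size, not the measured height of any particular letter.

```python
POINTS_PER_INCH = 72  # one typographic point is 1/72 of an inch


def text_height_px(point_size, dpi):
    """Nominal height in pixels of text of a given point size at a given dpi."""
    return point_size * dpi / POINTS_PER_INCH


def dpi_for_same_detail(point_size, ref_point_size=12, ref_dpi=300):
    """Scan resolution that gives `point_size` text the same pixel height
    as the reference (by default 12 pt scanned at 300 dpi)."""
    return ref_dpi * ref_point_size / point_size


if __name__ == "__main__":
    print(text_height_px(12, 300))   # 50.0 pixels for 12 pt at 300 dpi
    print(dpi_for_same_detail(8))    # 450.0 dpi needed for 8 pt text
    print(dpi_for_same_detail(24))   # 150.0 dpi is enough for 24 pt text
```

So halving the text size means doubling the scan resolution to give the OCR engine the same amount of detail per character.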
The 02.11.22 at 08:50, John Pettigrew wrote:
Does gocr require a specific bit depth, or resolution/font size?
I don't know.
It's *really* worth finding this out! Otherwise, you suffer badly from GIGO.
What's GIGO? :-? I think I will need an "English email acronym dictionary" O:-)
I tried to scan as B/W only and the result was disappointing, too many "dots".
That's just a problem of the contrast setting when you scanned. You need to tweak the scanning controls until you get a clean scan. This will usually not change much for a given scanner (unless you have very yellow paper or a strong background colour) so saving the setting is worth it.
That's why I tried grays or color first, thinking that the ocr software would sort that out for itself. But the "clara" program wants B/W instead (.pbm format only).
if I scan a page at high quality, the software should be able to find that threshold on its own.
That's it - the higher quality the source image, the better the OCR software will do at character recognition. This is one area that consumer OCR applications have the advantage - someone's spent time putting serious image manipulation algorithms in there.
I'm afraid that must be the problem.
specified resolution/font size of the program.
That resolution should be clearly stated by the program!
Absolutely. I've not got gocr installed, so can't easily check, but if there is such a requirement, it should (as you say) be in the man page or other documentation.
I think it should be somewhere on the program menu itself.
FWIW, most general OCR apps seem to consider 12 pt text at 300dpi to be a good starting point.
Yes, that's the resolution I tried.
I think image files specify the resolution used (or the real size, from which resolution can be calculated), so the program can give a warning if it is not appropriate.
Some image formats do contain resolution information, but for OCR it's not actually crucial. The important point is the relationship between physical size of the text (e.g. 12pt) and the resolution (e.g. 300dpi). That is, for a smaller text, you need to increase the resolution, and vice versa.
That's understandable :-) -- Cheers, Carlos Robinson
On Friday 22 November 2002 15.17, Carlos E. R. wrote:
The 02.11.22 at 08:50, John Pettigrew wrote:
Does gocr require a specific bit depth, or resolution/font size?
I don't know.
It's *really* worth finding this out! Otherwise, you suffer badly from GIGO.
What's GIGO? :-?
I think I will need an "English email acronym dictionary" O:-)
Use "dict", it's an insanely useful tool on the net
# dict gigo
From Jargon File (4.3.0, 30 APR 2001) [jargon]:
GIGO /gi:'goh/ [acronym] 1. `Garbage In, Garbage Out' -- usually said in response to {luser}s who complain that a program didn't "do the right thing" when given imperfect input or otherwise mistreated in some way. Also commonly used to describe failures in human decision making due to faulty, incomplete, or imprecise data. 2. `Garbage In, Gospel Out': this more recent expansion is a sardonic comment on the tendency human beings have to put excessive trust in `computerized' data.
The 02.11.22 at 19:04, Anders Johansson wrote:
What's GIGO? :-?
I think I will need an "English email acronym dictionary" O:-)
Use "dict", it's an insanely useful tool on the net
Ah, but I'm usually offline when reading emails :-)
From Jargon File (4.3.0, 30 APR 2001) [jargon]:
Now I understand - I'll put reading gocr docs into my input queue O:-) -- Cheers, Carlos Robinson
Hi Carlos, GIGO stands for, "Garbage In, Garbage Out". I've often used, "Garbage In, Gospel Out", to explain the actions of PHBs. Regards, Lew Wolfgang On Fri, 22 Nov 2002, Carlos E. R. wrote:
The 02.11.22 at 08:50, John Pettigrew wrote:
Does gocr require a specific bit depth, or resolution/font size?
I don't know.
It's *really* worth finding this out! Otherwise, you suffer badly from GIGO.
What's GIGO? :-?
I think I will need an "English email acronym dictionary" O:-)
participants (9)
- Anders Johansson
- Bryan Tyson
- Carlos E. R.
- Graham Smith
- John Pettigrew
- Lewie Wolfgang
- Patrick
- Ted.Harding@nessie.mcc.ac.uk
- zentara