On 02.11.20 at 09:51, John Pettigrew wrote:
(PICTURE) d_T j Uly, _' (lr Illy _l C_t-d Ilnl Vtr- sary c_nlumn, l urEURed comput- [snip]
Hmmm. I've not tried OCR in Linux, but from my experience of programs on other platforms (no, not Windows :-) that looks like it's caused by the input bitmap being wrong in some way. Does gocr require a specific bit depth, or resolution/font size?
I don't know.
If it was a greyscale image, was the contrast between the letters and background high enough? If 1-bit, was there any background noise?
It was color; but I thought the software was clever enough to sort that out. After all, it's just a question of finding the appropriate level to say "this is ink" or "this is paper". Maybe it's not that easy... I also tried to scan as B/W only and the result was disappointing, too many "dots". It seems I would have to find the contrast level by trial and error, but if I scan a page at high quality, the software should be able to find that threshold on its own.
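(Just to show what I mean by finding that level automatically: here is a rough, untested sketch in Python with Pillow and numpy, using Otsu's method to split ink from paper in a greyscale scan. The file name "page.png" is only an example, and I have no idea whether gocr does anything like this internally.)

import numpy as np
from PIL import Image

# Load the scan as 8-bit greyscale.
img = np.asarray(Image.open("page.png").convert("L"))

# Otsu's method: pick the threshold that best separates the two
# intensity populations (ink vs. paper) in the histogram.
hist, _ = np.histogram(img, bins=256, range=(0, 256))
total = img.size
sum_all = float(np.dot(np.arange(256), hist))
w_bg = 0
sum_bg = 0.0
best_t = 0
best_var = 0.0
for t in range(256):
    w_bg += hist[t]
    if w_bg == 0:
        continue
    w_fg = total - w_bg
    if w_fg == 0:
        break
    sum_bg += t * hist[t]
    mean_bg = sum_bg / w_bg
    mean_fg = (sum_all - sum_bg) / w_fg
    var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
    if var_between > best_var:
        best_var, best_t = var_between, t

# Everything darker than the threshold becomes ink (black), the rest paper.
binary = Image.fromarray(np.where(img > best_t, 255, 0).astype(np.uint8))
binary.save("page-bw.png")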
The thing I've found is that the bitmap you feed the OCR program needs to be as high quality as possible, and that it needs to match the resolution/font size the program expects.
That resolution should be clearly stated by the program! I think image files specify the resolution used (or the real size, from which the resolution can be calculated), so the program could give a warning if it is not appropriate. Clara says it wants 600 dpi (that's a lot).
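(As an aside, something like this rough Python/Pillow sketch is the sort of check I mean; "page.png" and the 600 dpi figure from Clara are just examples, and not every file actually stores its resolution.)

from PIL import Image

REQUIRED_DPI = 600  # what Clara says it wants

img = Image.open("page.png")
dpi = img.info.get("dpi")  # usually a pair like (600, 600), or absent
if dpi is None:
    print("The file does not say what resolution it was scanned at.")
elif min(dpi) < REQUIRED_DPI:
    print("Scanned at %s dpi; the OCR program wants %d dpi." % (dpi, REQUIRED_DPI))
else:
    print("Resolution %s dpi should be fine." % (dpi,))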
I never auto-OCR because I often get better results by checking the bitmap before feeding it to the OCR engine, and it saves wasted time when there's something wrong.
Mmmm... -- Cheers, Carlos Robinson