In a previous message, Carlos E. R. wrote:
The 02.11.20 at 09:51, John Pettigrew wrote:
Does gocr require a specific bit depth, or resolution/font size?
I don't know.
It's *really* worth finding this out! Otherwise, you suffer badly from GIGO.
I tried to scan as B/W only and the result was disappointing, too many "dots".
That's just a problem of the contrast setting when you scanned. You need to tweak the scanning controls until you get a clean scan. This will usually not change much for a given scanner (unless you have very yellow paper or a strong background colour) so saving the setting is worth it.
If I scan a page at high quality, the software should be able to find that threshold on its own.
That's it - the higher the quality of the source image, the better the OCR software will do at character recognition. This is one area where consumer OCR applications have the advantage - someone's spent time putting serious image manipulation algorithms in there. The problem with finding the threshold automatically is that it can be hard - as you saw, you can end up with many extraneous noise dots, or (conversely) with missing parts of letters. This is why paying attention to the image quality is crucial. If the required information isn't clear in the bitmap, you are crippling the OCR engine. This is where consumer OCR apps have another advantage, though - the investment in more sophisticated character recognition in poor conditions (usually coupled with huge dictionaries for context checking).
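To make the threshold problem concrete, here is a minimal sketch of one well-known way to pick a black/white threshold automatically from a greyscale scan (Otsu's method). This is only an illustration: I am not claiming that gocr or any particular OCR package does it this way, and the function names `otsu_threshold` and `binarize` are made up for the example. It assumes 8-bit grey values (0 = black, 255 = white) in a flat list.

```python
def otsu_threshold(pixels):
    """Pick a grey level t (0-255) by Otsu's method.

    Pixels with value <= t are treated as "ink", the rest as "paper".
    The chosen t maximises the between-class variance of the two groups.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark = 0      # number of pixels at or below the candidate level
    sum_dark = 0    # sum of their grey values
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue            # nothing on the dark side yet
        w_light = total - w_dark
        if w_light == 0:
            break               # everything is on the dark side
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        between_var = w_dark * w_light * (mean_dark - mean_light) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarize(pixels, t):
    """Return 1 for ink (value <= t) and 0 for paper."""
    return [1 if p <= t else 0 for p in pixels]


if __name__ == "__main__":
    # Toy "scan": 50 dark ink pixels and 200 light paper pixels.
    scan = [30] * 40 + [35] * 10 + [220] * 150 + [210] * 50
    t = otsu_threshold(scan)
    print("threshold:", t, "ink pixels:", sum(binarize(scan, t)))
```

On a clean, high-contrast scan the ink and paper values sit far apart, so almost any threshold between them works. On a grey, low-contrast scan the two groups overlap, and that is exactly where you get stray dots or broken letters.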
The thing I've found is that the bitmap you feed the OCR program needs to be as high quality as possible, and that it needs to match the resolution/font size the program expects.
That resolution should be clearly stated by the program!
Absolutely. I've not got gocr installed, so can't easily check, but if there is such a requirement, it should (as you say) be in the man page or other documentation. FWIW, most general OCR apps seem to consider 12 pt text at 300dpi to be a good starting point.
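As a rough illustration of what "12 pt at 300dpi" means in pixels, and how the resolution would need to scale for other text sizes, here is a small sketch. It assumes the usual DTP point of 1/72 inch, and it treats the nominal font size as the text height, which is only approximate (real glyphs are somewhat smaller than the nominal size). The function names and the 12 pt / 300 dpi defaults are just for the example, not something taken from gocr's documentation.

```python
POINTS_PER_INCH = 72  # assumption: DTP/PostScript point


def text_height_px(font_pt, dpi):
    """Nominal height in pixels of text of font_pt points scanned at dpi."""
    return font_pt * dpi / POINTS_PER_INCH


def dpi_for_same_height(font_pt, ref_pt=12, ref_dpi=300):
    """Scan resolution that gives font_pt text the same pixel height
    as ref_pt text scanned at ref_dpi."""
    return ref_pt * ref_dpi / font_pt


if __name__ == "__main__":
    print("12 pt at 300 dpi:", text_height_px(12, 300), "px")
    for pt in (6, 8, 10, 12, 24):
        print(pt, "pt text needs about", dpi_for_same_height(pt), "dpi")
```

So smaller print needs a proportionally higher scan resolution to give the OCR engine the same number of pixels per character, and large headline text can be scanned at a lower one.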
I think image files specify the resolution used (or the real size, from which the resolution can be calculated), so the program can give a warning if it is not appropriate.
Some image formats do contain resolution information, but for OCR it's not actually crucial. The important point is the relationship between the physical size of the text (e.g. 12pt) and the resolution (e.g. 300dpi). That is, for smaller text, you need to increase the resolution, and vice versa.

HTH,

John

-- 
John Pettigrew                        Headstrong Games
john@headstrong-games.co.uk           Fun : Strategy : Price
http://www.headstrong-games.co.uk/    Board games that won't break the bank
Valley of the Kings: ransack an ancient Egyptian tomb but beware of mummies!