On 02.11.19 at 07:53, zentara wrote:
I can't believe Clara or gocr is the best I can find on Linux; there must be something better :-?
(some ramblings)
O:-)
I think the luck your friend has with Windows OCR is due to fonts. OCR software works well if it can match the fonts in its libraries with the document fonts.
True.
It was probably a Windows-made document, so the Windows OCR worked well.
No, I used a magazine (an article from IEEE Computer, in fact). I didn't intend to make it easy: it had drop caps and figures. I kept one of the tests, and not the best one.

This is part of what I got in Windows. I used AbiWord to convert from RTF to ASCII, and pine for line wrapping; otherwise it is untouched. The original is a color TIFF file. I don't remember the name of the software, but it came with an HP scanner and plugs into M$ Word:

+++
Last July, for my first-anniver- i! sary column, I urged comput- Iing professionals to temper pride with humility ("Vanity - and Guilt, Humility and Pride," Computer, July 2001, pp. 104, 102-103). To justify the humility, I wrote that "the computing industry's blunder rate is far higher than it should be, and we must take professional responsibility for it." No one reacted! to this assertion, leaving me unsure if the silence sprang from collegial agreement or dismissive contempt. I But we must
++-

(note that the first "L" is a drop cap)

And now, gocr in Linux (I have only removed some empty lines to save space):

+++
(PICTURE) d_T j Uly, _' (lr Illy _l C_t-d Ilnl Vtr- sary c_nlumn, l urEURed comput- 1' ng prc)fessl' ()nals Co temper prl' de wl' th huml' 11' ty ( " Vanl' ty and GJUl' It, HUnll' Il' ty and Prlde, " Cr)m_Mtey, July 200 1, pp. 1 04, l 02- 1 U.3). T(_ _' us_tl' fv, the huml' Il' ty, I wrote that " the c_umputt' ng t'nduStry'S blunder rate l' s t_ar hl'gher Chan 1_t shuuld be, an d we must take professl' onal responsl' bl' 11_ty fur 1' C. " Nu one reacCed Co thl' s assertl' nn, leavl' ng me unsure 1' f Che Sl' len__C Sprang ffom Collegl'al agreC- ment or dl' sml' ssJ've contempt. BUt We mUSC relnel_ber the bl underS
++-

Well... that is very disappointing. I must say, however, that the other day I tried the first page of The Silmarillion, and gocr got it much better than the above. I had to set the scanner to grayscale, though.
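To put a number on "disappointing", one option is a character error rate: the edit distance between a hand-typed reference and the OCR output, divided by the reference length. The following is only a rough sketch, not something either OCR package provides. The function names are made up for illustration, and the reference sentence is assumed to be the correct text of the article (taken from the part the Windows program appears to have read cleanly).

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions cost 1 each."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_error_rate(reference: str, ocr_output: str) -> float:
    """Edit distance divided by reference length, after collapsing whitespace."""
    ref = " ".join(reference.split())
    out = " ".join(ocr_output.split())
    if not ref:
        raise ValueError("reference text is empty")
    return edit_distance(ref, out) / len(ref)


if __name__ == "__main__":
    # Assumed ground truth for one sentence of the scanned article.
    reference = "we must take professional responsibility for it."
    # The same sentence as produced by each program in the samples above.
    windows_ocr = "we must take professional responsibility for it."
    gocr = "we must take professl' onal responsl' bl' 11_ty fur 1' C."
    print("windows:", char_error_rate(reference, windows_ocr))
    print("gocr:   ", char_error_rate(reference, gocr))
```

A rate of 0.0 means the output matches the reference exactly; the rate can exceed 1.0 when the OCR output is much longer than the reference. Collapsing whitespace first keeps differences in line wrapping from counting as errors.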
You try to scan it on Linux and do OCR, and get bad results because Linux doesn't know about the Windows font that was used.
It should be able to learn. Clara claims to do it... And gocr complains that some database was not found.
Matching the document font to the fonts available to the OCR program is the key. So it's no surprise that Windows OCR works better than Linux OCR, since most of the documents out there were made with Windows.
Possibly. But I understand that some IEEE articles are or were made with TeX (Transactions?). I don't suppose a printing press uses Windows; I have not seen the "abort retry fail" after a blank page yet X-)

But what Linux OCR program are you talking about? Gocr is not even close. It produces plain text, while the Windows program produced an RTF file, even with font and format information: not perfect, but acceptable.

-- 
Cheers,
Carlos Robinson