Pelibali, On Sunday 28 August 2005 03:19, pelibali wrote:
Hi,
On Sun, 28 Aug 2005 00:22:46 -0400
Maura Edelweiss Monville <.> wrote:
I need to transform PDF to text (ASCII) ... How can I do that?
Just a remark. We have plenty of PDFs that could be converted to plain text only through optical character recognition (OCR)! At first we also tried to extract just the text, but the catch is that every page in these PDFs is embedded as a TIFF image. Only human eyes recognize the bla-bla as _text_; to a computer the pages remain just _images_...
ACM digitized its library this way, but they included the OCR-ed text _and_ the scanned page images in the PDF files they distribute. When you read or print the document, you see the scanned page images. When you copy or search, the OCR-ed text is used. The OCR is predictably flawed, but the scheme is about the best you can hope for with fully automated digitization of a very large library.
Pelibali
Randall Schulz
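
For what it's worth, the distinction discussed above (pages that carry a real text layer versus pages that are only scanned images) can be checked before deciding whether OCR is needed. The following is only a rough sketch under a few assumptions: it relies on the `pdftotext` command-line tool (from xpdf/poppler) being installed, on its usual behaviour of ending each page with a form feed character, and on an arbitrary threshold of 20 alphanumeric characters per page. The function names and the threshold are made up for illustration.

```python
import subprocess


def extract_text(pdf_path: str) -> str:
    """Run pdftotext and return its output (assumes pdftotext is on PATH)."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],  # "-" sends the text to stdout
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def split_pages(pdftotext_output: str) -> list[str]:
    """Split pdftotext output into pages, assuming a form feed ends each page."""
    pages = pdftotext_output.split("\f")
    # The form feed after the last page leaves one empty trailing element.
    if pages and pages[-1].strip() == "":
        pages.pop()
    return pages


def pages_needing_ocr(pages: list[str], min_chars: int = 20) -> list[int]:
    """Return 1-based numbers of pages with almost no extractable text.

    min_chars is an arbitrary, hypothetical threshold: pages with fewer
    alphanumeric characters are treated as image-only scans.
    """
    return [
        number
        for number, text in enumerate(pages, start=1)
        if sum(1 for ch in text if ch.isalnum()) < min_chars
    ]
```

With a real file this might be used as `pages_needing_ocr(split_pages(extract_text("paper.pdf")))`, where `paper.pdf` is a placeholder name. Pages reported by the check are the ones that would have to go through OCR; a PDF built the way the ACM ones are would typically report none, because the OCR-ed text is already embedded behind the page images.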