Re: [SLE] PDF to TXT (ascii) ... helppppppp
This is a reply I sent to Maura before I saw all the replies on the list regarding this subject.

------------------------------------------------------------------

Maura,

I could be wrong, but from looking at the file it appears the text is not font based. If so, the only way to extract the text is with OCR, and given the number of symbols used in the file that is going to be a hard task.

I tried extracting the text using a Java-based tool called Multivalent.jar, which can be found at:

http://multivalent.sourceforge.net/

The manual for the text extraction command includes the following note:

---------------------------------------------------------------------
Note that sometimes text can be drawn not with fonts but with vector shapes or in an image; to extract this, run OCR software.

see http://multivalent.sourceforge.net/Tools/doc/ExtractText.html
---------------------------------------------------------------------

This more than likely explains why we get garbage when we try to extract the text using either Acroread or KWord. I notice that when selecting a line of text in Acroread it will not cleanly identify the line, which indicates that the content is graphics and not text. (Watch the cursor.)

I don't know of any OCR packages for Linux that are good enough to extract the text from a PDF, so you will need to look at some of the Windows packages that specialise in PDF OCR. Some of these may run under Wine or Crossover.

Have you tried contacting the author or the university to see if you can get the document in any other format?

--
Regards,
Graham Smith