Re: [SLE] PDF to TXT (ascii) ... helppppppp

29 Aug 2005

This is a reply I sent to Maura before I saw all the replies on the list 
regarding this subject.
------------------------------------------------------------------
Maura,

I could be wrong, but from looking at the file it appears the text is not font 
based. If so, the only way to extract the text is with OCR, and given the 
number of symbols used in the file that is going to be a hard task. 

I tried extracting the text using a Java-based tool called Multivalent 
(Multivalent.jar), which can be found at:-

http://multivalent.sourceforge.net/

From the text extraction command manual there is the following note:-
---------------------------------------------------------------------
Note that sometimes text can be drawn not with fonts but with vector shapes or 
in an image; to extract this, run OCR software.

see
http://multivalent.sourceforge.net/Tools/doc/ExtractText.html
----------------------------------------------------------------------

This more than likely explains why we are getting crap when we try to extract 
the text using either Acroread or KWord. I notice that when selecting a line 
of text in Acroread it will not cleanly identify the line, which suggests 
that it is graphics and not text. (Watch the cursor as you select.)
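
A rough way to check the same thing from the command line is to count font 
and image markers in the raw PDF bytes. This is only a sketch of my own: the 
function names are made up for illustration, and it assumes the PDF's object 
dictionaries are stored uncompressed. Newer PDFs can pack them into 
compressed object streams, in which case a zero count proves nothing.

```python
import re
import sys


def pdf_marker_counts(data: bytes) -> dict:
    """Count raw font and image dictionary markers in PDF bytes.

    Heuristic only: markers inside compressed object streams are not seen.
    """
    return {
        # "/Type /Font" but not "/Type /FontDescriptor" (word boundary).
        "fonts": len(re.findall(rb"/Type\s*/Font\b", data)),
        "images": len(re.findall(rb"/Subtype\s*/Image\b", data)),
    }


def looks_font_free(data: bytes) -> bool:
    """True when no font dictionaries were found in the raw bytes."""
    return pdf_marker_counts(data)["fonts"] == 0


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical file name): python check_pdf.py document.pdf
    with open(sys.argv[1], "rb") as handle:
        raw = handle.read()
    print(pdf_marker_counts(raw))
    if looks_font_free(raw):
        print("No font objects seen; the text is probably drawn as graphics.")
```

If it reports no fonts, that is consistent with the text being drawn as 
images or vector shapes, which is the case the Multivalent note describes.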

I don't know of any OCR packages on Linux good enough to extract the text 
from a PDF, so you will need to look at some of the Windows packages 
specialising in PDF OCR. Some of these may run under Wine or CrossOver.

Have you tried contacting the author or the university to see if you can get 
the document in any other format?

-- 
Regards,

Graham Smith
