On 29-Aug-05 FX Fraipont wrote:
Maura Edelweiss Monville wrote:
I downloaded Acroread 7.0 from the site you indicated. I still have Acroread 6.0 installed. Shall I uninstall the current version before installing the new one through YaST? To install the new version, is the following command right:
#rpm -Uvh AdobeReader_enu-7.0.0-2.i386.rpm
yes. Uninstalling 6 is not necessary.
But wait, I'm trying out the various solutions I proposed on the file you sent, and they are NOT convincing to say the least ...
Acrobat Reader 7, save as *.txt > gibberish
Acrobat Reader 7, copy text, paste into OpenOffice > gibberish
Open the PDF in KWord > gibberish
pdftk > useless
pdf2html > gibberish
I've just installed Acrobat 5 for Windows through CrossOver Wine; export to txt > gibberish.
I then downloaded the Acrobat Professional tryout version for Windows (100 MB) and installed it on my son's computer. It opens your PDF fine, but saving as .txt makes it choke on page 21.
I have used the recommended conversion methods repeatedly in the past, and they have worked fairly well, but I had never used them on a file as complex as the dreaded Bagheri file. Have you considered getting in touch with the author and requesting another format?
I'm afraid this is the only thing I can recommend, having unsuccessfully tried all the other options :-(
fx
Maura also (at my request) sent the file to me. (It weighs in at
4.9MB as a 70-page PDF [with some graphics] and when exported from
acroread as a PS file it becomes 85MB).
I have made a serious attempt to analyse the PS file. It's a
monstrous nightmare.
At the end I have a few questions for anyone who may have the
expertise to take the matter further, but it seems very likely
that all of the usual resources (such as many have suggested)
will simply fail.
The basic issue, it seems to me, is that (apart from one tiny patch
in the middle of a Figure which may have been imported from some
other application), the entire text is in bit-mapped glyphs set up
as Type 3 fonts. There are a total of eight "user-defined" Type 3 fonts.
Each page begins with a set of font definitions. For each font, there
is an encoding vector followed by glyph definitions on the lines of
T3Defs
/FontType 3 def
/FontName /N9 def
/FontMatrix [1
...
]def
/Encoding [/a0 /a1 /a2 /a3 /a4 /a5 /a6 /a7 /a8 /a9 /a10 /a11
/a12 /a13 /a14 /a15 /a16 /a17 /a18 /a19 /a20 /a21 /a22 /a23
/a24 /a25 /a26 /a27 /a28 /a29 /a30 /a31 /a32 /a33 /a34 /a35
...
/a228 /a229 /a230 /a231 /a232 /a233 /a234 /a235 /a236 /a237
/a238 /a239 /a240 /a241 /a242 /a243 /a244 /a245 /a246 /a247
/a248 /a249 /a250 /a251 /a252 /a253 /a254 /a255 ] def
/GlyphProcs 257 dict begin
/.notdef {250 0 d0} bind def
/a0 { save
0 0 0 0 164 170 d1
164 0 0 170 0 0 cm
_op? setoverprint
BI
/Width 164 /Height 170 /BitsPerComponent 1 /Decode [0 1]
/ImageMask true
/DataSource
{{<~J,fQLzzz!!!!@n:1K=zzzz+7Od\zzzz!$C]\zzzz!!",1J,fQLzzz!!!!@n:1K=z
[lots of stuff defining a bit-map]
} _i get /_i _i 1 add 2 mod def} /_i 0 ID EI
restore } bind def
(the above for glyph "a0") followed by similar for each glyph "a1",
"a2", ... , "a255" in Font "N9". And so on for other fonts.
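A first sanity check on the 85-MB dump, before any serious parsing, is
just to tally how often each font definition recurs; since the
definitions are repeated at the start of every page, each /Nxx name
should show up many times. A few lines of Python do it (the /Nxx
naming and the "/FontName ... def" layout are taken from the excerpts
above; this is a sketch, not tested against the actual file):

```python
import re
from collections import Counter

def count_type3_fonts(ps_text: str) -> Counter:
    """Tally each user-defined /FontName (e.g. /N9) appearing in the
    PS dump. Definitions repeat on every page, so counts >> 1 are
    expected; the number of distinct keys is the number of fonts."""
    return Counter(re.findall(r"/FontName\s+(/N\d+)\s+def", ps_text))
```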
Then, when it comes to the text on the page, this is referenced as
octal codes like (excerpt from page 3):
/N9 1 Tf
(|\342;\315\360\323\223\300l\314l\302\214\352\371\277\\\304\371\
\325\326\316\\370\315\233\325\016\277;\314\233\322v\300)Tj
/N272 1 Tf
(=<)Tj
/N9 1 Tf
(\231\300m\300\233\317\036\301\251\317\371\300K\301\330\320D\312\
\313\277\265\\300|\353)Tj
2004 0 Td (\301v\325\326\312a\320)Tj
221 0 Td
(\314\303\325\206\315\242\316\275\317;\314\330\312\313\316\223\
\317\265\340)Tj
/N272 1 Tf
(><)Tj
/N9 1 Tf
(\265\302\230\314l\317;\315)Tj
831 0 Td
(\277\371\300l\325a\316\223\3174\325\313\315\242\316\223\300\233\
\317\343\315\\242\300)Tj
illustrating switching from user-defined fonts
N9->N272->N9->N272->N9
Having had a good look at all this, I'm coming to the conclusion that
Maura's ambition to be able to extract ASCII text (which is what
it looks like when displayed on screen) by "select&paste" is probably
hopeless.
First of all, the encoding vector ("/a0 /a1 ...") makes no reference
to ASCII characters and the bit-mapped glyphs are internally defined,
so as far as acroread is concerned ASCII has nothing to do with it.
This is confirmed by the fact that when the file is displayed in
acroread, acroread is unable to locate ASCII text using the "Find"
resource. For example, the word "data" occurs many times in the
file. But searching for "data" (case-insensitive) finds only the
single instance in the Figure mentioned above.
Secondly, selecting & pasting into a text file, reading out the bytes
as hex, and comparing these with the sequence of ASCII characters
displayed on screen did not seem to establish any definite
correspondence between the bytes and the characters.
Indeed, when the mouse is used to select text on-screen in acroread,
it appears visually that the text is not consecutively selected but
in blocks which jump about on the page, suggesting that the blocks
of code in the PS file may correspond to blocks which are positioned
in the right places on screen but in a sort of "random" order.
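That byte-level comparison is easy to reproduce; any hex-dump tool
(xxd, or od -x) on the pasted file does the job, or in Python:

```python
def hexdump(data: bytes) -> str:
    """Classic offset / hex / printable-ASCII dump, 16 bytes per row,
    for eyeballing pasted bytes against the on-screen text."""
    rows = []
    for off in range(0, len(data), 16):
        chunk = data[off:off + 16]
        hexpart = " ".join(f"{b:02x}" for b in chunk)
        text = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        rows.append(f"{off:08x}  {hexpart:<47}  {text}")
    return "\n".join(rows)
```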
Thirdly, in the absence of establishing a correspondence between
the octal codes and ASCII characters, and ditto for the encoding
vector, there is no way of knowing what corresponds to what except
by extracting the glyph definitions (all 256 of them for each font)
and rendering them from an artificial PS file. This would mean
enormous work unless there were some application which is tailor
made for the job (and I never heard of one!).
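The glyph-rendering step, at least, can be mechanised: lift one
font's definition block out of the PS file and wrap it in a throwaway
page that stamps every glyph code in a grid, then rasterise that page
(with Ghostscript, say) and identify the glyphs by eye or by OCR. A
sketch of the page generator follows; it assumes the copied
definitions register the font under its /Nxx name so that findfont
works, and every name in it is an assumption, not something taken
from the actual file:

```python
def oct_escape(code: int) -> str:
    """Render a byte value as a PS octal escape, e.g. 226 -> \342."""
    return "\\%03o" % code

def glyph_probe_ps(font_defs: str, font_name: str,
                   n_glyphs: int = 256) -> str:
    """Assemble a throwaway PS page stamping every glyph of one
    user-defined Type 3 font in a 16x16 grid. `font_defs` is the
    definition block lifted verbatim from the original file; it is
    assumed to register the font as /<font_name>."""
    lines = ["%!PS", font_defs,
             "/%s findfont 24 scalefont setfont" % font_name]
    for code in range(n_glyphs):
        x = 40 + (code % 16) * 34          # 16 columns across the page
        y = 760 - (code // 16) * 45        # 16 rows down the page
        lines.append("%d %d moveto (%s) show" % (x, y, oct_escape(code)))
    lines.append("showpage")
    return "\n".join(lines)
```

Doing that eight times (once per font) and tabulating the results
would give the code-to-glyph "key" discussed below, though matching
bitmaps to characters would still be tedious without OCR.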
Finally, the task of re-writing (even automatically using a custom
written program which would have to avoid treading on things it
shouldn't tread on) the whole of an 85-MB PS file is itself enormous!
QUESTIONS:
However, despite all that, I'd like to appeal to any true PS/PDF
gurus out there who can recognise the above and are familiar with
the situation, for any clarification about how the problem can
be approached -- or for confirmation that it really is out of reach.
Another question I'd like to put to anyone who may have encountered
this kind of thing before is:
What software could possibly have created this in the first place???
Acroread's "Document Info" says:
Creator: Not Available
Producer: Aladdin Ghostscript TESTER RELEASE 6.23
but that's only the end of the line. Lurking in the PS version one
can find, as well as the user-defined Type 3 fonts used for the
English text, the following mentions (but no usage):
/HeiseiMin-W3-83pv-RKSJ-H _pdfFontStatus
/HeiseiMin-W3 _pdfCIDFontStatus
/HeiseiKakuGo-W5-83pv-RKSJ-H _pdfFontStatus
/HeiseiKakuGo-W5 _pdfCIDFontStatus
/HeiseiMaruGo-W4-83pv-RKSJ-H _pdfFontStatus
/HeiseiMaruGo-W4 _pdfCIDFontStatus
suggesting Asian-language capability. Also, Bagheri himself is of
Iranian origin, so he may have created the document using software
with capabilities adapted to that part of the world. So this may
be a clue.
This problem is very tantalising -- theoretically, it could be
solved if a suitable "key" could be spotted, and a suitable
"translation" program composed. In practice, it seems to be
wrapped in obscurity and complexity.
Best wishes,
Ted.
--------------------------------------------------------------------
E-Mail: (Ted Harding)