Re: [SLE] PDF to TXT (ascii) ... helppppppp
Maura Edelweiss Monville wrote:
I downloaded Acroread 7.0 from the site you indicated. I still have Acroread 6.0 installed. Shall I uninstall the current version before installing the new one through YaST? To install the new version, is the following command right:
#rpm -Uvh AdobeReader_enu-7.0.0-2.i386.rpm
Yes. Uninstalling 6 is not necessary.
But wait, I'm trying out the various solutions I proposed on the file you sent, and they are NOT convincing, to say the least ...
Acrobat Reader 7, saved to *.txt > gibberish
Acrobat Reader 7, copy text and paste to OpenOffice > gibberish
Open pdf in KWord > gibberish
pdftk > useless
pdf2html > gibberish
I've just installed Acrobat 5 for Windows through CrossOver Wine, and export to txt > gibberish
I've then downloaded the Acrobat Reader professional tryout version for Windows (100 MB) and installed it on my son's computer. It opens your pdf fine, but saving as .txt has it choke on page 21.
I have used the recommended conversion methods repeatedly in the past, and they have worked fairly well, but I had never used them with a file as complex as the dreaded Bagheri file. Have you considered getting in touch with the author and requesting another format?
I'm afraid this is the only thing I can recommend, having unsuccessfully tried all the other options :-(
fx
Maura also (at my request) sent the file to me. (It weighs in at
4.9MB as a 70-page PDF [with some graphics] and when exported from
acroread as a PS file it becomes 85MB).
I have made a serious attempt to analyse the PS file. It's a
monstrous nightmare.
At the end I have a few questions for anyone who may have the
expertise to take the matter further, but it seems very likely
that all of the usual resources (such as many have suggested)
will simply fail.
The basic issue, it seems to me, is that (apart from one tiny patch
in the middle of a Figure which may have been imported from some
other application), the entire text is in bit-mapped glyphs set up
as Type 3 fonts. There is a total of 8 "user-defined" Type 3 fonts.
Each page begins with a set of font definitions. For each font, there
is an encoding vector followed by glyph definitions on the lines of
T3Defs
/FontType 3 def
/FontName /N9 def
/FontMatrix [1
...
]def
/Encoding [/a0 /a1 /a2 /a3 /a4 /a5 /a6 /a7 /a8 /a9 /a10 /a11
/a12 /a13 /a14 /a15 /a16 /a17 /a18 /a19 /a20 /a21 /a22 /a23
/a24 /a25 /a26 /a27 /a28 /a29 /a30 /a31 /a32 /a33 /a34 /a35
...
/a228 /a229 /a230 /a231 /a232 /a233 /a234 /a235 /a236 /a237
/a238 /a239 /a240 /a241 /a242 /a243 /a244 /a245 /a246 /a247
/a248 /a249 /a250 /a251 /a252 /a253 /a254 /a255 ] def
/GlyphProcs 257 dict begin
/.notdef {250 0 d0} bind def
/a0 { save
0 0 0 0 164 170 d1
164 0 0 170 0 0 cm
_op? setoverprint
BI
/Width 164 /Height 170 /BitsPerComponent 1 /Decode [0 1]
/ImageMask true
/DataSource
{{<~J,fQLzzz!!!!@n:1K=zzzz+7Od\zzzz!$C]\zzzz!!",1J,fQLzzz!!!!@n:1K=z
[lots of stuff defining a bit-map]
} _i get /_i _i 1 add 2 mod def} /_i 0 ID EI
restore } bind def
(the above for glyph "a0") followed by similar for each glyph "a1",
"a2", ... , "a255" in Font "N9". And so on for other fonts.
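The layout described above (a font name followed by one procedure per glyph) can be inventoried mechanically. What follows is only a minimal sketch: the two regular expressions are modelled on the excerpt shown here and are an assumption about how the rest of the 85MB file is laid out, and the function name glyphs_per_font is made up for illustration.

```python
import re
from collections import defaultdict

# Patterns modelled on the excerpt above; spacing in the real file may differ.
FONT_NAME = re.compile(r"/FontName\s+/(\S+)\s+def")
GLYPH_PROC = re.compile(r"^/(a\d+|\.notdef)\s*\{", re.MULTILINE)

def glyphs_per_font(ps_text):
    """Map each font name to the glyph procedure names defined after it."""
    fonts = defaultdict(list)
    # Collect matches of both patterns, then walk them in file order.
    events = [(m.start(), "font", m.group(1)) for m in FONT_NAME.finditer(ps_text)]
    events += [(m.start(), "glyph", m.group(1)) for m in GLYPH_PROC.finditer(ps_text)]
    current = None
    for _, kind, name in sorted(events):
        if kind == "font":
            current = name
        elif current is not None:
            fonts[current].append(name)
    return dict(fonts)

# Tiny stand-in for the real file, with the bitmap data left out.
sample = """/FontType 3 def
/FontName /N9 def
/GlyphProcs 257 dict begin
/.notdef {250 0 d0} bind def
/a0 { save
restore } bind def
/a1 { save
restore } bind def
"""
print(glyphs_per_font(sample))
```

Since the fonts are redefined at the start of each page, running this on the whole file would list the same glyph names once per page; collecting them into a set per font would avoid that.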
Then, when it comes to the text on the page, this is referenced as
octal codes like (excerpt from page 3):
/N9 1 Tf
(|\342;\315\360\323\223\300l\314l\302\214\352\371\277\\\304\371\
\325\326\316\\370\315\233\325\016\277;\314\233\322v\300)Tj
/N272 1 Tf
(=<)Tj
/N9 1 Tf
(\231\300m\300\233\317\036\301\251\317\371\300K\301\330\320D\312\
\313\277\265\\300|\353)Tj
2004 0 Td (\301v\325\326\312a\320)Tj
221 0 Td
(\314\303\325\206\315\242\316\275\317;\314\330\312\313\316\223\
\317\265\340)Tj
/N272 1 Tf
(><)Tj
/N9 1 Tf
(\265\302\230\314l\317;\315)Tj
831 0 Td
(\277\371\300l\325a\316\223\3174\325\313\315\242\316\223\300\233\
\317\343\315\\242\300)Tj
illustrating switching from user-defined fonts
N9->N272->N9->N272->N9
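To compare these strings with what appears on screen, it helps to turn each one into a plain list of byte codes. The sketch below follows the standard PostScript string-escape rules (up to three octal digits after a backslash, the usual single-letter escapes, and backslash-newline as a line continuation). The function name decode_ps_string is made up for illustration, and nothing here says which character each code stands for.

```python
def decode_ps_string(body):
    """Decode the inside of a PostScript (...) string into a list of byte codes."""
    simple = {"n": 10, "r": 13, "t": 9, "b": 8, "f": 12,
              "\\": 92, "(": 40, ")": 41}
    codes = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            codes.append(ord(ch))  # ordinary character stands for itself
            i += 1
            continue
        i += 1  # skip the backslash
        if i >= len(body):
            break
        nxt = body[i]
        if nxt in "01234567":
            # Up to three octal digits follow the backslash.
            j = i
            while j < len(body) and j - i < 3 and body[j] in "01234567":
                j += 1
            codes.append(int(body[i:j], 8) & 0xFF)
            i = j
        elif nxt == "\n":
            i += 1  # backslash-newline only continues the line
        elif nxt in simple:
            codes.append(simple[nxt])
            i += 1
        else:
            codes.append(ord(nxt))  # unknown escape: the backslash is dropped
            i += 1
    return codes

# One of the shorter strings from the page 3 excerpt.
print(decode_ps_string(r"\265\302\230\314l\317;\315"))
# → [181, 194, 152, 204, 108, 207, 59, 205]
```

The N272 strings decode trivially, e.g. "=<" gives [61, 60], which only says that font N272 uses codes that happen to fall in the printable range.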
Having had a good look at all this, I'm coming to the conclusion that
Maura's ambition to be able to extract ASCII text (which is what
it looks like when displayed on screen) by "select&paste" is probably
hopeless.
First of all, the encoding vector ("/a0 /a1 ...") makes no reference
to ASCII characters and the bit-map glyphs are internally defined,
so as far as acroread is concerned ASCII has nothing to do with it.
This is confirmed by the fact that when the file is displayed in
acroread, acroread is unable to locate ASCII text using the "Find"
resource. For example, the word "data" occurs many times in the
file. But searching for "data" (case-insensitive) finds only the
single instance in the Figure mentioned above.
Secondly, an attempt to select&paste into a text file and reading
out the bytes as hex, comparing these with the sequence of ASCII
characters displayed on screen, did not seem to establish a definite
correspondence between the bytes and the characters.
Indeed, when the mouse is used to select text on-screen in acroread,
it appears visually that the text is not consecutively selected but
in blocks which jump about on the page, suggesting that the blocks
of code in the PS file may correspond to blocks which are positioned
in the right places on screen but in a sort of "random" order.
Thirdly, in the absence of establishing a correspondence between
the octal codes and ASCII characters, and ditto for the encoding
vector, there is no way of knowing what corresponds to what except
by extracting the glyph definitions (all 256 of them for each font)
and rendering them from an artificial PS file. This would mean
enormous work unless there were some application which is tailor
made for the job (and I never heard of one!).
Finally, the task of re-writing (even automatically using a custom
written program which would have to avoid treading on things it
shouldn't tread on) the whole of an 85-MB PS file is itself enormous!
QUESTIONS:
However, despite all that, I'd like to appeal to any true PS/PDF
gurus out there who can recognise the above and are familiar with
the situation, for any clarification about how the problem can
be approached -- or for confirmation that it really is out of reach.
Another question I'd like to put to anyone who may have encountered
this kind of thing before is:
What software could possibly have created this in the first place???
Acroread's "Document Info" says:
Creator: Not Available
Producer: Aladdin Ghostscript TESTER RELEASE 6.23
but that's only the end of the line. Lurking in the PS version one
can find, as well as the user-defined Type 3 fonts used for the
English text, the following mentions (but no usage):
/HeiseiMin-W3-83pv-RKSJ-H _pdfFontStatus
/HeiseiMin-W3 _pdfCIDFontStatus
/HeiseiKakuGo-W5-83pv-RKSJ-H _pdfFontStatus
/HeiseiKakuGo-W5 _pdfCIDFontStatus
/HeiseiMaruGo-W4-83pv-RKSJ-H _pdfFontStatus
/HeiseiMaruGo-W4 _pdfCIDFontStatus
suggesting Asian capability. Also, Bagheri himself is of Iranian
origin, so he may have created the document using software
with capabilities adapted to that part of the world. So this may
be a clue.
This problem is very tantalising -- theoretically, it could be
solved if a suitable "key" could be spotted, and a suitable
"translation" program composed. In practice, it seems to be
wrapped in obscurity and complexity.
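If such a "key" were ever found, the "translation" step itself would be short. The sketch below is purely illustrative: the key shown is invented, it assumes one byte per glyph (which the excerpts do not confirm), and it ignores the problem that the text blocks may not be stored in reading order.

```python
def translate(runs, keys, unknown="?"):
    """Turn (font, codes) runs into text using a per-font code-to-character key."""
    out = []
    for font, codes in runs:
        table = keys.get(font, {})  # a font without a key yields only placeholders
        out.append("".join(table.get(code, unknown) for code in codes))
    return "".join(out)

# Entirely made-up key: the real mapping would have to be established
# by rendering each glyph and reading it off.
keys = {"N9": {181: "d", 194: "a", 152: "t"}}

# Runs as they would come out of the page stream: font name plus byte codes.
runs = [("N9", [181, 194, 152, 194]), ("N272", [61, 60])]
print(translate(runs, keys))
# → data??
```

The hard part remains building the key: 8 fonts of up to 256 glyphs each, with no guarantee that the same code means the same character in different fonts.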
Best wishes,
Ted.
--------------------------------------------------------------------
E-Mail: (Ted Harding)
(Ted Harding) wrote:
This problem is very tantalising -- theoretically, it could be solved if a suitable "key" could be spotted, and a suitable
Very good job. This turns out to be a very interesting thing, but too hard for me :-)
However, acroread displays a viewable result. Does zooming show pixels or not? That is, are the letters drawings (bitmaps) or scaled fonts?
At least OCR could do the job :-)
jdd
--
to write to me, go to: http://www.dodin.net http://valerie.dodin.net http://arvamip.free.fr
Yes, they are bit-maps. If (in acroread) you zoom up to the maximum
(1600%) in stages you can see the little blocks which make up the
edges growing in size until they are quite sizeable squares.
The one exception is the little bit of text in Figure E1 on page 63
of the document. As I pointed out earlier, this text is "searchable"
(e.g. the word "Data" can be found with acroread's "Find", though it
is the only one of several occurrences of the word which is found);
and, if you zoom on this particular bit of text, you will see that
the outlines remain smooth at every magnification (to within screen
resolution), showing that in this case the fonts are "outline"
fonts and not bit-maps.
Best wishes,
Ted.
--------------------------------------------------------------------
E-Mail: (Ted Harding)
(Ted Harding) wrote:
Yes, they are bit-maps.
OK, so there is no other way than OCR. Do you know of any OCR program that reads PDFs directly? If not, one may use Ghostscript to do a PDF-to-PNG conversion or, if nothing else works, take a dump of the screen.
jdd
--
to write to me, go to: http://www.dodin.net http://valerie.dodin.net http://arvamip.free.fr
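The Ghostscript route suggested here could look roughly like the sketch below, which only builds the command line and prints it. The switches used (-dNOPAUSE, -dBATCH, -dSAFER, -sDEVICE, -r, -sOutputFile) are standard Ghostscript options; the file name thesis.pdf, the output pattern, the 300 dpi resolution and the helper name gs_png_command are assumptions chosen for illustration.

```python
def gs_png_command(pdf_path, out_pattern="page-%03d.png", dpi=300, device="pnggray"):
    """Build a Ghostscript command that renders each PDF page to a PNG file."""
    return [
        "gs",
        "-dNOPAUSE", "-dBATCH", "-dSAFER",  # run unattended, then exit
        f"-sDEVICE={device}",               # greyscale PNG output device
        f"-r{dpi}",                         # rendering resolution in dpi
        f"-sOutputFile={out_pattern}",      # %03d is replaced by the page number
        pdf_path,
    ]

cmd = gs_png_command("thesis.pdf")  # hypothetical file name
print(" ".join(cmd))
```

To actually run it, pass the list to subprocess.run(cmd, check=True) on a machine with Ghostscript installed; the resulting page images would then be fed to whatever OCR program is available. Since the glyphs are already bitmaps, a resolution well above screen resolution seems a sensible starting point, but the best value would have to be found by trial.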
Hello,
In the Message: [suse-linux-e ML: No.245719] with the date of Mon, 29 Aug 2005 13:59:09 +0100 (BST) [Ted] == (Ted Harding)
has written:
[...]
Ted> /N9 1 Tf
Ted> (|\342;\315\360\323\223\300l\314l\302\214\352\371\277\\\304\371\
Ted> \325\326\316\\370\315\233\325\016\277;\314\233\322v\300)Tj
Ted> /N272 1 Tf
Ted> (=<)Tj
Ted> /N9 1 Tf
Ted> (\231\300m\300\233\317\036\301\251\317\371\300K\301\330\320D\312\
Ted> \313\277\265\\300|\353)Tj
Ted> 2004 0 Td (\301v\325\326\312a\320)Tj
Ted> 221 0 Td
Ted> (\314\303\325\206\315\242\316\275\317;\314\330\312\313\316\223\
Ted> \317\265\340)Tj
Ted> /N272 1 Tf
Ted> (><)Tj
Ted> /N9 1 Tf
Ted> (\265\302\230\314l\317;\315)Tj
Ted> 831 0 Td
Ted> (\277\371\300l\325a\316\223\3174\325\313\315\242\316\223\300\233\
Ted> \317\343\315\\242\300)Tj
[...]
Ted> /HeiseiMin-W3-83pv-RKSJ-H _pdfFontStatus
Ted> /HeiseiMin-W3 _pdfCIDFontStatus
Ted> /HeiseiKakuGo-W5-83pv-RKSJ-H _pdfFontStatus
Ted> /HeiseiKakuGo-W5 _pdfCIDFontStatus
Ted> /HeiseiMaruGo-W4-83pv-RKSJ-H _pdfFontStatus
Ted> /HeiseiMaruGo-W4 _pdfCIDFontStatus
These are Japanese fonts, and may be encoded in Shift-JIS.
IMHO, converting this into a text file is very difficult, because you must set up a Japanese font system. If Maura sends me the pdf file, I will give it a try.
Regards,
---
Masaru Nomiya mail-to: nomiyac360@mg.point.ne.jp
"No Windows, no gains!" ..... "Why, I am wrong?" -- Bill --
Hi Masaru,
That is a brave offer! However, please note that in my original
mail (which you excerpted above) I said that these fonts were
*mentioned* (as above) but *not used*. The intention in pointing
this out was in the hope that someone might be able to suggest
what software might have been used to create the original document.
There are no occurrences of Japanese characters in the document.
When the document is displayed, all text is rendered as English
characters (except for mathematical symbols). So there is probably
no point in trying to convert from Japanese to English, since there
seems to be nothing to convert.
The problem definitely lies in the use of Type 3 fonts with a
non-standard encoding vector.
Thanks for the commendable offer of help, though!
Best wishes,
Ted.
--------------------------------------------------------------------
E-Mail: (Ted Harding)
Hello,
Sorry for the misunderstanding.
In the Message: [suse-linux-e ML: No.245724] with the date of Mon, 29 Aug 2005 15:03:51 +0100 (BST) [Ted] == (Ted Harding)
has written:
Ted> The intention in pointing
Ted> this out was in the hope that someone might be able to suggest
Ted> what software might have been used to create the original document.
Did you use the tool 'pdfinfo'?
Ted> There are no occurrences of Japanese characters in the document.
Ted> When the document is displayed, all text is rendered as English
Ted> characters (except for mathematical symbols). So there is probably
Ted> no point in trying to convert from Japanese to English, since there
Ted> seems to be nothing to convert.
I see.
Ted> The problem definitely lies in the use of Type 3 fonts with a
Ted> non-standard encoding vector.
I think Type 3 fonts do not exist for the X Window System.
Regards,
---
Masaru Nomiya mail-to: nomiyac360@mg.point.ne.jp
"Bill! You married with Computers. Not with Me!" "No..., with money."
After reading Ted's brilliant analysis, suggesting a "key" that might help decode the pdf, I thought, as I had suggested to Maura, why not ask the author for the pdf file in another format?
So I decided to check for A. Bagheri on Google, and bingo, A. Bagheri has a homepage somewhere in Norway, where his (her?) thesis is published in .pdf AND HTML format!!!
Maura, here you are:
fx
(if the energy that was spent answering your query on the SuSe list could be transmuted into oil, one could have heated the whole of Belgium for a full 9 weeks ....)
http://folk.uio.no/bagheri/thesis/node1.html#SECTION00100000000000000000
Contents
* Contents
* Introduction
* Statistical Models
  o Theory
  o Extraction of Level Density and $\gamma$-ray Strength Function
* Methods
  o Experimental Setup
  o Spectra Handling
    + Particle Identification
    + Coincidence Matrix
    + Unfolding $\gamma$-Ray Spectra
      # The Response Function
      # The Folding Iteration Method
      # The Compton Subtraction Method
    + Extraction of First Generation $\gamma$ Spectra
* Experimental Results
  o Level Density
  o Radiative Strength Function
  o Thermodynamic Properties
* Summary and Conclusion
* Nuclear Level Density
  o The Fermi Gas Model
  o The Nuclear Shell Model
  o Some Empirical Level Density Formulas
* Transitions between States
  o Weisskopf estimates
  o Collective contribution to the transition
* Giant Electric Dipole Resonance
* Excitation Energy and it's Dependency of Spin and Pairing
* Useful Information
  o The Reaction
  o Software Structure
  o Nuclear Deformation and Shape
* References
* References
Preface
Nuclear physics have been one of the most fascinating branches in science, which have always been taken my attention and I would place at the top of my ``list of interests''. It concerns fundamental questions of nature and have been the main motivating element for me to take this interesting work. This work has also given me the opportunity to get some view about the most basic and complicated cornerstone of the nature (universe), namely the nucleus.
The present work is the result of an experiment on dysprosium metallic foil target with mass number 161. The experiment was performed at the Oslo Cyclotron Laboratory at the University of Oslo in the Autumn 1998 and the data analyzing started at Autumn 2001. The whole work took me about two years. I myself participated in two other experiment, experiment on 94 Mo at June 2002 and 119 Sn at February 2003.
The introduction and motivation of the work is presented in Chapter 1. Chapter 2 describes the applied statistical model for extraction of the level density and $\gamma-$ray strength function from the primary $\gamma-$rays in the continuum region. Chapter 3 explains the method and tools used to carry out the experiment and analyzing the data. A detailed discussion on the results of the experiment and comparison with the neighboring isotopes is covered in Chapter 4. Chapter 5 gives a summary and conclusions.
I would like to thank my supervisor Magne Guttormsen for his guidance, patience and for many inspiring discussions and mailing. Finally, I would like to thank my family for their support.
On 8/29/05, FX Fraipont wrote:
After reading Ted's brilliant analysis, suggesting a "key" that might help decode the pdf, I thought, as I had suggested to Maura, why not ask the author for the pdf file in another format?
So I decided to check for A. Bagheri on Google, and bingo , A. Bagheri has a homepage somewhere in Norway, where his ( her?) thesis is published in .pdf AND HTML format!!!
WOW! Now, THAT is service, and the creativity of a community of peers! How cool is that?
Peter
On 29-Aug-05 FX Fraipont wrote:
After reading Ted's brilliant analysis, suggesting a "key" that might help decode the pdf, I thought, as I had suggested to Maura, why not ask the author for the pdf file in another format?
So I decided to check for A. Bagheri on Google, and bingo , A. Bagheri has a homepage somewhere in Norway, where his ( her?) thesis is published in .pdf AND HTML format!!!
Maura, here you are:
fx
Brilliant! And the irony is that I had visited the same website myself without noticing that his thesis was there! I must stop looking through narrow tubes ... Congratulations, FX, on your profound simplicity! Brilliant!!!
(if the energy that was spent answering your query on the SuSe list could be transmuted into oil, one could have heated the whole of Belgium for a full 9 weeks ....)
Maura: would you ask a similar question in a few months' time?
It's not cold here yet ...
Best wishes to all,
Ted.
--------------------------------------------------------------------
E-Mail: (Ted Harding)
participants (5)
- FX Fraipont
- jdd sur free
- Masaru Nomiya
- Peter Van Lone
- Ted.Harding@nessie.mcc.ac.uk