I need to transform PDF to text (ascii) ... How can I do that? I opened a pdf file with acroread, selected all (just the 1st page) and clicked on Copy in the Edit drop-down menu. Then I tried to paste into an OpenOffice or a StarOffice text document, but in both cases the message was: "the requested clipboard is not available" and nothing got pasted, even by selecting an HTML document ... I found the tool "pdf2ps" but it's not very useful as I need to transform PDF to text (ascii) ... Thank you for any suggestion. Maura
On Sun, 28 Aug 2005 14:22, Maura Edelweiss Monville wrote:
I need to transform PDF to text (ascii) ... How can I do that ?
What you need to use is kword, it will import pdf files and allow you to edit the text. -- Regards, Graham Smith
Graham Smith wrote:
On Sun, 28 Aug 2005 14:22, Maura Edelweiss Monville wrote:
I need to transform PDF to text (ascii) ... How can I do that ?
What you need to use is kword, it will import pdf files and allow you to edit the text.
kword allows me to import the PDF file but it is then displayed in some alphabet that looks like Egyptian or Chinese to me. I tried to change the font after selecting some portion of the file but the result is still unreadable. Even a tool that translates from PDF to HTML would still work for me ... but I do not know where to find it. Thank you. Best regards, Maura
On Sun August 28 2005 6:53 am, Maura Edelweiss Monville wrote:
Even a tool that translates from PDF to HTML would still work for me ... but I do not know where to find it.
http://linuxcommand.org/man_pages/pdftotext1.html
NAME
pdftotext - Portable Document Format (PDF) to text converter (version 3.00)
DESCRIPTION
Pdftotext converts Portable Document Format (PDF) files to plain text.
here is the RPM to install it: http://rpmfind.net/linux/RPM/suse/9.1/i386/suse/i586/xpdf-3.00-61.i586.html
--
Paul Cartwright
Registered Linux user # 367800
X-Request-PGP: http://home.comcast.net/~p.cartwright/wsb/key.asc
Paul Cartwright wrote:
On Sun August 28 2005 6:53 am, Maura Edelweiss Monville wrote:
Even a tool that translates from PDF to HTML would still work for me ... but I do not know where to find it.
http://linuxcommand.org/man_pages/pdftotext1.html
NAME
pdftotext - Portable Document Format (PDF) to text converter (version 3.00)
DESCRIPTION
Pdftotext converts Portable Document Format (PDF) files to plain text.
here is the RPM to install it: http://rpmfind.net/linux/RPM/suse/9.1/i386/suse/i586/xpdf-3.00-61.i586.html
The above website is to download xpdf. xpdf comes with the SuSE distribution kit. I installed it from there but it does not perform the translation to txt. maura
Paul Cartwright wrote:
On Sun August 28 2005 6:53 am, Maura Edelweiss Monville wrote:
Even a tool that translates from PDF to HTML would still work for me ... but I do not know where to find it.
http://linuxcommand.org/man_pages/pdftotext1.html
NAME
pdftotext - Portable Document Format (PDF) to text converter (version 3.00)
DESCRIPTION
Pdftotext converts Portable Document Format (PDF) files to plain text.
here is the RPM to install it: http://rpmfind.net/linux/RPM/suse/9.1/i386/suse/i586/xpdf-3.00-61.i586.html
I found the tool "pdftotext" but the resultant file is unreadable. Maura
Hello,
In the Message: [suse-linux-e ML: No.245641] with the date of Sun, 28 Aug 2005 07:58:11 -0400 [Maura] == Maura Edelweiss Monville
has written:
Maura> I found the tool "pdftotext" but the resultant file is unreadable.
How did you use it?
I did as follows;
# pdftotext -enc EUC-JP article9.pdf article9.txt
and I've got a Japanese text file.
---
Masaru Nomiya mail-to: nomiyac360@mg.point.ne.jp
"Bill! You married with Computers. Not with Me!" "No..., with money."
Masaru Nomiya wrote:
Hello,
In the Message: [suse-linux-e ML: No.245641] with the date of Sun, 28 Aug 2005 07:58:11 -0400 [Maura] == Maura Edelweiss Monville
has written: Maura> I found the tool "pdftotext" but the resultant file is unreadable.
How did you use it?
I did as follows;
# pdftotext -enc EUC-JP article9.pdf article9.txt
and I've got a Japanese text file.
--- Masaru Nomiya mail-to: nomiyac360@mg.point.ne.jp
"Bill! You married with Computers. Not with Me!" "No..., with money."
I did read the man pages for the pdftotext command. But there is no
explanation about what to enter for the critical parameter:
-enc <string> : output text encoding name
What are the possible <string> choices?
What would be a sensible choice to get just a plain readable English
text?
Thank you,
Hello,
In the Message: [suse-linux-e ML: No.245666] with the date of Sun, 28 Aug 2005 13:27:42 -0400 [Maura] == Maura Edelweiss Monville
has written:
Maura> -enc <string> : output text encoding name
Maura> What are the possible <string> choices ????
<string> means charset. In Japanese, there exist 3 charsets, EUC-JP, Shift-JIS, and ISO2022-JP.
Maura> What would be a sensible choice to get just a plain readable English
Maura> text ?
I downloaded an English pdf file, and just executed
# pdftotext profile.pdf
then I got a plain text file.
I also did
# pdftotext -enc UTF-8 profile.pdf
this gave me the same result. What's the matter, I wonder?
Could you show the result of the below operation;
# pdfinfo foo.pdf
and
# pdffonts foo.pdf
Regards,
---
Masaru Nomiya mail-to: nomiyac360@mg.point.ne.jp
"No Windows, no gains!" ... "Why, I am wrong?" -- Bill --
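As a side note on the -enc question, here is a minimal Python sketch of the kind of invocation shown above. It is not from the thread: the helper names are hypothetical, and the list of encoding names is what the xpdf documentation (xpdfrc(5)) gives as built in, assumed here to match the installed build. Charsets outside that list, such as EUC-JP, are assumed to rely on a unicodeMap entry in xpdfrc.

```python
import shutil
import subprocess

# Encoding names documented as built into xpdf's pdftotext. Treat this list
# as an assumption and check xpdfrc(5) on your own system.
BUILTIN_ENCODINGS = ("Latin1", "ASCII7", "Symbol", "ZapfDingbats", "UTF-8", "UCS-2")


def is_builtin_encoding(name):
    """Encoding names are case-sensitive, so compare them exactly."""
    return name in BUILTIN_ENCODINGS


def pdftotext_command(pdf_path, txt_path, encoding="ASCII7"):
    """Build the argument list for: pdftotext -enc <encoding> in.pdf out.txt"""
    return ["pdftotext", "-enc", encoding, pdf_path, txt_path]


def convert(pdf_path, txt_path, encoding="ASCII7"):
    """Run pdftotext if it is on PATH; return True only if it exits with 0."""
    if shutil.which("pdftotext") is None:
        return False  # not installed (the thread found it in the xpdf package)
    result = subprocess.run(pdftotext_command(pdf_path, txt_path, encoding))
    return result.returncode == 0
```

For plain English text, `convert("profile.pdf", "profile.txt")` would ask for 7-bit ASCII output; passing `encoding="UTF-8"` mirrors the second command above.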
Masaru, Maura, On Sunday 28 August 2005 19:19, Masaru Nomiya wrote:
Hello,
...
Maura> What would be a sensible choice to get just a plain readable
Maura> English text ?
I downloaded an English pdf file, and just executed
# pdftotext profile.pdf
then I got a plain text file.
I frequently encounter the symptoms Maura describes. As I said, I've never taken the time to investigate what the issue is.
I also did
# pdftotext -enc UTF-8 profile.pdf
this gave me the same result. What's the matter, I wonder?
Could you show the result of the below operation;
# pdfinfo foo.pdf
and
# pdffonts foo.pdf
I tried these commands on a couple of PDF files, but nothing in their output is indicative of the encoding used for the text in the file. On the other hand, if you use the Document Properties command in the File menu of Adobe Reader 7 and view the Fonts tab, you can see encodings. Of the few files I looked at, most of the encodings were listed as "Ansi". Another I saw as "Identity-H".
Regards,
--- Masaru Nomiya
Randall Schulz
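One thing pdffonts may still help with: its man page documents "emb", "sub" and "uni" columns, the last one saying whether a font carries a ToUnicode map. The sketch below is a heuristic under assumptions, not a diagnosis from the thread. It flags embedded subset fonts without a ToUnicode map as likely sources of unreadable extraction. The function name and the sample listing are invented for illustration, and the column layout is assumed to match the installed pdffonts.

```python
# Made-up pdffonts-style listing, used only to exercise the parser.
SAMPLE = """\
name                                 type         emb sub uni object ID
------------------------------------ ------------ --- --- --- ---------
ABCDEF+Times-Roman                   Type 1       yes yes no       8  0
Helvetica                            Type 1       no  no  no      12  0
GHIJKL+Arial                         CID TrueType yes yes yes     15  0
"""


def suspect_fonts(pdffonts_output):
    """Return names of embedded subset fonts that have no ToUnicode map."""
    suspects = []
    for line in pdffonts_output.splitlines():
        fields = line.split()
        # Count from the right, because the type column may contain spaces:
        # ..., emb, sub, uni, object number, generation number.
        if len(fields) < 6 or fields[-3] not in ("yes", "no"):
            continue  # header row, dashed rule, or an unexpected line
        emb, sub, uni = fields[-5], fields[-4], fields[-3]
        if emb == "yes" and sub == "yes" and uni == "no":
            suspects.append(fields[0])
    return suspects


print(suspect_fonts(SAMPLE))
```

A font name containing spaces would be truncated to its first word by this parser; that limitation is accepted here to keep the sketch short.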
this is developing into a hyperthread :-) .... I've just noticed that Acrobat reader 7, whose virtues I was extolling yesterday, has a save as text menu option, which converts the pdf into a text file. This should do for what you requested, Maura. FX
FX Fraipont wrote:
this is developing into a hyperthread :-) ....
I've just noticed that Acrobat reader 7, whose virtues I was extolling yesterday, has a save as text menu option, which converts the pdf into a text file. This should do for what you requested, Maura.
FX
as I said two days ago :-)
jdd
--
to write to me, go to: http://www.dodin.net http://valerie.dodin.net http://arvamip.free.fr
Maura, On Sunday 28 August 2005 03:53, Maura Edelweiss Monville wrote:
...
kword allows me to import the PDF file but it is then displayed in some alphabet that looks like Egyptian or Chinese to me. I tried to change the font after selecting some portion of the file but the result is still unreadable.
I've seen this symptom with certain PDF files even when copying text from Adobe Reader directly to the clipboard. I've never fully diagnosed it, but it seems like it's some kind of encoding mix-up.
Even a tool that translates from PDF to HTML would still work for me ... but I do not know where to find it.
Try this:
- Open a shell window.
- type "pdf" followed immediately by CTRL-ALT-8 (or, more to the point, CTRL-ALT-*)
You'll see a list of all the commands in your PATH that begin with "pdf". On my system, that's:
% ll $( type -pf pdf2dsc pdf2ps pdfcrop pdfcslatex pdfcsplain pdfetex pdffonts pdfimages pdfinfo pdfjadetex pdflatex pdfopt pdftex pdftk pdftohtml pdftoppm pdftops pdftotext pdfxmltex pdfxtex )
-rwxr-xr-x 1 root root 518 2005-05-04 14:19 /usr/bin/pdf2dsc*
-rwxr-xr-x 1 root root 723 2005-05-04 14:19 /usr/bin/pdf2ps*
lrwxrwxrwx 1 root root 43 2005-04-17 10:48 /usr/bin/pdfcrop -> ../share/texmf/teTeX/bin/i586-linux/pdfcrop*
-rwxr-xr-x 1 root root 624 2005-04-20 06:35 /usr/bin/pdfcslatex*
-rwxr-xr-x 1 root root 624 2005-04-20 06:35 /usr/bin/pdfcsplain*
lrwxrwxrwx 1 root root 45 2005-04-20 06:35 /usr/bin/pdfetex -> /usr/share/texmf/teTeX/bin/i586-linux/pdfetex*
-rwxr-xr-x 1 root root 628052 2005-03-22 10:52 /usr/bin/pdffonts*
-rwxr-xr-x 1 root root 628140 2005-03-22 10:52 /usr/bin/pdfimages*
-rwxr-xr-x 1 root root 629108 2005-03-22 10:52 /usr/bin/pdfinfo*
lrwxrwxrwx 1 root root 48 2005-04-20 06:35 /usr/bin/pdfjadetex -> /usr/share/texmf/teTeX/bin/i586-linux/pdfjadetex*
lrwxrwxrwx 1 root root 46 2005-04-20 06:35 /usr/bin/pdflatex -> /usr/share/texmf/teTeX/bin/i586-linux/pdflatex*
-rwxr-xr-x 1 root root 364 2005-05-04 14:19 /usr/bin/pdfopt*
lrwxrwxrwx 1 root root 44 2005-04-20 06:35 /usr/bin/pdftex -> /usr/share/texmf/teTeX/bin/i586-linux/pdftex*
-rwxr-xr-x 1 root root 2773912 2005-03-19 12:00 /usr/bin/pdftk*
-rwxr-xr-x 1 root root 636396 2005-03-19 11:54 /usr/bin/pdftohtml*
-rwxr-xr-x 1 root root 746884 2005-03-22 10:52 /usr/bin/pdftoppm*
-rwxr-xr-x 1 root root 685164 2005-03-22 10:52 /usr/bin/pdftops*
-rwxr-xr-x 1 root root 670944 2005-03-22 10:52 /usr/bin/pdftotext*
lrwxrwxrwx 1 root root 47 2005-04-20 06:35 /usr/bin/pdfxmltex -> /usr/share/texmf/teTeX/bin/i586-linux/pdfxmltex*
lrwxrwxrwx 1 root root 43 2005-04-17 10:48 /usr/bin/pdfxtex -> ../share/texmf/teTeX/bin/i586-linux/pdfxtex*
So you see, there are many options, including HTML conversion.
Thank you. Best regards, Maura
Randall Schulz
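If the completion key does nothing in a given shell, the same list of PATH commands beginning with "pdf" can be produced with a few lines of Python. This is only a sketch; the function name `commands_with_prefix` is made up for illustration, and it relies solely on the standard library.

```python
import os


def commands_with_prefix(prefix, path=None):
    """Return the sorted names of executables on PATH that start with prefix."""
    if path is None:
        path = os.environ.get("PATH", "")
    found = set()
    for directory in path.split(os.pathsep):
        if not os.path.isdir(directory):
            continue  # skip empty or stale PATH entries
        for name in os.listdir(directory):
            full = os.path.join(directory, name)
            if (name.startswith(prefix)
                    and os.path.isfile(full)
                    and os.access(full, os.X_OK)):
                found.add(name)
    return sorted(found)


if __name__ == "__main__":
    for name in commands_with_prefix("pdf"):
        print(name)
```

Symlinks such as the teTeX ones in the listing are followed by `os.path.isfile`, so they show up too as long as their targets exist.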
Maura Edelweiss Monville wrote:
Graham Smith wrote:
On Sun, 28 Aug 2005 14:22, Maura Edelweiss Monville wrote:
I need to transform PDF to text (ascii) ... How can I do that ?
if you open it with Acrobat reader 7 (available as RPM here:http://ardownload.adobe.com/pub/adobe/reader/unix/7x/7.0/enu/AdobeReader_enu...) you can select the text you want, or select all if you like.
You can then paste it into OpenOffice. If you get the 'selected clipboard format not available' message, click on the Klipper icon in the right-hand corner of the taskbar, and select the clipboard content you want. Then paste. This is by far the easiest solution FX
On Sun August 28 2005 4:12 pm, FX Fraipont wrote:
if you open it with Acrobat reader 7 (available as RPM here:http://ardownload.adobe.com/pub/adobe/reader/unix/7x/7.0/enu/AdobeReader_enu-7.0.0-2.i386.rpm) you can select the text you want, or select all if you like.
You can then paste it into OpenOffice. If you get the 'selected clipboard format not available' message, click on the Klipper icon in the right-hand corner of the taskbar, and select the clipboard content you want. Then paste.
This is by far the easiest solution
FX
FX, perhaps this is a silly question, but in doing it as you recommend is it possible to preserve any of the .pdf formatting? I tried this with a bank statement and the result was a long OpenOffice document with everything jammed against the left margin. All of the columns of numbers that displayed so nicely in the .pdf were dismembered. The resulting text document was unusable. Is there a way to copy the .pdf into text yet preserve the document so it can be read? If not, what's the use of converting the .pdf into a text document? Thanks! ;o) Gil
On Mon, 29 Aug 2005 05:18, Gil Weber wrote:
FX, perhaps this is a silly question, but in doing it as you recommend is it possible to preserve any of the .pdf formatting? I tried this with a bank statement and the result was a long OpenOffice document with everything jammed against the left margin. All of the columns of numbers that displayed so nicely in the .pdf were dismembered.
The resulting text document was unusable.
Is there a way to copy the .pdf into text yet preserve the document so it can be read? If not, what's the use of converting the .pdf into a text document?
As I mentioned earlier, try opening the pdf with Kword. The formatting will be retained to a fair degree, depending on how many graphics are in the file. -- Regards, Graham Smith
Graham Smith wrote:
On Mon, 29 Aug 2005 05:18, Gil Weber wrote:
FX, perhaps this is a silly question, but in doing it as you recommend is it possible to preserve any of the .pdf formatting? I tried this with a bank statement and the result was a long OpenOffice document with everything jammed against the left margin. All of the columns of numbers that displayed so nicely in the .pdf were dismembered.
The resulting text document was unusable.
Is there a way to copy the .pdf into text yet preserve the document so it can be read? If not, what's the use of converting the .pdf into a text document?
As I mentioned earlier, try opening the pdf with Kword. The formatting will be retained to a fair degree, depending on how many graphics are in the file.
The following is a chunk of what I get by opening or importing the pdf file into kword: ÐgfihFjWkmlmh?nDopkpqsrfdjtnDjtfiu#h?kmh@kvlxw#yRc©Î?ÏúÁÃÂ?Ù?À?ÂsÊ?ÁcÓ?ÀlÌlÂ?Ïúë0ÁÃÂ? À#cCÍðÁ©Ó?Ó½ÀlÍ¢ÊËÎ?ÒvÀqë?À?ÂËÀ?ϸÊêµÂË Regards, maura
Gil Weber wrote:
FX, perhaps this is a silly question, but in doing it as you recommend is it possible to preserve any of the .pdf formatting? I tried this with a bank statement and the result was a long OpenOffice document with everything jammed against the left margin. All of the columns of numbers that displayed so nicely in the .pdf were dismembered.
The resulting text document was unusable.
You don't preserve all of the formatting, but to say that it is unusable is certainly an exaggeration. I have just tried it on an official examination paper questionnaire, and you lose the graphic at the beginning of the paper and you'd have to add line separators and tabs, but it is certainly usable to me. By the way, Maura's original request was "txt - ascii". I have then tried Graham's suggestion of using Kword to open the pdf, and I must concede defeat :-) : kword keeps most of the formatting. The only problem was that the original graphic was turned upside down, which can be easily fixed. fx
On Mon August 29 2005 3:30 am, FX Fraipont wrote:
Gil Weber wrote:
FX, perhaps this is a silly question, but in doing it as you recommend is it possible to preserve any of the .pdf formatting? I tried this with a bank statement and the result was a long OpenOffice document with everything jammed against the left margin. All of the columns of numbers that displayed so nicely in the .pdf were dismembered.
The resulting text document was unusable.
You don't preserve all of the formatting, but to say that it is unusable is certainly an exaggeration. I have just tried it on an official examination paper questionnaire, and you lose the graphic at the beginning of the paper and you'd have to add line separators and tabs, but it is certainly usable to me. By the way, Maura's original request was "txt - ascii"
I guess "unusable" is in the eyes of the beholder. ;o) I used the "save as text" command in Adobe Reader 7. Well, my bank statement has 9 columns (headings) with numbers below each heading. All 9 columns end up one on top of the other against the left margin. The only way to use the document is to reconstruct it using line separators and tabs (as you suggested). But the amount of work on a document such as this bank statement is beyond consideration. Perhaps line separators and tabs might be considered for a .pdf that contained only text in paragraph form (i.e., a document created using MS Word). That I can see might be fairly easy to reconstruct. I was hoping that there might be a relatively quick and easy way to do it for a document made up mostly of columns and rows. Perhaps there is not. Thanks. I appreciate your thoughts and suggestions. ;o) Gil
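For column-heavy documents like this one, a possible route (a sketch under assumptions, not something tested in the thread) is to run `pdftotext -layout statement.pdf statement.txt`, since the pdftotext man page describes -layout as keeping the original physical layout as well as it can, and then split each line on runs of spaces. The file names, function names and sample rows below are all invented for illustration.

```python
import re

# Invented sample of what layout-preserved output might look like.
SAMPLE = (
    "Date        Description         Amount\n"
    "2005-08-01  Opening balance    1000.00\n"
    "\n"
    "2005-08-15  Cash withdrawal     -50.00\n"
)


def split_columns(line):
    """Split one layout-preserved line on runs of two or more spaces."""
    return [cell for cell in re.split(r" {2,}", line.strip()) if cell]


def table_rows(text):
    """Turn layout-preserved text into rows of cells, skipping blank lines."""
    return [split_columns(line) for line in text.splitlines() if line.strip()]


for row in table_rows(SAMPLE):
    print("\t".join(row))  # tab-separated, ready to paste into a spreadsheet
```

Single spaces inside a cell ("Opening balance") survive, but an empty cell shifts the remaining columns to the left, so a statement with gaps would still need checking by hand.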
Maura Edelweiss Monville wrote:
I need to transform PDF to text (ascii) ... How can I do that ? I opened a pdf file with acroread and selected all (just the 1st page) and clicked on copy in the Edit drop-down menu. Then I tried to paste into a OpenOffice or a StarOffice text document but in both cases the message was : "the requested clipboard is not available" and nothing got pasted even by selecting an HTML document ...
Which version of Open Office (and Suse) are you using? In the 2.0 beta version that came with 9.3 this works without problems. Alternatively (there always is another way) you could try to use a text editor like kate. That should also work. Regards, -- Jos van Kan www.josvankan.tk
Jos van Kan wrote:
Maura Edelweiss Monville wrote:
I need to transform PDF to text (ascii) ... How can I do that ? I opened a pdf file with acroread and selected all (just the 1st page) and clicked on copy in the Edit drop-down menu. Then I tried to paste into a OpenOffice or a StarOffice text document but in both cases the message was : "the requested clipboard is not available" and nothing got pasted even by selecting an HTML document ...
Which version of Open Office (and Suse) are you using? In the 2.0 beta version that came with 9.3 this works without problems. Alternatively (there always is another way) you could try to use a text editor like kate. That should also work.
Regards,
Again, I can paste into kate but the result is "Chinese" or "Arabic" ... a sequence of symbols and characters. I run SuSE 9.1 with OpenOffice 1.1. I do not have the time to change the operating system every 6 months. Maura
Maura Edelweiss Monville wrote:
Again, I can paste into kate but the result is "Chinese" or "Arabic" ... a sequence of symbols and characters.
Tools>Coding lets you change from Japanese to Western Regards, -- Jos van Kan www.josvankan.tk
Hello,
In the Message: [suse-linux-e ML: No.245618] with the date of Sun, 28 Aug 2005 00:22:46 -0400 [Maura] == Maura Edelweiss Monville
has written:
Maura> I found the tool "pdf2ps" but it's not very useful as I need to
Maura> transform PDF to text (ascii) ...
There exists the tool "pdftotext".
---
Masaru Nomiya mail-to: nomiyac360@mg.point.ne.jp
"Bill! You married with Computers. Not with Me!" "No..., with money."
I just installed the acrobat reader 7 rpm from adobe; it can save as txt.
On the download page, don't choose "unix" but "others" (that is linux for adobe :-)
jdd
--
to write to me, go to: http://www.dodin.net http://valerie.dodin.net http://arvamip.free.fr
Masaru Nomiya wrote:
Hello,
In the Message: [suse-linux-e ML: No.245618] with the date of Sun, 28 Aug 2005 00:22:46 -0400 [Maura] == Maura Edelweiss Monville
has written: Maura> I found the tool "pdf2ps" but it's not very useful as I need to Maura> transform PDF to text (ascii) ...
There exists the tool "pdftotext".
Where? I tried to type it on the command line but the system cannot recognize it. Thank you, maura
--- Masaru Nomiya mail-to: nomiyac360@mg.point.ne.jp
"Bill! You married with Computers. Not with Me!" "No..., with money."
On Sunday 28 August 2005 07:06 am, Maura Edelweiss Monville wrote:
Masaru Nomiya wrote:
Hello,
>In the Message: [suse-linux-e ML: No.245618] > with the date of Sun, 28 Aug 2005 00:22:46 -0400 >[Maura] == Maura Edelweiss Monville
> has written: Maura> I found the tool "pdf2ps" but it's not very useful as I need to Maura> transform PDF to text (ascii) ...
There exists the tool "pdftotext".
Where? I tried to type it on the command line but the system cannot recognize it.
Thank you, maura
--- Masaru Nomiya mail-to: nomiyac360@mg.point.ne.jp ===========
Maura, If all else fails, have you tried the pdf2ps conversion then loading the ps file into OOo or koffice to save the ps file to text? I, like you, fail to find a pdftotext program. regards, Lee -- --- KMail v1.8.2 --- SuSE Linux Pro v9.2 --- Registered Linux User #225206 There's no problem so awful that you can't add some guilt to it and make it even worse! ...Calvin & Hobbes
On Sunday 28 August 2005 10:30 am, BandiPat wrote:
Maura, If all else fails, have you tried the pdf2ps conversion then loading the ps file into OOo or koffice to save the ps file to text? I, like you, fail to find a pdftotext program.
regards, Lee
I retract that last sentence, as the pdftotext is installed. As someone else pointed out, it appears to be part of the xpdf files. Lee
Hi, On Sun, 28 Aug 2005 00:22:46 -0400 Maura Edelweiss Monville <.> wrote:
I need to transform PDF to text (ascii) ... How can I do that ?
Just a remark. We have plenty of pdfs which could be converted to plain text exclusively through optical character recognition (OCR)! First we also tried to get out the text only, but the trick is that all of the pages in these pdfs are inserted as tiff files! Only human eyes recognize the bla-bla as _text_; for a computer they stay only _images_... Pelibali
Pelibali, On Sunday 28 August 2005 03:19, pelibali wrote:
Hi,
On Sun, 28 Aug 2005 00:22:46 -0400
Maura Edelweiss Monville <.> wrote:
I need to transform PDF to text (ascii) ... How can I do that ?
Just a remark. We have plenty of pdfs, which could be converted to plain text exclusively through opt. character recognition (~OCR)! First we also tried to get out the text only, but the trick is, that all of the pages in these pdfs are inserted as tiff files! Only human eyes recognize the bla-bla as _text_, for a computer they stay only _images_...
ACM digitized its library this way, but they included the OCR-ed text _and_ the scanned page images in the PDF files they distribute. When you read or print the document, you see the scanned page images. When you copy or search, the OCR-ed text is used. The OCR is predictably flawed, but the scheme is about the best you can hope for with fully automated digitization of a very large library.
Pelibali
Randall Schulz
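A quick way to tell the image-only case from a PDF that carries a text layer is to look at how much text pdftotext returns at all. The sketch below is a rough heuristic, not a method from the thread: the function names and the 20-character threshold are arbitrary choices for illustration, and it assumes pdftotext accepts "-" as the output file to mean standard output, as its man page describes.

```python
import shutil
import subprocess


def has_text_layer(extracted_text, min_chars=20):
    """Heuristic: enough non-whitespace characters means a text layer exists."""
    visible = [ch for ch in extracted_text if not ch.isspace()]
    return len(visible) >= min_chars


def extract_text(pdf_path):
    """Return pdftotext's output, or None if the tool is missing or fails."""
    if shutil.which("pdftotext") is None:
        return None
    # "-" as the text file asks pdftotext to write to standard output.
    result = subprocess.run(["pdftotext", pdf_path, "-"], capture_output=True)
    if result.returncode != 0:
        return None
    # Latin-1 decoding never raises, which is enough for counting characters.
    return result.stdout.decode("latin-1", errors="replace")
```

A scanned document typically yields only page breaks and blank lines, so `has_text_layer` returns False and OCR is the remaining option. A file like the ACM ones described above would pass the check because the OCR-ed text is stored alongside the page images.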
participants (11)
- BandiPat
- FX Fraipont
- Gil Weber
- Graham Smith
- jdd sur free
- Jos van Kan
- Masaru Nomiya
- Maura Edelweiss Monville
- Paul Cartwright
- pelibali
- Randall R Schulz