On Mon, 1 December 2014 15:48:17, Carlos E. R. wrote:
On 2014-12-01 11:14, Vojtěch Zeisek wrote:
Hi, is there any working tool that can add a text layer to a scanned PDF? I tried YAGF (a front end for cuneiform and/or tesseract), but it seems to have only the option of saving the text as a separate TXT file. Cuneiform doesn't offer this either, and I wasn't able to get tesseract to work (the OCRmyPDF script kept complaining that tesseract was missing even though it was installed). Scantailor seems to lack this functionality. Ocrad wouldn't start (and produced no error message), and gocr can't work with PDF... An old demo version of Vuescan I have requires libgtk-X11, which is unavailable.
You could set up a virtualized guest with an older openSUSE that has the required libraries.
And it is not the cheapest software... Tragedy. Any other suggestions? ;-)
If you ask for ideas... ;-)
Thanks :-P
Personally, I consider PDF a very bad format for scanned documents; I prefer DjVu, which is designed for that very purpose. It is, however, not popular. There is open source software to create the files, and text can be added, although I've never tried it. The available open source software is, let's say, fully functional but clumsy. There is proprietary software that is, they claim, much easier to use.
I'm not the author of those PDFs. I use scanned old books from sources like http://biodiversitylibrary.org/ or http://bibdigital.rjb.csic.es/ or even Google Groups or so. Often they have already been through OCR. Sometimes not. And then it is very, very useful...
However, open source software can easily be scripted...
Yes, convert all my thousands of PDFs to DjVu and go on... ;-)
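As a rough illustration of the scripting idea, here is a minimal Python sketch that walks a directory and builds one pdf2djvu call per PDF. It assumes pdf2djvu is installed and on the PATH; the function names and the dry-run default are hypothetical, not part of any tool mentioned in this thread. Note that converting a scanned PDF that has no OCR layer does not create one; at best, a text layer that already exists can be carried over.

```python
from __future__ import annotations

import subprocess
from pathlib import Path


def pdf2djvu_command(pdf_path: Path) -> list[str]:
    """Build a pdf2djvu call that writes a .djvu file next to the PDF."""
    djvu_path = pdf_path.with_suffix(".djvu")
    # -o names the output file; the input PDF is the positional argument.
    return ["pdf2djvu", "-o", str(djvu_path), str(pdf_path)]


def convert_tree(root: Path, dry_run: bool = True) -> list[list[str]]:
    """Collect one command per PDF below root; run them only if dry_run is False."""
    commands = [pdf2djvu_command(p) for p in sorted(root.rglob("*.pdf"))]
    if not dry_run:
        for cmd in commands:
            # check=True stops the batch at the first failed conversion.
            subprocess.run(cmd, check=True)
    return commands
```

Calling `convert_tree(Path("scans"))` only lists what would be run, so the commands can be reviewed before thousands of files are touched.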
some samples:
djvusmooth - Graphical Text Editor for DjVu
pdf2djvu - PDF to DjVu Converter
djvu2pdf - Converting Djvu Files to PDF Files
djvulibre-doc - Documentation for the DjVu - djvulibre
djvulibre-djview4 - Portable DjVu Qt4 Based Viewer and Browser Plugin
djvutxt - Extract the hidden text from DjVu documents.
djvused - Multi-purpose DjVu document editor.
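Since some of the scanned books have already been through OCR and some have not, one way to sort them is to ask djvutxt for the hidden text and see whether anything comes back. The sketch below assumes djvutxt (from djvulibre) is on the PATH and that it prints the text to standard output when no output file is given; the helper names are hypothetical.

```python
import subprocess


def has_text_layer(djvutxt_output: str) -> bool:
    """Treat output that is only whitespace (newlines, form feeds) as 'no text'."""
    return bool(djvutxt_output.strip())


def extract_hidden_text(djvu_file: str) -> str:
    """Run djvutxt on one file and return whatever hidden text it prints."""
    result = subprocess.run(
        ["djvutxt", djvu_file],
        capture_output=True,
        text=True,
        check=True,  # raise if djvutxt cannot read the file
    )
    return result.stdout
```

A file for which `has_text_layer(extract_hidden_text(path))` is false is a candidate for an OCR pass; the check says nothing about the quality of a text layer that does exist.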
djvulibre - An Open Source Implementation of DjVu
DjVu is a Web-centric format and software platform for distributing documents and images. DjVuLibre is an open source (GPL) implementation of DjVu, including viewers, browser plug-ins, decoders, simple encoders, and utilities. DjVu can advantageously replace PDF, PS, TIFF, JPEG, and GIF for distributing scanned documents, digital documents, or high-resolution pictures. DjVu content downloads faster, displays and renders faster, looks nicer on a screen, and consumes less client resources than competing formats. DjVu images display instantly and can be smoothly zoomed and panned with no lengthy rerendering. DjVu is used by hundreds of academic, commercial, governmental, and noncommercial Web sites around the world.
DjVuDocument
DjVuDocument is a compression technique specifically designed for color digital document images containing both pictures and text, such as a page of a magazine. DjVuDocument represents images as separately compressed layers. The foreground layer is usually compressed with DjVuBitonal and contains the text and drawings. The background layer is usually compressed with DjVuPhoto and contains the background texture and the pictures at lower resolution.
--
Vojtěch Zeisek
Community of the openSUSE GNU/Linux
http://www.opensuse.org/
http://trapa.cz/