Dear @linux@lemmy.ml and @academicchatter@a.gup.pe folks:... - academicchatter

ajayiyer, 17 days ago (edited 17 days ago)

Dear @linux and @academicchatter folks:

Could you please suggest libre/open-source tools that can extract text and images from scientific PDF documents?

P.S.: I'm on a Linux machine and would like something terminal-friendly, if possible!

Carunga, 17 days ago

Try Zotero. It is a complete literature database, but its PDF reader is very good at extracting images and text. It works on all OSes, on the web, and on mobile. The native Linux client has been very smooth for me. Oh, it doesn't do terminal, though. If you want to extract a large amount in an automated way, it's probably not the right tool.

CCRhode, 17 days ago

I’m mystified that poppler-utils is not a viable option. Of course the *.pdf file would have to include the text itself, but many do.
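
A minimal sketch of how poppler-utils could be scripted for this, assuming the package is installed and the PDF carries an embedded text layer. The helper names (`poppler_commands`, `extract`) and the `extracted` output directory are made up for illustration; `pdftotext -layout` and `pdfimages -png` are options the tools themselves provide.

```python
import shutil
import subprocess
from pathlib import Path


def poppler_commands(pdf, out_dir):
    """Build the pdftotext and pdfimages command lines for one PDF.

    Hypothetical helper: it only assembles argument lists, it runs nothing.
    """
    pdf = Path(pdf)
    out_dir = Path(out_dir)
    # pdftotext -layout tries to preserve the physical page layout in the text.
    text_cmd = ["pdftotext", "-layout", str(pdf), str(out_dir / (pdf.stem + ".txt"))]
    # pdfimages -png writes embedded images as PNG files named <root>-NNN.png.
    image_cmd = ["pdfimages", "-png", str(pdf), str(out_dir / pdf.stem)]
    return [text_cmd, image_cmd]


def extract(pdf, out_dir="extracted"):
    """Run both tools on one PDF; needs poppler-utils on PATH."""
    for tool in ("pdftotext", "pdfimages"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found; install poppler-utils")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for cmd in poppler_commands(pdf, out_dir):
        subprocess.run(cmd, check=True)
```

Calling `extract("paper.pdf")` would then leave `paper.txt` and the extracted images in `extracted/`. For a scanned PDF without a text layer, the text file will most likely come out empty and an OCR step is needed first.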

impure9435, 17 days ago

@ajayiyer OCRmyPDF is exactly what you are looking for
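
A rough sketch of driving OCRmyPDF from a script, assuming the `ocrmypdf` command-line tool is installed. The helper names (`ocrmypdf_command`, `ocr_with_sidecar`) and the file names are hypothetical; `--skip-text` (leave pages that already have text untouched) and `--sidecar` (also write the recognized text to a plain-text file) are options of the tool itself. Note that this covers the text side only, not image extraction.

```python
import shutil
import subprocess


def ocrmypdf_command(pdf, out_pdf, sidecar_txt):
    """Build an ocrmypdf command line (hypothetical helper, runs nothing)."""
    return [
        "ocrmypdf",
        "--skip-text",               # do not re-OCR pages that already contain text
        "--sidecar", str(sidecar_txt),  # write the OCR text to this file as well
        str(pdf),
        str(out_pdf),
    ]


def ocr_with_sidecar(pdf, out_pdf, sidecar_txt):
    """Run OCRmyPDF on one file; needs the ocrmypdf executable on PATH."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf not found; install OCRmyPDF first")
    subprocess.run(ocrmypdf_command(pdf, out_pdf, sidecar_txt), check=True)
```

For example, `ocr_with_sidecar("scan.pdf", "scan_ocr.pdf", "scan.txt")` would produce a searchable PDF plus a text file next to it.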

Shihali, 17 days ago

gImageReader is a graphical front-end to the open-source OCR program Tesseract, so that might be just what you’re looking for. With the default settings it doesn’t add the OCR’d text to the PDF, but you can have it do that.

seindal, 17 days ago

@ajayiyer @linux @academicchatter https://en.m.wikipedia.org/wiki/Pdftotext

BaalInvoker, 17 days ago

The first tool I can think of is LibreOffice Draw.

Maybe there are other tools, but I think LibreOffice Draw does the job pretty well.

Edit: If the PDF has written text, you may wanna use an OCR tool, but I don’t have any to suggest
