I think an open-source auto tagger for PDFs is very possible.... - Random

mush42, 6 months ago

I think an open-source auto tagger for PDFs is very possible.
It will make it easier to convert PDFs to highly structured HTML documents.

Anyone interested in tackling this challenge with me?

Adobe already took the lead:
https://news.adobe.com/news/news-details/2023/Media-Alert-Adobe-Scales-PDF-Accessibility-With-Adobe-Sensei-AI/default.aspx

scruss, 6 months ago

@mush42 Knowing Adobe, it will be for a very limited subset of PDFs produced by their own software. Never trust a company that has two incompatible standards for managing form data ...

(I used to be a prepress programmer, so I've experienced a whole load of really terrible PDFs. At best, they're digital marks on paper. I also remember the whole "tagged PDF" thing from the early 2000s)

mush42, 6 months ago

@scruss
For this project, I'd forgo parsing the PDF stream at first and extract the semantic structure from a visual rendition of each page. Then I'd use that semantic metadata to guide parsing the PDF stream and extracting the text.
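
To make the second step concrete, here is a rough sketch in plain Python. It assumes the first step has already produced labelled layout regions from the page image, and that some PDF parser has produced text runs with bounding boxes. The `Region` and `Span` classes, the `tag_page` function, and the sample coordinates are all hypothetical names and values made up for illustration, not any library's API, and the reading order assumes a simple single-column page.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Region:
    """Layout region found on the rendered page (hypothetical detector output)."""
    role: str   # HTML tag to emit, e.g. "h1" or "p"
    box: tuple  # (x0, y0, x1, y1), with y growing downward


@dataclass
class Span:
    """Text run taken from the PDF content stream (hypothetical parser output)."""
    text: str
    box: tuple  # same coordinate system as Region.box


def center_inside(inner, outer):
    """Return True if the center of box `inner` lies within box `outer`."""
    cx = (inner[0] + inner[2]) / 2
    cy = (inner[1] + inner[3]) / 2
    return outer[0] <= cx <= outer[2] and outer[1] <= cy <= outer[3]


def tag_page(regions, spans):
    """Assign each text span to a visual region and emit one HTML element per region."""
    parts = []
    # Assumption: single-column page, so top-to-bottom then left-to-right is reading order.
    for region in sorted(regions, key=lambda r: (r.box[1], r.box[0])):
        inside = [s for s in spans if center_inside(s.box, region.box)]
        inside.sort(key=lambda s: (s.box[1], s.box[0]))
        text = " ".join(s.text for s in inside)
        if text:
            parts.append(f"<{region.role}>{escape(text)}</{region.role}>")
    return "\n".join(parts)


# Made-up sample data: two regions and three text runs, deliberately out of order.
regions = [Region("p", (50, 100, 550, 200)), Region("h1", (50, 40, 550, 80))]
spans = [
    Span("tagged output.", (50, 130, 200, 145)),
    Span("Auto-tagging PDFs", (50, 45, 300, 75)),
    Span("Layout first, then", (50, 105, 220, 120)),
]
html_out = tag_page(regions, spans)
print(html_out)
```

The point of the design is that the PDF stream is only consulted for the text itself, while the order and roles come from the visual pass, so a messy content stream should matter less. Multi-column pages, tables, and spans that straddle two regions would need more than this.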
