mush42,
@mush42@hachyderm.io

I think an open-source auto-tagger for PDFs is entirely feasible.
It would make it easier to convert PDFs into highly structured HTML documents.

Anyone interested in tackling this challenge with me?

Adobe has already taken the lead:
https://news.adobe.com/news/news-details/2023/Media-Alert-Adobe-Scales-PDF-Accessibility-With-Adobe-Sensei-AI/default.aspx

scruss,
@scruss@xoxo.zone

@mush42 Knowing Adobe, it will be for a very limited subset of PDFs produced by their own software. Never trust a company that has two incompatible standards for managing form data ...

(I used to be a prepress programmer, so I've experienced a whole load of really terrible PDFs. At best, they're digital marks on paper. I also remember the whole "tagged PDF" thing from the early 2000s)

mush42,
@mush42@hachyderm.io

@scruss
For this project, I'd forgo parsing the PDF stream at first and extract the semantic structure from a visual rendition of each page. Then I'd use that semantic metadata to guide parsing the PDF stream and extracting the text.
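
A minimal sketch of how that second step might look, assuming a layout model has already labelled regions on the rendered page and a PDF parser has already produced text spans with bounding boxes. The names Region, Span, and tag_spans are hypothetical, made up for illustration, and the coordinates assume y grows downward; none of this comes from an existing library.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Region:
    """A labelled area found on the rendered page (hypothetical detector output)."""
    tag: str    # HTML tag to emit, e.g. "h1" or "p"
    box: tuple  # (x0, y0, x1, y1), with y growing downward


@dataclass
class Span:
    """A run of text taken from the PDF stream, with its position on the page."""
    text: str
    box: tuple  # (x0, y0, x1, y1), same coordinate system as Region


def center(box):
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)


def contains(box, point):
    x0, y0, x1, y1 = box
    x, y = point
    return x0 <= x <= x1 and y0 <= y <= y1


def tag_spans(regions, spans):
    """Assign each span to the region containing its center and emit tagged HTML."""
    lines = []
    # Naive reading order: top to bottom, then left to right.
    for region in sorted(regions, key=lambda r: (r.box[1], r.box[0])):
        inside = [s for s in spans if contains(region.box, center(s.box))]
        inside.sort(key=lambda s: (s.box[1], s.box[0]))
        text = " ".join(s.text for s in inside)
        if text:
            lines.append(f"<{region.tag}>{escape(text)}</{region.tag}>")
    return "\n".join(lines)


if __name__ == "__main__":
    regions = [Region("p", (0, 50, 200, 100)), Region("h1", (0, 0, 200, 40))]
    spans = [
        Span("world", (60, 60, 120, 70)),
        Span("Hello", (10, 60, 50, 70)),
        Span("Title", (10, 10, 80, 30)),
    ]
    print(tag_spans(regions, spans))
```

The hard parts (detecting regions from the rendition, handling multi-column reading order, tables, and lists) are deliberately left out; this only shows how visual labels and stream text could be joined by geometry.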
