noleli,
@noleli@mastodon.social avatar

I have a 72-page, 36-year-old typewritten document that was scanned to a non-OCRed PDF in 2010. I’m trying to cleanly extract the text so I can convert it to markdown. All attempts at OCR have yielded extremely messy results. Is there a new generation of ML-based OCR I could try, or should I MTurk it?

noleli,
@noleli@mastodon.social avatar

Related…I was initially thinking I’d turn these HOA bylaws into an 11ty site, but now I’m thinking VitePress. Unless there’s an SSG designed for legal documents?

noleli,
@noleli@mastodon.social avatar

Turning legalese from the 1980s into well-structured markup has its challenges. Some numbered paragraphs have titles, making them subsections. But some are just numbered. But the numbering scheme makes clear that they’re intended to be subsections, not a list. So now I’m thinking I could reach for the oft-maligned <section> element, which, in my case, may or may not contain an <h3>, but from a counter’s perspective is still a proper section. Does that seem reasonable?
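
As a rough illustration of the structure described above, here is a minimal Python sketch that wraps each numbered paragraph in a <section> and emits an <h3> only when the paragraph has a title. The function name `render_section`, the `s-` id prefix, and the sample bylaw text are all made up for this example; nothing here comes from the actual document.

```python
from html import escape
from typing import Optional


def render_section(number: str, body: str, title: Optional[str] = None) -> str:
    """Render one numbered paragraph as a <section> (hypothetical helper).

    The <h3> is emitted only when the paragraph has a title, so titled and
    untitled paragraphs share the same container element.
    """
    # The id scheme "s-<number>" is an assumption made for this sketch.
    parts = [f'<section id="s-{escape(number, quote=True)}">']
    if title is not None:
        parts.append(f"  <h3>{escape(title)}</h3>")
    parts.append(f"  <p>{escape(body)}</p>")
    parts.append("</section>")
    return "\n".join(parts)


# Invented sample paragraphs: one titled, one untitled.
print(render_section("3.1", "Dues are payable annually.", title="Assessments"))
print(render_section("3.2", "Late payments accrue interest."))
```

Both calls produce a <section>, so anything that counts sections (for example a CSS counter) treats them the same whether or not a heading is present.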

mdekstrand,
@mdekstrand@hci.social avatar

@noleli sounds good to me! (and what problem do people have with <section>?)

noleli,
@noleli@mastodon.social avatar

@mdekstrand people discourage using it because of what it can (or historically did) imply about the levels of the headers inside it. My main take-aways from these posts are that it’s a long and sordid tale, and that it’s still not clear to me what to do when some sections that would have an h3 are missing the heading.

https://www.smashingmagazine.com/2020/01/html5-article-section/

https://adrianroselli.com/2016/08/there-is-no-document-outline-algorithm.html

@brucelawson @aardrian

mdekstrand,
@mdekstrand@hci.social avatar

@noleli @brucelawson @aardrian it looks like the primary problem is with <section> expecting a document outline algo that doesn’t exist, not with <section> as a container in combo with properly nested h1/2/3, except insofar as <section> may not have a purpose without said algo?

ISTM that your proposed use is closer to the container w/ proper header tags when applicable. Another option could be <h3 class="untitle">Untitled Section</h3>.

mdekstrand,
@mdekstrand@hci.social avatar

@noleli @brucelawson @aardrian I was also curious because I use <section> all the time with proper header tags, just as a nicer alternative to wrapping the section in a div, based on what I learned from MDN and not reading the background discussions. It looks like this probably isn’t creating any of the problems discussed in these pieces? (This is also what Pandoc does if you specify HTML5 output and section-divs.)

noleli,
@noleli@mastodon.social avatar

@mdekstrand @brucelawson @aardrian right, with proper headings it seems to be totally harmless, so your application seems fine. The one thing that threw me off was the discussion at the end of the Smashing article about <section> without headers. Like, would I need to give it an aria-label of “Untitled section”? And would that be better or worse than your suggestion of actual Untitled headers (even if visually hidden)?

aardrian,
@aardrian@toot.cafe avatar

@noleli @mdekstrand

<section> is no different than a <div>, even if it has a heading.

However, if you give it a name via ARIA (aria-label or aria-labelledby), then it becomes a region landmark.

Ref: https://www.w3.org/TR/html-aam-1.0/#el-section
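
To make the naming point concrete, here is a small Python sketch that builds an opening <section> tag with an accessible name: `aria-labelledby` when a heading exists, `aria-label` otherwise. The function name `open_section_tag`, the `h-` id prefix, and the fallback label "Section N" are assumptions made for this example, not something proposed in the thread.

```python
from html import escape
from typing import Optional


def open_section_tag(number: str, title: Optional[str] = None) -> str:
    """Return an opening <section> tag that carries an accessible name.

    Hypothetical helper: the id and label conventions are assumptions.
    """
    if title is not None:
        # Name the region from its visible heading; the matching <h3>
        # would need id="h-<number>" for this reference to resolve.
        return f'<section aria-labelledby="h-{escape(number, quote=True)}">'
    # No heading: fall back to a label derived from the paragraph number.
    label = f"Section {number}"
    return f'<section aria-label="{escape(label, quote=True)}">'


print(open_section_tag("4.1", "Quorum"))
print(open_section_tag("4.2"))
```

Whether every subsection of a long document should become a region landmark is a separate question that this sketch does not settle.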

@brucelawson
