scy,
@scy@chaos.social

I need to convert HTML to Markdown and I'm looking for a tool to do that.

The output should

• preserve line breaks in paragraphs
• not contain additional, unnecessary linebreaks (e.g. 4 empty lines between paragraphs)
• be configurable (e.g. whether to use * or _ for emphasis, or * vs - for unordered lists)
• if possible, allow me to hook into details (e.g. to convert <pre class="shell"> to ```sh)

or . Alternatively, what's a really configurable prettifier?

:BoostOK:

weberc2,
@weberc2@stranger.social

@scy Probably not a helpful response, but I’m pretty sure HTML is Markdown. The latter is a superset of the former AFAIK.

scy,
@scy@chaos.social

@weberc2 All three statements in your post are correct ;P

Brahn,
@Brahn@hachyderm.io

@scy Seems like https://github.com/matthewwithanm/python-markdownify would give you customization options. I threw this together.

https://gist.github.com/bdmorin/8ef7a9e2082fa7bc90c9878a34b37a59

Hope it gives you some ideas.

scy,
@scy@chaos.social

@Brahn Thanks :)

html2text struggles a bit with fenced code blocks though, and it reformats line breaks.

f11xter,
@f11xter@cupoftea.social

@scy I've recently been using dprint for prettifying. It's got plenty of options (though I don't know where your "really configurable" threshold is)

scy,
@scy@chaos.social

@f11xter It has some configuration options, but doesn't allow choosing the character for unordered lists, for example.

dnkrupinski,
@dnkrupinski@hannover.town
scy,
@scy@chaos.social

@dnkrupinski congrats, this is probably the most useless reply in the whole thread

kellyjonbrazil,
@kellyjonbrazil@sfba.social

@scy I'm using the mistune python library to do this for my jc-web project.

https://github.com/kellyjonbrazil/jc-web/blob/master/app.py

scy,
@scy@chaos.social

@kellyjonbrazil Unless I am mistaken, the code you're linking to converts Markdown to HTML, not the other way around.

shochdoerfer,
@shochdoerfer@phpc.social

@scy Back in the day, I had great success converting our old company blog from HTML to Markdown via the league/html-to-markdown Composer package.

I documented my findings here: https://blog.bitexpert.de/blog/silverstripe-to-docusaurus

heiglandreas,
@heiglandreas@phpc.social

@shochdoerfer @scy I'm still using that in one package. It is reaaaaaaaally slow, but in most cases that should not be an issue... 😁

shochdoerfer,
@shochdoerfer@phpc.social

@heiglandreas @scy I don't remember having issues with the performance. The few hundred blog posts got converted quite quickly, I think.

heiglandreas,
@heiglandreas@phpc.social

@shochdoerfer yeah. "Quite quickly" is a rather relative term 😁

With several hundred up to a thousand entries, we see a delay measurable in seconds. Just for the conversion. The individual conversion is quite fast, and for a one-off solution it is for sure not an issue. But when you do that at scale it suddenly has an impact... 😕

/cc @scy

mdk,
@mdk@mamot.fr

@scy I haven't looked at the configurability details of html2text, but it converts html to Markdown. Yes the one from Aaron Swartz: https://github.com/Alir3z4/html2text/

scy,
@scy@chaos.social

@mdk Looks interesting, thanks, I'll give it a go.

dunkelstern,
@dunkelstern@kampftoast.de

@scy I think I would try to roll my own with BeautifulSoup if I knew which HTML would be the input. If the input is unknown… phew, tough call.

scy,
@scy@chaos.social

@dunkelstern Is there an easy way to "strip" the whole input document in bs4, like throw away indentation and line breaks? Because markdownify chokes on that. E.g.

<p>
foo
</p>

becomes

foo

in Markdown (with a fucking leading space lol, your Fedi client might not display it)

dunkelstern,
@dunkelstern@kampftoast.de

@scy get_text() has a strip=True parameter which strips whitespace from each text piece. The first parameter of that function (separator) is the string used to join the pieces. If you want a generator you can use .stripped_strings on the tag or parser instance and iterate over the stripped strings.
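
A small illustration of the two bs4 calls mentioned above, assuming Beautiful Soup 4 is installed and using the built-in "html.parser" backend. The sample HTML is made up for the example. Note that both calls return plain text only, so the markup itself is lost.

```python
from bs4 import BeautifulSoup

# Made-up sample input with indentation and line breaks around the text.
html = "<p>\n  foo\n  <b>bar</b>\n</p>"
soup = BeautifulSoup(html, "html.parser")

# separator=" " is the string placed between text pieces;
# strip=True trims each piece and drops pieces that end up empty.
flat = soup.get_text(" ", strip=True)

# .stripped_strings is a generator; list() materialises it.
pieces = list(soup.stripped_strings)

print(flat)
print(pieces)
```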

scy,
@scy@chaos.social

@dunkelstern But I'd still need to iterate over the tree and modify elements one by one, right? I was hoping to avoid that.
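
One possible way to avoid editing the tree element by element is a single streaming pass that re-emits the HTML with whitespace collapsed before handing it to a converter. The sketch below uses only Python's standard library (html.parser), not bs4. The names WhitespaceNormalizer, normalize_html and BLOCK_TAGS are hypothetical, the block-tag list is an incomplete assumption, and comments, doctypes and script/style content are not handled.

```python
import re
from html import escape
from html.parser import HTMLParser

# Assumption: a rough, incomplete list of block-level tags.
BLOCK_TAGS = {"p", "div", "ul", "ol", "li", "blockquote", "pre", "table",
              "tr", "td", "th", "h1", "h2", "h3", "h4", "h5", "h6"}


class WhitespaceNormalizer(HTMLParser):
    """Collects (kind, text, tag) tokens, collapsing whitespace outside <pre>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tokens = []
        self.pre_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.pre_depth += 1
        # Re-emit the start tag exactly as it appeared in the input.
        self.tokens.append(("tag", self.get_starttag_text(), tag))

    def handle_startendtag(self, tag, attrs):
        self.tokens.append(("tag", self.get_starttag_text(), tag))

    def handle_endtag(self, tag):
        if tag == "pre" and self.pre_depth:
            self.pre_depth -= 1
        self.tokens.append(("tag", f"</{tag}>", tag))

    def handle_data(self, data):
        if self.pre_depth:
            # Keep preformatted text untouched (apart from re-escaping).
            self.tokens.append(("pre_text", escape(data, quote=False), None))
        else:
            collapsed = re.sub(r"[ \t\r\n\f]+", " ", data)
            self.tokens.append(("text", escape(collapsed, quote=False), None))


def _is_block(token):
    return token is None or (token[0] == "tag" and token[2] in BLOCK_TAGS)


def normalize_html(html):
    parser = WhitespaceNormalizer()
    parser.feed(html)
    parser.close()
    tokens = parser.tokens
    out = []
    for i, (kind, text, _tag) in enumerate(tokens):
        if kind == "text":
            prev_tok = tokens[i - 1] if i > 0 else None
            next_tok = tokens[i + 1] if i + 1 < len(tokens) else None
            # Drop the space that touches a block boundary or the document edge.
            if _is_block(prev_tok):
                text = text.lstrip(" ")
            if _is_block(next_tok):
                text = text.rstrip(" ")
        out.append(text)
    return "".join(out)


print(normalize_html("<p>\n  foo\n</p>"))
```

With the example from this thread, the paragraph comes out as `<p>foo</p>`, so a converter that is sensitive to indentation no longer sees the leading whitespace.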

scy,
@scy@chaos.social

Please do not suggest

• Pandoc (it doesn't allow configuring bullet or emphasis characters)
• Prettier (it doesn't allow configuring basically anything at all)
• the Python "markdownify" package (it has a lot of trouble with indentation and whitespace in the HTML)

daniel_bohrer,
@daniel_bohrer@chaos.social

@scy not a full solution, but if you decide to write something of your own, you could at least use pandoc for the heavy lifting of parsing the HTML, and transform it into an AST, and then use e.g. https://github.com/jgm/pandocfilters to format the AST into a Markdown representation of your liking.

But I hope someone has done that already, even though I cannot find it…
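
A rough sketch of that idea, using only the standard library instead of pandocfilters: `pandoc -f html -t json` emits the AST as JSON, and a small renderer walks it with configurable bullet and emphasis characters. The node shapes follow pandoc's JSON output, but only a handful of node types are covered here. The names render_document, render_blocks, render_inlines and LANG_ALIASES are hypothetical, and the shell-to-sh mapping is an assumption taken from the original question.

```python
import json

# Assumption: map HTML class names to the fence language you want.
LANG_ALIASES = {"shell": "sh"}


def render_inlines(inlines, emph="_"):
    parts = []
    for node in inlines:
        t, c = node["t"], node.get("c")
        if t == "Str":
            parts.append(c)
        elif t == "Space":
            parts.append(" ")
        elif t == "SoftBreak":
            parts.append("\n")  # keep line breaks inside paragraphs
        elif t == "Emph":
            parts.append(emph + render_inlines(c, emph) + emph)
        elif t == "Strong":
            parts.append(emph * 2 + render_inlines(c, emph) + emph * 2)
        elif t == "Code":
            parts.append("`" + c[1] + "`")  # c is [attr, text]
        else:
            raise NotImplementedError(t)
    return "".join(parts)


def render_blocks(blocks, bullet="-", emph="_"):
    chunks = []
    for node in blocks:
        t, c = node["t"], node.get("c")
        if t in ("Para", "Plain"):
            chunks.append(render_inlines(c, emph))
        elif t == "Header":
            level, _attr, inlines = c
            chunks.append("#" * level + " " + render_inlines(inlines, emph))
        elif t == "CodeBlock":
            (_ident, classes, _kv), text = c
            lang = LANG_ALIASES.get(classes[0], classes[0]) if classes else ""
            chunks.append(f"```{lang}\n{text}\n```")
        elif t == "BulletList":
            items = []
            for item in c:  # each item is a list of blocks
                body = render_blocks(item, bullet, emph)
                items.append(bullet + " " + body.replace("\n", "\n  "))
            chunks.append("\n".join(items))
        else:
            raise NotImplementedError(t)
    # Exactly one empty line between blocks.
    return "\n\n".join(chunks)


def render_document(json_text, **options):
    """Render the JSON text produced by `pandoc -f html -t json`."""
    return render_blocks(json.loads(json_text)["blocks"], **options)
```

Something like `render_document(pandoc_json, bullet="*", emph="*")` would then switch both characters in one place. Unsupported node types raise NotImplementedError rather than being silently dropped.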

scy,
@scy@chaos.social

@daniel_bohrer Yeah, that's always an option, but an option I'd like to avoid 😬 😉

ki,
@ki@chaos.social

@scy
how about one of these tools + sed or awk?

scy,
@scy@chaos.social

@ki Works for super simple Markdown documents, but just think about having to track whether you're in a code block (and need to keep it untouched) or not
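
To make the state-tracking concern concrete, here is a minimal sketch of a post-processing pass that rewrites unordered-list bullets while leaving fenced code blocks alone. It is written in Python rather than sed or awk, and the name rewrite_bullets is hypothetical. It only handles fenced blocks and thematic breaks; indented code blocks and list-like lines inside other constructs are not detected, which is the kind of edge case that makes this approach fragile.

```python
import re

FENCE = re.compile(r"^ {0,3}(`{3,}|~{3,})")
THEMATIC_BREAK = re.compile(r" {0,3}([*_-])(?:[ \t]*\1){2,}[ \t]*")
BULLET = re.compile(r"^(\s*)[*+-](\s+)")


def rewrite_bullets(markdown, bullet="-"):
    out = []
    fence = None  # marker that opened the current fenced block, if any
    for line in markdown.split("\n"):
        m = FENCE.match(line)
        if m:
            marker = m.group(1)
            if fence is None:
                fence = marker  # opening fence
            elif (marker[0] == fence[0] and len(marker) >= len(fence)
                  and line.strip() == marker):
                fence = None  # matching closing fence
            out.append(line)
            continue
        if fence is None and not THEMATIC_BREAK.fullmatch(line):
            # Only touch the bullet character, keep indentation and spacing.
            line = BULLET.sub(
                lambda b: f"{b.group(1)}{bullet}{b.group(2)}", line, count=1)
        out.append(line)
    return "\n".join(out)


sample = "* a\n* b\n\n```sh\n* not a list\n```\n\n* c"
print(rewrite_bullets(sample))
```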

ki,
@ki@chaos.social

@scy
I see... could work with awk but I get that you don't want to do that...
the library version of pandoc provides an abstract syntax tree iirc, but that requires Haskell programming. I'm out of ideas then, sorry

coca,
@coca@chaos.social

@scy Tried Pandoc?
