Recurring #TechWriting issue that I still haven't found a good solution for:

Is anyone aware of a decently reliable automation for reformatting #Markdown text that previously used line length limits of 80 characters and forced line wraps, to one sentence per line?

Must preserve all Markdown formatting including tables and fenced code blocks.

(If you think this is trivial and can be solved with a sprinkling of regex — nope.)

Boosts appreciated!

Raphael May 23

@xahteiwi Best guess: Convert to HTML and back.

Identifying sentences remains a hard problem, but the rest should be mostly mechanical. I'd start with Pandoc, hoping that it can be configured to create Markdown in the required format.

Florian Haas May 23

@OmegaPolice No need for the HTML conversion if using pandoc — you can use `--wrap=none` to remove line breaks, even if you're staying within Markdown. However, sadly that doesn't solve the problem at all, because now you have lengthy paragraphs of multiple sentences.
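
For reference, a minimal sketch of that Markdown-to-Markdown pass, driven from Python. The helper names `pandoc_unwrap_command` and `unwrap_markdown` are made up for illustration, and it assumes `pandoc` is installed and on the `PATH`. A pandoc round trip may also normalize other Markdown syntax, so the output is worth diffing before committing.

```python
import subprocess


def pandoc_unwrap_command(path):
    """Build the pandoc call that rewrites Markdown without hard line wraps."""
    return ["pandoc", "--from=markdown", "--to=markdown", "--wrap=none", path]


def unwrap_markdown(path):
    """Run pandoc and return the unwrapped Markdown (one paragraph per line)."""
    result = subprocess.run(
        pandoc_unwrap_command(path),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

This yields one paragraph per line, not one sentence per line.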

Raphael May 23

@xahteiwi Ah, nice! 👍

That's probably the best you can get without throwing some serious NLP at it. Curious to see if I'm missing something!

Florian Haas May 23

@OmegaPolice I'm slowly coming to the conclusion that this style decision is a one-way function: if you write your original documents as one sentence per line, it is trivial to subsequently impose a line length limit. But once you have that limit, then unless you also mandate *sentences* of, say, <80 characters (not sure if that's ever useful; I doubt it), it's quite painful to go to one sentence per line.
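
To illustrate the easy direction, here is a minimal sketch that takes one paragraph written as one sentence per line and imposes a width limit using Python's standard `textwrap` module. The function name `wrap_to_width` is hypothetical, and the sketch only handles a plain prose paragraph, not tables, lists, or code blocks.

```python
import textwrap


def wrap_to_width(sentence_lines, width=80):
    """Join one-sentence-per-line prose and re-wrap it to a maximum width."""
    text = " ".join(line.strip() for line in sentence_lines)
    # Do not split long words (such as URLs) or break at hyphens.
    return textwrap.fill(
        text,
        width=width,
        break_long_words=False,
        break_on_hyphens=False,
    )
```

The sentence boundaries are lost in the output, which is exactly why the reverse direction is hard.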

David Chisnall (*Now with 50% more sarcasm!*) May 23

@xahteiwi @OmegaPolice

Boosted because I don't have an answer. TeX has some pretty good heuristics for working out when a . is a full stop that ends a sentence, but they're not 100% reliable. I don't think this is something I'd want to do without carefully reading the output. It's probably better to just define that style for changes and tell people to reformat an entire paragraph when they make a change anywhere.

Florian Haas May 23

@david_chisnall That's exactly what I'm doing now, but it's causing bad blood with infrequent contributors. They make a big change and because the rest of the specific Markdown doc they're editing uses 80-char lines, that's how they format their patch, with the best of intentions. Then I ask them to reformat to one sentence per line, which is manual and tedious and they're rightfully annoyed. I want to remove that tedium and annoyance.

@OmegaPolice

Raphael May 23

@xahteiwi @david_chisnall Hm. 🤔 So if you automatically reformat to one paragraph per line and ask people to boy-scout to one sentence per line, would that help?

David Chisnall (*Now with 50% more sarcasm!*) May 23

@OmegaPolice @xahteiwi

When I've done this manually, I've done:

1. Bulk reformat one paragraph per line.
2. Search for a dot followed by a space.
3. Replace almost all of those with dot followed by newline.

It's the almost that makes this an annoying manual process.
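
The fully automatic version of those steps can be sketched in a few lines, which also shows where the "almost" bites. The function name `naive_split` is made up for illustration, and it deliberately replaces every dot-plus-space with no exceptions.

```python
def naive_split(paragraph):
    """Collapse a hard-wrapped paragraph, then break after every '. '."""
    one_line = " ".join(paragraph.split())  # step 1: one paragraph per line
    return one_line.replace(". ", ".\n")    # steps 2 and 3, without the "almost"


# Works for simple prose:
print(naive_split("First sentence. Second\nsentence."))
# Breaks wrongly after an abbreviation:
print(naive_split("Use a tool, e.g. pandoc. Done."))
```

The second call splits after "e.g.", which is the kind of case that has to be caught by reading the output.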

Florian Haas May 23

@david_chisnall Right. And now you also want line breaks after exclamation and question marks. All unless they're enclosed in backticks. And of course not within fenced code blocks. Or tables.
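
A rough sketch of what those extra rules look like as a small state machine rather than a single regex. Everything here is an assumption for illustration: the names `split_sentences` and `reformat`, fences marked only by three backticks, and table rows recognized only by a leading `|`. It does not handle lists, headings, blockquotes, abbreviations, or quotes, so it is far from a complete solution.

```python
FENCE = "`" * 3  # fenced code block marker


def split_sentences(text):
    """Split after '.', '!' or '?' followed by a space, unless inside backticks."""
    sentences, current, in_code = [], [], False
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "`":
            in_code = not in_code
        current.append(ch)
        at_break = ch in ".!?" and i + 1 < len(text) and text[i + 1] == " "
        if at_break and not in_code:
            sentences.append("".join(current))
            current = []
            i += 1  # skip the space after the punctuation
        i += 1
    if current:
        sentences.append("".join(current))
    return sentences


def reformat(markdown):
    """Re-flow plain paragraphs to one sentence per line; copy the rest as-is."""
    result, paragraph, in_fence = [], [], False

    def flush():
        if paragraph:
            result.extend(split_sentences(" ".join(paragraph)))
            paragraph.clear()

    for line in markdown.split("\n"):
        stripped = line.strip()
        if stripped.startswith(FENCE):
            flush()
            in_fence = not in_fence
            result.append(line)
        elif in_fence or stripped == "" or stripped.startswith("|"):
            flush()
            result.append(line)  # code, blank lines and table rows stay untouched
        else:
            paragraph.append(stripped)
    flush()
    return "\n".join(result)
```

Even this sketch needs two nested states (fence and inline code), and each further Markdown construct adds another.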

@OmegaPolice

JPL

@xahteiwi @david_chisnall @OmegaPolice What about quotes?

> When the terminal says "Error! Run again" you do as it says.

Probably shouldn't be wrapped at all, should it?

Then again, that's just the same as the backticks.

Not convinced this cannot be done with regex...

Florian Haas May 23

@jpl Did you read the thread from the start?

JPL May 23

@xahteiwi I did. The more precise problem statement in the post I directly replied to, however, sounded compatible with regex.

But I won't bother you anymore, sorry.

Raphael May 24

@jpl You can't find matching quotes, parens, etc with regex in nested structures. You need to do that for this task, though.
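
As an illustration of the nesting point: classic regular expressions cannot track arbitrary nesting depth, but a stack can. The sketch below, with the hypothetical name `top_level_breaks`, reports sentence-ending punctuation only when it sits outside every bracket pair. It covers brackets only; straight quotes use the same character to open and close, so they would need extra rules that this sketch does not attempt.

```python
PAIRS = {"(": ")", "[": "]", "{": "}"}


def top_level_breaks(text):
    """Return indices of '.', '!' or '?' that lie outside any bracket pair."""
    stack, breaks = [], []
    for i, ch in enumerate(text):
        if ch in PAIRS:
            stack.append(PAIRS[ch])      # remember which closer we expect
        elif stack and ch == stack[-1]:
            stack.pop()                  # innermost pair closed
        elif not stack and ch in ".!?":
            breaks.append(i)             # punctuation at nesting depth zero
    return breaks
```

For a text like `See this (it fails. (Really!) Often.) today. Done.` only the dots after "today" and "Done" are reported.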

JPL May 24

@OmegaPolice Nested like "a 'b "c" d' e"? Or just different styles like "a 'b <c> d' e"?

Also, what rules actually do apply for quoted sentences?

> It says: "Something went wrong. Check the logs!"

Would that be wrapped after "wrong."? Probably, that quote could get really long.

Then again

> When the output says "Something went wrong. Check the logs!" you should do as it says.

Probably shouldn't be wrapped.

The rules seem underspecified, so doing it automatically seems impossible.