Recurring #TechWriting issue that I still haven't found a good solution for:

Is anyone aware of a decently reliable automation for reformatting #Markdown text that previously used line length limits of 80 characters and forced line wraps, to one sentence per line?

Must preserve all Markdown formatting including tables and fenced code blocks.

(If you think this is trivial and can be solved with a sprinkling of regex — nope.)

Boosts appreciated!

@xahteiwi Best guess: Convert to HTML and back.

Identifying sentences remains a hard problem, but the rest should be mostly mechanical. I'd start with Pandoc, hoping that it can be configured to create Markdown in the required format.

@OmegaPolice No need for the HTML conversion if using pandoc — you can use `--wrap=none` to remove line breaks, even if you're staying within Markdown. However, sadly that doesn't solve the problem at all, because now you have lengthy paragraphs of multiple sentences.
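
A minimal sketch of that pandoc step, driven from Python. It assumes pandoc is installed and on the PATH; the helper names `pandoc_unwrap_command` and `unwrap` are made up for illustration. Keep in mind that a Markdown-to-Markdown round trip through pandoc may also normalize other syntax in the file, so the result is worth diffing before committing.

```python
import shutil
import subprocess


def pandoc_unwrap_command(path: str) -> list[str]:
    """Build a pandoc call that reads Markdown and writes it back unwrapped."""
    return ["pandoc", "--from=markdown", "--to=markdown", "--wrap=none", path]


def unwrap(path: str) -> str:
    """Run pandoc and return the Markdown with hard line wraps removed."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc is not installed or not on PATH")
    result = subprocess.run(
        pandoc_unwrap_command(path),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

This only gets you to one paragraph per line; splitting those paragraphs into sentences is still a separate problem.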

@xahteiwi Ah, nice! 👍

That's probably the best you can get without throwing some serious NLP at it. Curious to see if I'm missing something!

@OmegaPolice I'm slowly coming to the conclusion that this style decision is a one-way function: if you write your original documents as one sentence per line, it is trivial to subsequently impose a line length limit. But once you have that limit, then unless you also mandate *sentences* of, say, <80 characters (not sure if that's ever useful; I doubt it), it's quite painful to go to one sentence per line.
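
To illustrate the easy direction: a rough sketch, using only Python's standard `textwrap` module, that takes a prose paragraph written one sentence per line and imposes a line length limit. The function name `wrap_paragraph` is hypothetical, and the sketch assumes plain prose only (no tables, lists, or code blocks).

```python
import textwrap


def wrap_paragraph(sentence_lines: list[str], width: int = 80) -> str:
    """Join one-sentence-per-line prose and hard-wrap it at `width` columns."""
    joined = " ".join(line.strip() for line in sentence_lines if line.strip())
    # Never split inside a word or at a hyphen; only break at whitespace.
    return textwrap.fill(
        joined, width=width, break_long_words=False, break_on_hyphens=False
    )


paragraph = [
    "This is the first sentence of a fairly long paragraph.",
    "Here is a second one that pushes the text past the limit.",
]
print(wrap_paragraph(paragraph, width=40))
```

After wrapping, the sentence boundaries are no longer recorded anywhere in the line structure, which is why the reverse direction needs guesswork.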

@xahteiwi @OmegaPolice

Boosted because I don't have an answer. TeX has some pretty good heuristics for working out when a . is a full stop that ends a sentence, but they're not 100% reliable. I don't think this is something I'd want to do without carefully reading the output. It's probably better to just define that style for changes and tell people to reformat an entire paragraph when they make a change anywhere.

@david_chisnall That's exactly what I'm doing now, but it's causing bad blood with infrequent contributors. They make a big change and because the rest of the specific Markdown doc they're editing uses 80-char lines, that's how they format their patch, with the best of intentions. Then I ask them to reformat to one sentence per line, which is manual and tedious and they're rightfully annoyed. I want to remove that tedium and annoyance.

@OmegaPolice

@xahteiwi @david_chisnall Hm. 🤔 So if you automatically reformat to one paragraph per line and ask people to boy-scout it to one sentence per line as they touch it, would that help?

@OmegaPolice @xahteiwi

When I've done this manually, I've done:

  • Bulk reformat one paragraph per line.
  • Search for a dot followed by a space.
  • Replace almost all of those with dot followed by newline.

It's the almost that makes this an annoying manual process.
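
The three manual steps above can be sketched in a few lines of Python, which also shows where the "almost" comes from. This is the naive version on purpose; `naive_split` is an illustrative name, not an existing tool.

```python
def naive_split(paragraph: str) -> str:
    """Join a hard-wrapped paragraph, then break after every '. '."""
    # Step 1: collapse the paragraph onto a single line.
    one_line = " ".join(paragraph.split())
    # Steps 2 and 3: replace every dot-plus-space with dot-plus-newline.
    return one_line.replace(". ", ".\n")


print(naive_split("See e.g. the\nfirst example. It works."))
# See e.g.
# the first example.
# It works.
```

The abbreviation "e.g." gets a line break it should not have, and that is exactly the kind of case that needs a human decision.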

@david_chisnall Right. And now you also want line breaks after exclamation and question marks. All unless they're enclosed in backticks. And of course not within fenced code blocks. Or tables.
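
A hedged sketch of what honoring those extra rules might look like. It is line-oriented and makes several simplifying assumptions: fenced blocks are opened and closed by any line starting with three backticks or tildes, table rows start with a pipe character, and headings start with `#`. Lists, blockquotes, indented code, and abbreviations are not handled. All names here are made up for illustration.

```python
import re

FENCE = re.compile(r"^\s*(```|~~~)")
PASSTHROUGH = re.compile(r"^\s*(\||#{1,6}\s)")  # table rows and headings
INLINE_CODE = re.compile(r"`[^`]*`")
SENTENCE_END = re.compile(r"(?<=[.!?]) +")


def split_sentences(text: str) -> list[str]:
    """Split after . ! ? plus space, leaving inline code spans intact."""
    spans: list[str] = []

    def stash(match: re.Match) -> str:
        # Swap each code span for a placeholder that contains no punctuation.
        spans.append(match.group(0))
        return f"\x00{len(spans) - 1}\x00"

    masked = INLINE_CODE.sub(stash, text)
    parts = SENTENCE_END.split(masked)
    return [
        re.sub(r"\x00(\d+)\x00", lambda m: spans[int(m.group(1))], part)
        for part in parts
    ]


def reformat(markdown: str) -> str:
    """Rewrite plain paragraphs as one sentence per line; copy the rest."""
    out: list[str] = []
    paragraph: list[str] = []
    in_fence = False

    def flush() -> None:
        if paragraph:
            out.extend(split_sentences(" ".join(paragraph)))
            paragraph.clear()

    for line in markdown.splitlines():
        if FENCE.match(line):
            flush()
            in_fence = not in_fence
            out.append(line)
        elif in_fence or PASSTHROUGH.match(line) or not line.strip():
            flush()
            out.append(line)
        else:
            paragraph.append(line.strip())
    flush()
    return "\n".join(out)
```

Even with these guards, the sentence detection itself is still the naive punctuation rule, so the output would need a careful read before it is committed.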

@OmegaPolice

@xahteiwi @david_chisnall Nah, not worth it. Fix-when-touched is probably good enough -- the driving motivator probably is reviewability? Unless you care enough to sit down and do it manually, anyway. 😉

@xahteiwi @david_chisnall @OmegaPolice What about quotes?

> When the terminal says "Error! Run again" you do as it says.

Probably shouldn't be wrapped at all, should it?

Then again, that's just the same as the backticks.

Not convinced this cannot be done with regex...

@jpl Did you read the thread from the start?

@xahteiwi I did. The more precise problem statement in the post I directly replied to, however, sounded compatible with regex.

But I won't bother you anymore, sorry.

@jpl You can't find matching quotes, parens, etc with regex in nested structures. You need to do that for this task, though.
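
As a rough illustration of why this calls for a small scanner rather than a single pattern, here is a sketch that tracks open brackets and double quotes on a stack and only splits when nothing is open. It assumes straight double quotes toggle (so same-style nested quotes are ambiguous and not supported), ignores apostrophes entirely, and will stop splitting after an unbalanced opener. The function name is hypothetical.

```python
PAIRS = {"(": ")", "[": "]", "{": "}"}


def split_outside_nesting(text: str) -> list[str]:
    """Split after . ! ? only when not inside brackets or double quotes."""
    sentences: list[str] = []
    stack: list[str] = []  # closing characters we are still waiting for
    start = 0
    for i, ch in enumerate(text):
        if stack and ch == stack[-1]:
            stack.pop()  # the innermost open structure closes here
        elif ch in PAIRS:
            stack.append(PAIRS[ch])
        elif ch == '"':
            stack.append('"')  # an opening quote; the same character closes it
        elif ch in ".!?" and not stack and text[i + 1 : i + 2] == " ":
            sentences.append(text[start : i + 1])
            start = i + 2
    tail = text[start:]
    if tail:
        sentences.append(tail)
    return sentences


text = 'When it says "Something went wrong. Check the logs!" you comply. Next one.'
for sentence in split_outside_nesting(text):
    print(sentence)
```

Whether a long quoted passage should stay on one line is a style decision this sketch simply hard-codes; it does not settle the question.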

@OmegaPolice Nested like "a 'b "c" d' e"? Or just different styles like "a 'b <c> d' e"?

Also, what rules actually do apply for quoted sentences?

> It says: "Something went wrong. Check the logs!"

Would that be wrapped after "wrong."? Probably, since that quote could get really long.

Then again:

> When the output says "Something went wrong. Check the logs!" you should do as it says.

Probably shouldn't be wrapped.

The rules seem underspecified, so doing it automatically seems impossible.

@xahteiwi @david_chisnall @OmegaPolice Why one sentence per line? I've never heard of this before. It's not part of the basic Markdown convention.
Writing one sentence per line | Derek Sivers

@xahteiwi @david_chisnall @OmegaPolice I think that works much better as a tool than as a requirement.

@grvsmth I genuinely do not understand what you mean by those words, but attacking the premise isn't helpful in this discussion.

@david_chisnall @OmegaPolice

@xahteiwi @david_chisnall @OmegaPolice It wasn't an attack, but since you perceived it as one, you're probably better off getting help from someone else.

@grvsmth "Attacking the premise" is a technical term for expressing disagreement not with the endpoint, but with the starting point of someone's reasoning.

@david_chisnall @OmegaPolice

@grvsmth @xahteiwi @OmegaPolice

It was a LaTeX convention and it has been carried over elsewhere.

The biggest benefit is that most revision control systems (from RCS to git) work really well with line diffs. If you modify a sentence and you have one sentence per line, the diff will show changes in one sentence. You don’t see sentences that haven’t been changed in the diff. If you don’t do this, the diff will contain irrelevant bits of text (and a lot more if you rewrap to an 80-column boundary). This makes review much easier.
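
A small sketch of that effect using Python's standard `difflib` and `textwrap` modules: the same one-sentence edit is diffed once with one sentence per line and once after rewrapping to a fixed width. The sample text, the width of 40, and the helper name `changed_lines` are arbitrary choices for illustration.

```python
import difflib
import textwrap


def changed_lines(before: list[str], after: list[str]) -> int:
    """Count added and removed lines in a unified diff without context."""
    diff = list(difflib.unified_diff(before, after, lineterm="", n=0))
    body = diff[2:]  # skip the two file header lines
    return sum(1 for line in body if line.startswith(("+", "-")))


sentences = [
    "Line-based diffs work best when each line is one sentence.",
    "Reviewers then see only the sentence that changed.",
    "Everything else stays out of the diff.",
]
edited = sentences.copy()
edited[0] = (
    "Line-based diffs from tools such as git work best "
    "when each line holds exactly one sentence."
)

# One sentence per line: one line removed, one line added.
per_sentence = changed_lines(sentences, edited)

# Hard-wrapped: the edit shifts every following line break.
wrapped = changed_lines(
    textwrap.wrap(" ".join(sentences), width=40),
    textwrap.wrap(" ".join(edited), width=40),
)

print(per_sentence, wrapped)
```

With this sample the wrapped variant reports more changed lines than the one-sentence-per-line variant, because untouched sentences get reflowed along with the edited one.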

This, in turn, makes merge conflicts simpler. If two people edit different sentences, they either won’t have conflicts or will have conflicts that are trivial to merge.

I’ve never had the problem of moving to this because it’s such an obvious benefit that I’ve always started projects using it. The only time I’ve had to reformat things is when other people contribute and don’t follow the style, so the manual thing is not too bad.

@david_chisnall

Clearly you're cleverer than me, because to me the benefit wasn't as immediately obvious. Which got me into this mess. 😁

But I agree completely with the rest of what you wrote.

And in addition, it also helps detect run-on sentences.

@grvsmth @OmegaPolice

@xahteiwi @david_chisnall @grvsmth All of that.

I actually also like to line-break on some commas and semi-colons to keep lines short-ish. Which in English probably means I have overly long sentences; in German it's more legit.

@OmegaPolice I'd argue that German text, too, benefits from fewer run-on sentences. 🙂

@david_chisnall @grvsmth

@xahteiwi @grvsmth @OmegaPolice

The benefit is obvious as soon as you work on a project that collaborates and doesn't do it.

I was fortunate that this happened to me with LaTeX before Markdown was more than a niche thing (so many merge conflicts in a paper six hours before the submission deadline...).

@xahteiwi is this one of those elusive cases in which using an LLM actually is the most sensible solution?
@schratze I do not know. Others (including @OmegaPolice in this thread) have pointed out that solving this problem is likely to involve natural-language processing of some sort.
@xahteiwi So this is some kind of "semantic line breaks" (https://sembr.org/). With that term, some tools show up in a search, but I haven't tried any of them (yet). Definitely interested in that topic.

@xahteiwi emacs's markdown-mode partially renders markdown so pesky in-sentence markups are gone... then setting a high enough fill-column unbreaks the paragraph into a single wrapped line. then select the line, replace ". " with ". LF" and you would have the desired formatting. record a macro of the above and repeat over the paragraphs until done. tried it in a small doc and it seemed to work mostly... code, lists, quote blocks are all skipped properly... because the markdown-mode doesn't render tables, those need to be skipped manually. of course if you don't use emacs, you now have two problems :) (apologies to jwz).
@xahteiwi mark-end-of-sentence lets you hop by sentences so it is safer than blindly replacing ". ".
@kaveman Right; your suggestion is essentially a variant of https://hachyderm.io/@OmegaPolice/114557428773845306 — see my reply there. :)