while considering whether i should be looking into replacing astral's ruff for projects i'm responsible for (tl;dr probably not) i realized that code formatting is an excellent place to use a language model. not for developing it, i mean; rather, make the entire formatter be a language model. verify that the AST is semantically equivalent in the end (which is a solved problem for the restricted case we're considering here) and we're golden
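
(a minimal sketch of that equivalence check, assuming the code being formatted is python and that "semantically equivalent" means "parses to the same AST". `same_ast` is a made-up name, not part of any tool mentioned here. note that `ast` drops comments entirely, so this says nothing about whether comments survived)

```python
import ast


def same_ast(before: str, after: str) -> bool:
    """Return True if both sources parse to the same AST.

    ast.dump() leaves out line/column attributes by default, so pure
    layout changes (whitespace, line breaks, quote style) compare equal.
    """
    try:
        tree_before = ast.parse(before)
        tree_after = ast.parse(after)
    except SyntaxError:
        # formatter output that does not even parse is never equivalent
        return False
    return ast.dump(tree_before) == ast.dump(tree_after)
```

e.g. `same_ast("x = [1,2,\n 3]", "x = [1, 2, 3]")` holds, while swapping the operands in `x = 1 + 2` does not.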

while there are people who are fine with the draconian approach taken by tools like black or (to a lesser extent) rustfmt, i find these tools intolerable. they promise consistency but this consistency butchers so much code that i'd rather quarrel with contributors over formatting (and presumably lose some) than read the absolute trash these tools emit in many common cases, with the resolution for this problem being WONTFIX

anyway, semantic style transfer is one of the things CNNs are pretty good at. if i could say "PRs should be formatted 'more or less like this'" and as a result they are formatted 'more or less like this' (with the quality being somewhere in between "reformatting all of it by hand" and "let everyone pick whatever they want" but closer to the first option), with near-zero per-PR action required from all participants, that would be nice

(you should be able to train a model like this by generating random snippets of code formatted in particular ways)
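
(a toy sketch of what generating such pairs could look like, assuming python and only two layouts for nested calls. the "compact" and "exploded" styles and every name below are made up for illustration; a real generator would need to cover far more of the grammar)

```python
import ast
import random


def random_call(rng, depth=0):
    """Build a random nested call such as f(a, g(b, 1)) as a (name, args) tree."""
    name = rng.choice(["f", "g", "h"])
    args = []
    for _ in range(rng.randint(1, 3)):
        if depth < 2 and rng.random() < 0.3:
            args.append(random_call(rng, depth + 1))
        else:
            args.append(rng.choice(["a", "b", "c", "1", "2"]))
    return (name, args)


def render(node, style, indent=0):
    """Render the tree in one of two layouts: same tokens, different whitespace."""
    if isinstance(node, str):
        return node
    name, args = node
    if style == "compact":
        return name + "(" + ", ".join(render(a, style) for a in args) + ")"
    # "exploded": one argument per line, with a trailing comma
    pad = " " * (indent + 4)
    lines = [pad + render(a, style, indent + 4) + "," for a in args]
    return name + "(\n" + "\n".join(lines) + "\n" + " " * indent + ")"


def make_pair(seed):
    """Return (input, target) text for one training example."""
    rng = random.Random(seed)
    tree = random_call(rng)
    source, target = render(tree, "compact"), render(tree, "exploded")
    # sanity check: both renderings must parse to the same AST
    assert ast.dump(ast.parse(source)) == ast.dump(ast.parse(target))
    return source, target
```

since both sides are rendered from the same tree, every pair is equivalent by construction, and the check inside `make_pair` only guards against bugs in the renderer.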

@whitequark i tried to do something like this at my old job!
checking equivalency post-facto is really brittle, the longer the file the bigger the chance for a single bad sample to fuck it up. either restrict next-token sampling using the checker, or generate semantic-preserving edit actions instead of text tokens.
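
(one hedged way to read the second option, sketched for python: the model proposes whitespace-only edits instead of text, and each edit is checked on its own, so a bad sample costs one edit rather than the whole file. `apply_safe_edits` and the `(start, end, replacement)` edit format are assumptions for illustration, not what was actually built)

```python
import ast


def apply_safe_edits(source, edits):
    """Apply whitespace-only edits one by one, keeping each only if the AST is unchanged.

    `edits` is a list of (start, end, replacement) spans into the ORIGINAL
    source. Spans are assumed not to overlap.
    """
    reference = ast.dump(ast.parse(source))
    result = source
    # go right to left so offsets of the remaining edits stay valid
    for start, end, replacement in sorted(edits, reverse=True):
        if source[start:end].strip() or replacement.strip():
            continue  # touches something other than whitespace: refuse
        candidate = result[:start] + replacement + result[end:]
        try:
            unchanged = ast.dump(ast.parse(candidate)) == reference
        except SyntaxError:
            unchanged = False
        if unchanged:
            result = candidate
    return result
```

an edit that would glue `not  z` into `notz`, or change whitespace inside a string literal, alters the AST and is dropped, while the other edits in the same batch still apply.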
@G_glop yeah I imagined you'd need to split it somehow but this makes more sense
@whitequark pro tip: do not try to be smart with comments (natural language). hard code some consistent rules and make the user responsible for them after aggressive reformats. (or you'll spend months on it)