Mastodawn

Few strata of geekery are more obsessive than regular-expression geekery. So let’s have some fun! In https://www.tbray.org/ongoing/When/202x/2024/09/22/Unbackslashing I explain why using the usual backslash “\” for escaping is hellishly inconvenient in a current project and propose replacing it with one of «, —, “, ¶, §, or ~. This Friday, I’ll be running some polls tagged #unbackslash to let you all join in.

#software #regex

Unbackslash

ongoing by Tim Bray

Daphne Preston-Kendal Sep 25, 2024

@timbray Precedent: the regexp processor of the MOO programming language used (uses!) % instead of \. I think it got this from somewhere else before it, too, but I don’t know where.

Why are you planning to write your own regexp processor instead of using Go’s? It is good and linear time (although the constant factors aren’t great in comparison to some others). You can even construct an AST from your own regexp syntax and let it do the NFA/DFA conversion and simulation for you

Tim Bray Sep 25, 2024

@dpk As to using Go's regex: My NFA representation is idiosyncratic, hyper-optimized for raw matching speed, and does much less than Go's. The only result I want is a boolean matches/doesn't-match, so the Go machinery has loads and loads of stuff I don't need.

Janne Moren Sep 25, 2024

@timbray
I love that sed lets you pick the separator for its regexps in a very natural way.

Could you figure out a neat syntax for similarly allowing the user to specify the escape character on a per-expression basis?
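One possible shape for a per-expression escape character, sketched in Go. Everything here is an assumption made for illustration: the convention that the first character of the expression names the escape character (borrowed from how sed takes its separator), and the function name `toBackslash`. Nothing in it comes from Quamina. The sketch only translates the expression into ordinary backslash syntax.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
	"strings"
	"unicode/utf8"
)

// toBackslash is a hypothetical translator. The first rune of expr is taken
// as the escape character for the rest of the expression, and the result is
// the same pattern written with conventional backslash escapes.
func toBackslash(expr string) (string, error) {
	esc, size := utf8.DecodeRuneInString(expr)
	if size == 0 {
		return "", errors.New("empty expression")
	}
	var b strings.Builder
	escaped := false
	for _, r := range expr[size:] {
		switch {
		case escaped && r == esc:
			// A doubled escape character stands for itself, literally.
			b.WriteString(regexp.QuoteMeta(string(r)))
			escaped = false
		case escaped:
			// r was preceded by the escape character.
			b.WriteRune('\\')
			b.WriteRune(r)
			escaped = false
		case r == esc:
			escaped = true
		case r == '\\':
			// Backslash is an ordinary character in this syntax.
			b.WriteString(`\\`)
		default:
			b.WriteRune(r)
		}
	}
	if escaped {
		return "", errors.New("expression ends with a dangling escape character")
	}
	return b.String(), nil
}

func main() {
	// "~" is chosen as the escape character by putting it first.
	out, err := toBackslash(`~a~.b~~c\d`)
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // prints a\.b~c\\d
}
```

The cost of this convention is that every expression carries one extra leading character, even when the author is happy with the default.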

Leon Cowle Sep 25, 2024

@timbray Looking forward to the poll. Great question. Love the suggestions. I know which one I'm gonna vote for.

PS <see attached> 😂🤦🏻‍♂️ (nothing wrong there, just some font “fun” that's always frustrating).

Tim Bray Sep 25, 2024

@leoncowle You are right. *sighs*

rob pike Sep 25, 2024

@timbray One improving factor is an idea from Steve Bourne that I vaguely remember implementing in qed. It's a great simplification for escapes: Don't do the doubling. If you see a backslash, count how many there are, and remove only one for this layer of escape. It makes a huge difference in the wieldliness (to coin a term) of escape characters.
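A minimal Go sketch of the rule as described above: each layer of processing finds every run of backslashes and removes exactly one from it, rather than halving the run. This is one reading of the description, not qed's actual code, and the name `stripOneLayer` is made up for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// stripOneLayer removes exactly one backslash from every run of consecutive
// backslashes in s and leaves all other bytes untouched.
func stripOneLayer(s string) string {
	var b strings.Builder
	for i := 0; i < len(s); {
		if s[i] != '\\' {
			b.WriteByte(s[i])
			i++
			continue
		}
		// Measure the run of backslashes starting at i.
		j := i
		for j < len(s) && s[j] == '\\' {
			j++
		}
		// Emit the run minus one backslash.
		b.WriteString(s[i+1 : j])
		i = j
	}
	return b.String()
}

func main() {
	in := `a\\\.b` // three backslashes: two outer layers plus the regexp escape
	fmt.Println(stripOneLayer(in))                // prints a\\.b
	fmt.Println(stripOneLayer(stripOneLayer(in))) // prints a\.b
}
```

Under this rule, getting an escape through n layers takes n backslashes instead of 2ⁿ⁻¹, which is where the gain in "wieldliness" comes from.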

Marsh Ray Sep 25, 2024

@robpike @timbray Wow, “qed” !

I never hear that mentioned.

As a small child, that was my first editor.

the roamer Sep 26, 2024

@robpike @timbray

"Wieldliness", that's well coined.

Cameron Hayne Sep 25, 2024

@timbray
1) At first I thought your “regular-depression” was a joke. Now thinking it was an auto-correctism.

2) instead of using a different escape character, wouldn’t it be better to protect the whole regex by sending it as pre-compiled binary? E.g. via a pre-processor mechanism.

Tim Bray Sep 25, 2024

@cameronhayne Gack. Fixed, thks.

John Gruber Sep 25, 2024

@timbray The left guillemet is not hard to type on Mac (or iOS) keyboards: Option-\. It's thus even mnemonically tied to backslash.

But semantically I see « and I want to see a corresponding » (Shift-Option-\). So my spitball idea would be using left and right guillemets to "quote" the escaped special character:

«(»[^«n»«r»)]*«)»

Tim Bray Sep 25, 2024

@gruber I have to say that looks nice. Hmm, using a pair of enclosing markers suggests they could contain more than one character… So you could also have «P{Lu}» rather than «P»{Lu}. Not sure how I feel about that.

John Gruber Sep 25, 2024

@timbray I thought of that too, but didn't want to send you down that rabbit hole. But it's intriguing.

The problem I'm thinking about is that sometimes you want to escape a literal character: «(» would mean a literal open paren, but «n» would mean a newline. There aren't many non-literal escapes, though. So, another spitball (could be a truly horrible idea?): what if you keep backslash for non-literal escapes like \n and \r, but use «…» to mean “quote these characters literally”?

John Gruber Sep 25, 2024

@timbray So you could type this to get three consecutive literal open parentheses:

«(((»

or

«\\»

to get 2 literal backslashes. Both of those are zillions more legible than `\(\(\(` or the infamous matchsticks of `\\\\`.
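The “quote these characters literally” variant can be sketched as a small preprocessor in Go. This is only an illustration under assumptions: the function name `expandGuillemets` is hypothetical, spans do not nest, and there is no way to write a literal guillemet. It leans on the standard library's `regexp.QuoteMeta` to escape whatever sits between « and », and passes everything else (including \n and \r) through unchanged.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
	"strings"
)

// expandGuillemets rewrites every «…» span in pattern into its
// backslash-escaped equivalent. Text outside the spans is copied as-is.
// An opening « without a closing » is reported as an error.
func expandGuillemets(pattern string) (string, error) {
	var b strings.Builder
	for {
		open := strings.Index(pattern, "«")
		if open < 0 {
			b.WriteString(pattern)
			return b.String(), nil
		}
		b.WriteString(pattern[:open])
		rest := pattern[open+len("«"):]
		end := strings.Index(rest, "»")
		if end < 0 {
			return "", errors.New("unclosed « in pattern")
		}
		// Everything inside the span is taken literally.
		b.WriteString(regexp.QuoteMeta(rest[:end]))
		pattern = rest[end+len("»"):]
	}
}

func main() {
	std, err := expandGuillemets(`«(((»[a-z]+\n«\\»`)
	if err != nil {
		panic(err)
	}
	fmt.Println(std) // prints \(\(\([a-z]+\n\\\\
}
```

The output is an ordinary backslash-escaped pattern, so it could be handed to any existing regexp engine afterwards.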

Tim Bray Sep 25, 2024

@gruber All this is compelling, but my library’s users are developers not civilians, and being able to just say “put an X wherever you used to put a \” is attractive. Also I'm kinda over inventing container syntax. Having said that, your idea is visually attractive.

John Gruber Sep 26, 2024

@timbray Trust me, I totally, 100 percent see the appeal of “put an X wherever you used to put a \”.

But my mind starts turning…

Merovius Sep 26, 2024

@timbray You do you, but I would strongly recommend not using a different regexp-syntax.

The fact that vim does that is frustrating. I don't use it often enough to have its quirks memorized, so when I *do* want something specific, I pull my hair out.

Are your users *really* best served by having to look up bespoke syntax to use Quamina? I can guarantee you that I'd much rather have leaning toothpicks, than dealing with such quirks.

oefe Sep 26, 2024

@Merovius @timbray could you support both, backslash and the new character, so that people have the choice?

IBBoard Sep 26, 2024

@timbray Would ¬ work? Easy to access (on a UK keyboard*) and it kinda has semantics - "¬{" means "not (a normal) curly bracket".

* I've not checked American and other keyboards. Given their inferior placing of some other common programming symbols then it may not be available. I'd suggest standardising on UK keyboard to get all the benefits when programming 😉

zellyn Sep 26, 2024

@timbray Some thoughts:
* Of your choices, `~` seems most sane
* `%` is decent
* In `sed`, I tend to use bang: `s!foo/bar!foo/baz!`
* …which supports @jannem's point about letting you pick
* For the particular case of (presumably table-driven) Go unit tests, you could just write a helper that replaces `~` with `\\\\\\\\` and wrap the test inputs in calls to that function.
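A helper along the lines of the last bullet might look like this in Go. The name `tildeToBackslash` and the `layers` parameter are assumptions for the sketch; how many backslashes each `~` should become depends on how many layers of escaping sit between the test source and the matcher.

```go
package main

import (
	"fmt"
	"strings"
)

// tildeToBackslash is a hypothetical test helper: it replaces each ~ in s
// with `layers` backslashes, so test inputs can be written with a single
// visible escape character. It offers no way to write a literal tilde.
func tildeToBackslash(s string, layers int) string {
	return strings.ReplaceAll(s, "~", strings.Repeat(`\`, layers))
}

func main() {
	// Table-driven inputs written with ~ instead of runs of backslashes.
	cases := []struct {
		in     string
		layers int
	}{
		{`a~.b`, 1},
		{`a~.b`, 2},
		{`~d+~.~d+`, 4},
	}
	for _, c := range cases {
		fmt.Printf("%-10s x%d -> %s\n", c.in, c.layers, tildeToBackslash(c.in, c.layers))
	}
}
```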

ps. I stumbled across https://symbl.cc/en/unicode-table and found it lovely for browsing

zellyn Sep 26, 2024

@timbray @jannem Completely unserious suggestion: U+0302 — Combining Circumflex Accent — a little roof to protect you from the elements/parser 😂

oefe Sep 26, 2024

@timbray here is a totally different idea to address this: there are programmer’s fonts like #FiraCode that use ligatures to render multi-character sequences as more readable glyphs, e.g. “<=” as “≤” or “->” as an arrow.

Maybe such a font could also render runs of backslashes with an overlay for their count. I.e., four backslashes would be shown as “\\⁴\\”, and eight would be shown as “\\\\⁸\\\\” (in very crude approximation) #ProgrammersFonts #Ligatures #Backslashes #FontDesign

oefe Sep 26, 2024

@timbray you would still have to type four/eight backslashes for double/triple escaping, but you wouldn’t have to count them. The font could also highlight runs that have a length that is a power of two, so that you immediately see that you have reached that magic number

oefe Sep 26, 2024

@timbray this isn’t meant to replace your proposal — I think I would want both solutions