Few strata of geekery are more obsessive than regular-expression geekery. So let’s have some fun! In https://www.tbray.org/ongoing/When/202x/2024/09/22/Unbackslashing I explain why using the usual backslash “\” for escaping is hellishly inconvenient in a current project and propose replacing it with one of «, —, “, ¶, §, or ~. This Friday, I’ll be running some polls tagged #unbackslash to let you all join in.

#software #regex

Unbackslash

ongoing by Tim Bray

@timbray Precedent: the regexp processor of the MOO programming language used (uses!) % instead of \. I think it got this from somewhere else before it, too, but I don’t know where.

Why are you planning to write your own regexp processor instead of using Go’s? It is good and linear time (although the constant factors aren’t great in comparison to some others). You can even construct an AST from your own regexp syntax and let it do the NFA/DFA conversion and simulation for you.
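
One way to read the suggestion above, as a rough sketch rather than anything Quamina does: build a `regexp/syntax` tree by hand and let Go's engine run it. The helpers `literal`, `concat`, `star`, and `compileAST` are hypothetical names made up for this illustration. As far as I know the standard `regexp` package only compiles from a string, so the sketch renders the tree with `String()` and compiles the result.

```go
package main

import (
	"fmt"
	"regexp"
	"regexp/syntax"
)

// literal builds an AST node that matches s exactly; no escaping is needed
// because the characters never pass through a pattern string we write by hand.
func literal(s string) *syntax.Regexp {
	return &syntax.Regexp{Op: syntax.OpLiteral, Rune: []rune(s)}
}

// concat builds an AST node that matches its children in sequence.
func concat(subs ...*syntax.Regexp) *syntax.Regexp {
	return &syntax.Regexp{Op: syntax.OpConcat, Sub: subs}
}

// star builds an AST node that matches zero or more repetitions of sub.
func star(sub *syntax.Regexp) *syntax.Regexp {
	return &syntax.Regexp{Op: syntax.OpStar, Sub: []*syntax.Regexp{sub}}
}

// compileAST renders the tree in Go's own syntax, anchors it at both ends,
// and hands it to the standard engine.
func compileAST(ast *syntax.Regexp) (*regexp.Regexp, error) {
	return regexp.Compile(`^(?:` + ast.String() + `)$`)
}

func main() {
	// Hypothetical pattern: a literal "(", any number of "a", a literal ")".
	ast := concat(literal("("), star(literal("a")), literal(")"))
	re, err := compileAST(ast)
	if err != nil {
		panic(err)
	}
	fmt.Println(re.MatchString("(aaa)"), re.MatchString("(b)"))
}
```

A front end with its own escape character would only need to produce such a tree; the backslashes then exist solely in the rendered string that no human has to read.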

@dpk As to using Go's regex: My NFA representation is idiosyncratic, hyper-optimized for raw matching speed, and does much less than Go's. The only result I want is a boolean matches/doesn't-match, so the Go machinery has loads and loads of stuff I don't need.

@timbray
I love that sed lets you pick the separator for its regexps in a very natural way.

Could you figure out a neat syntax for similarly allowing the user to specify the escape character on a per-expression basis?

@timbray Looking forward to the poll. Great question. Love the suggestions. I know which one I'm gonna vote for.

PS <see attached> 😂🤦🏻‍♂️ (nothing wrong there, just some font “fun” that's always frustrating).

@timbray One improving factor is an idea from Steve Bourne that I vaguely remember implementing in qed. It's a great simplification for escapes: Don't do the doubling. If you see a backslash, count how many there are, and remove only one for this layer of escape. It makes a huge difference in the wieldliness (to coin a term) of escape characters.
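
A minimal sketch of the rule as described above, based only on that description and not on qed's actual source: each layer finds every run of backslashes and drops exactly one from it. The function name `peelLayer` is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// peelLayer applies one layer of un-escaping: every run of n backslashes
// becomes a run of n-1 backslashes. Everything else is copied unchanged.
func peelLayer(s string) string {
	var b strings.Builder
	i := 0
	for i < len(s) {
		if s[i] != '\\' {
			b.WriteByte(s[i])
			i++
			continue
		}
		// Find the end of this run of backslashes.
		j := i
		for j < len(s) && s[j] == '\\' {
			j++
		}
		// Keep all but one of them.
		b.WriteString(s[i+1 : j])
		i = j
	}
	return b.String()
}

func main() {
	s := `a\\\(b` // three backslashes before the paren
	for strings.Contains(s, `\`) {
		fmt.Println(s)
		s = peelLayer(s)
	}
	fmt.Println(s)
}
```

Under this reading, the number of backslashes you type grows by one per layer instead of doubling with each layer.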

@robpike @timbray Wow, “qed” !

I never hear that mentioned.

As a small child, that was my first editor.

@robpike @timbray

"Wieldliness", that's well coined.

@timbray
1) At first I thought your “regular-depression” was a joke. Now thinking it was an auto-correctism.

2) instead of using a different escape character, wouldn’t it be better to protect the whole regex by sending it as pre-compiled binary? E.g. via a pre-processor mechanism.

@timbray The left guillemet is not hard to type on Mac (or iOS) keyboards: Option-\. It's thus even mnemonically tied to backslash.

But semantically I see « and I want to see a corresponding » (Shift-Option-\). So my spitball idea would be using left and right guillemets to "quote" the escaped special character:

«(»[^«n»«r»)]*«)»

@gruber I have to say that looks nice. Hmm, using a pair of enclosing markers suggests they could contain more than one character… So you could also have «P{Lu}» rather than «P»{Lu}. Not sure how I feel about that.

@timbray I thought of that too, but didn't want to send you down that rabbit hole. But it's intriguing.

The problem I'm thinking about is that sometimes you want to escape a literal character: «(» would mean a literal open paren, but «n» would mean a newline. There aren't many non-literal escapes, though. So, another spitball (could be a truly horrible idea?): what if you keep backslash for non-literal escapes like \n and \r, but use «…» to mean “quote these characters literally”?

@timbray So you could type this to get three consecutive literal open parentheses:

«(((»

or

«\\»

to get 2 literal backslashes. Both of those are zillions more legible than `\(\(\(` or the infamous matchsticks of `\\\\`.
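
To make the “quote these characters literally” spitball concrete, here is a hedged sketch that rewrites «…» spans into ordinary Go regexp syntax with `regexp.QuoteMeta` and leaves everything else, including backslash escapes like \n, untouched. The names `unquoteGuillemets` and `matchGuillemets` are hypothetical, and the sketch assumes there is no way to put a literal » inside a quoted span.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
	"strings"
)

// unquoteGuillemets replaces each «…» span with its contents escaped for
// Go's regexp syntax. Text outside the spans is passed through as-is.
func unquoteGuillemets(pattern string) (string, error) {
	var out strings.Builder
	for {
		open := strings.Index(pattern, "«")
		if open < 0 {
			out.WriteString(pattern)
			return out.String(), nil
		}
		out.WriteString(pattern[:open])
		rest := pattern[open+len("«"):]
		end := strings.Index(rest, "»")
		if end < 0 {
			return "", errors.New("unterminated « in pattern")
		}
		// Everything between the guillemets is taken literally.
		out.WriteString(regexp.QuoteMeta(rest[:end]))
		pattern = rest[end+len("»"):]
	}
}

// matchGuillemets translates the pattern, compiles it, and reports whether
// it matches the input.
func matchGuillemets(pattern, input string) (bool, error) {
	translated, err := unquoteGuillemets(pattern)
	if err != nil {
		return false, err
	}
	re, err := regexp.Compile(translated)
	if err != nil {
		return false, err
	}
	return re.MatchString(input), nil
}

func main() {
	for _, p := range []string{`«(((»`, `«\\»`, `«(»[^\n\r)]*«)»`} {
		translated, err := unquoteGuillemets(p)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s  becomes  %s\n", p, translated)
	}
}
```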

@gruber All this is compelling, but my library’s users are developers not civilians, and being able to just say “put an X wherever you used to put a \” is attractive. Also I'm kinda over inventing container syntax. Having said that, your idea is visually attractive.

@timbray Trust me, I totally, 100 percent see the appeal of “put an X wherever you used to put a \”.

But my mind starts turning…

@timbray You do you, but I would strongly recommend not using a different regexp-syntax.

The fact that vim does that is frustrating. I don't use it often enough to have its quirks memorized, so when I *do* want something specific, I pull my hair out.

Are your users *really* best served by having to look up bespoke syntax to use Quamina? I can guarantee you that I'd much rather have leaning toothpicks than deal with such quirks.

@Merovius @timbray could you support both backslash and the new character, so that people have the choice?

@timbray Would ¬ work? Easy to access (on a UK keyboard*) and it kinda has semantics - "¬{" means "not (a normal) curly bracket".

* I've not checked American and other keyboards. Given their inferior placing of some other common programming symbols then it may not be available. I'd suggest standardising on UK keyboard to get all the benefits when programming 😉

@timbray Some thoughts:
* Of your choices, `~` seems most sane
* `%` is decent
* In `sed`, I tend to use bang: `s!foo/bar!foo/baz!`
* …which supports @jannem's point about letting you pick
* For the particular case of (presumably table-driven) Go unit tests, you could just write a helper that replaces `~` with `\\\\\\\\` and wrap the test inputs in calls to that function.
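
The test-helper idea in the last bullet might look roughly like this. It is a sketch under assumptions: the helper name `tilde` and its `layers` parameter are made up here, with each extra layer of string escaping doubling the backslash count, and the patterns are assumed never to need a literal `~`.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// tilde replaces every ~ in s with 2^layers backslashes. Use layers = 0 when
// the result goes straight to the regexp engine, and add one for each layer
// of string escaping the pattern still has to pass through.
func tilde(s string, layers uint) string {
	return strings.ReplaceAll(s, "~", strings.Repeat(`\`, 1<<layers))
}

func main() {
	// A small table-driven check, compiled directly with Go's regexp package.
	tests := []struct {
		pattern string
		input   string
		want    bool
	}{
		{tilde(`~d+~.~d+`, 0), "3.14", true},
		{tilde(`~d+~.~d+`, 0), "3x14", false},
		{tilde(`~(a~)`, 0), "(a)", true},
	}
	for _, tc := range tests {
		got := regexp.MustCompile(tc.pattern).MatchString(tc.input)
		fmt.Printf("%q on %q: got %v, want %v\n", tc.pattern, tc.input, got, tc.want)
	}
}
```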

ps. I stumbled across https://symbl.cc/en/unicode-table and found it lovely for browsing

@timbray @jannem Completely unserious suggestion: U+0302 — Combining Circumflex Accent — a little roof to protect you from the elements/parser 😂

@timbray here is a totally different idea to address this: there are programmer’s fonts like #FiraCode that use ligatures to render multi-character sequences as more readable glyphs, e.g. “<=” as “≤” or “->” as an arrow.

Maybe such a font could also render runs of backslashes with an overlay for their count. I.e., four backslashes would be shown as “\\⁴\\”, and eight would be shown as “\\\\⁸\\\\” (in very crude approximation) #ProgrammersFonts #Ligatures #Backslashes #FontDesign

@timbray you would still have to type four/eight backslashes for double/triple escaping, but you wouldn’t have to count them. The font could also highlight runs that have a length that is a power of two, so that you immediately see that you have reached that magic number

@timbray this isn’t meant to replace your proposal; I think I would want both solutions