protip: ALWAYS use regular expression literals in JavaScript and TypeScript and any other language that supports it, instead of writing your regex out in a string. I cannot count how many critical security bugs I have found over the years from someone writing a regex like "^en\.wikipedia\.org$", which is incorrect because the \. is treated as a *string* escape sequence (an invalid one that just produces .) which then results in the regex being "^en.wikipedia.org$" which matches "enowikipedia.org".
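a minimal sketch of that exact bug in TS, in case it helps to see it run. the variable names are made up for illustration; the only thing being shown is what the string parser does to the backslashes before RegExp ever sees them.

```typescript
// Same pattern written two ways. The variable names are illustrative only.

// String form: the string parser turns "\." into ".", so the regex
// engine receives an unescaped dot that matches any character.
const fromString = new RegExp("^en\.wikipedia\.org$");

// Literal form: the backslashes reach the regex engine untouched.
const fromLiteral = /^en\.wikipedia\.org$/;

console.log(fromString.source);
console.log(fromLiteral.source);

// The string-built regex accepts a lookalike host; the literal does not.
console.log(fromString.test("enowikipedia.org"));
console.log(fromLiteral.test("enowikipedia.org"));
```

printing `.source` is a cheap way to check what pattern the engine actually received.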
this doesn't just come up in domain name allowlisting, it's eeeeverrryyywhere. the double escapes ALWAYS catch people out. use the regex literals, they'll save your ass. and if your language or toolchain or linter has a strict mode that can yell at you about bogus escape sequences in strings (or in regex literals too, for that matter) then turn it on and turn it up to 11.
imo languages should default to making invalid string escape sequences a compile error. so please go yell at (by which I mean "politely but firmly ask") your neighbourhood language standards committee to be strict on this. by default. so I don't keep finding this bug class for another 15 years.
also if you are unfortunate enough to have to PR slop code then this is a ridiculously common failure mode in LLM output and one you should look out for. the model can't reliably converge on tokens that are applicable to the context of a regex inside a string within the specific language, and instead tends to regress back to plain regex syntax or weird inconsistent mixtures of string/regex escape sequences.

C# doesn't support regex literals, but it does at least throw a compilation error on invalid escape sequences in strings, and throws an exception if a regular expression contains an invalid sequence.

also both Visual Studio and VSCode have really solid syntax highlighting for regular expression strings, both of which can infer that the string is intended to be a regex by identifying the StringSyntax attribute on a method parameter. plus C# has build-time compiled regexes.

if a language isn't going to include regex literals then C# is probably as good of a model to follow as you can get.
@gsuberland it's a shame that junyer's REKT transpiler never really made it in the real world
@gsuberland double-escaping is probably at least partly responsible for people finding Regular Expressions confusing. I mean, I get it, but this particular part is really not RegEx's fault.

@gsuberland same with Golang.

But there are some catches with different string literal types. Either way, it is a runtime error during the regex compilation OR a fundamentally wrongly formulated regex 😂 but not an invalid escape.

@gsuberland JavaScript now has RegExp.escape for when you really really need a string-based regex (e.g. generating from data).
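
A rough sketch of the generating-from-data case. `escapeForRegex`, `allowedHost` and `hostPattern` are made-up names. The fallback branch is an assumption for runtimes that don't ship `RegExp.escape` yet: it is a common hand-rolled escape, not the spec algorithm, so the two branches can produce different (but equivalent for this input) pattern text.

```typescript
// Hypothetical helper: prefer the built-in RegExp.escape when the runtime
// provides it, otherwise fall back to a hand-rolled metacharacter escape.
type EscapeFn = (s: string) => string;

function escapeForRegex(s: string): string {
  // Cast because older TypeScript lib typings may not declare RegExp.escape.
  const builtin = (RegExp as unknown as { escape?: EscapeFn }).escape;
  if (typeof builtin === "function") {
    return builtin.call(RegExp, s);
  }
  // Fallback (assumption, not the spec algorithm): backslash-escape
  // the usual regex metacharacters.
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Data-driven pattern: the host comes from a value, not from source code.
const allowedHost = "en.wikipedia.org";
const hostPattern = new RegExp("^" + escapeForRegex(allowedHost) + "$");

console.log(hostPattern.test("en.wikipedia.org"));
console.log(hostPattern.test("enowikipedia.org"));
```

Either way the dots in the data end up matching only literal dots, which is the point.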

Good advice in this thread!

@gsuberland (Are regex literals different from raw strings?)

@cr1901 in JS and TS regex literals look like this:

const r = /^[a-z0-9]+\s[0-9]+$/i;

which is identical to writing:

const r = new RegExp("^[a-z0-9]+\\s[0-9]+$", "i");

it's a separate thing to raw strings.

@cr1901 note that I had to escape the backslash in the second one, which is error prone because if I mistakenly wrote \s then it would just be the letter s that got fed into the regex.
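
a small sketch of that \s mistake, plus how raw strings fit in, since that was the original question. the names here are illustrative. `String.raw` is the closest thing JS/TS has to a raw string: it keeps the backslashes, but the result is still just a string handed to `new RegExp`, not a regex literal.

```typescript
// Mistake: "\s" inside a normal string collapses to the letter "s".
const mistaken = new RegExp("^[a-z0-9]+\s[0-9]+$", "i");

// Correct string form: the backslash is doubled.
const doubled = new RegExp("^[a-z0-9]+\\s[0-9]+$", "i");

// Raw template string: backslashes survive, but this is still a string.
const raw = new RegExp(String.raw`^[a-z0-9]+\s[0-9]+$`, "i");

// Regex literal: no string layer at all.
const literal = /^[a-z0-9]+\s[0-9]+$/i;

for (const r of [mistaken, doubled, raw, literal]) {
  console.log(r.source, r.test("abc 123"), r.test("abcs123"));
}
```

the last three all compile to the same pattern; only the first one quietly means something else.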

@cr1901 @gsuberland Specifically, in languages which support regex literals, these expressions *are not* free-text strings. The expressions can be checked at compile time, for example to ensure parens, brackets, and braces close properly.
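
A minimal TypeScript sketch of that difference, assuming a standard JS engine. `compiles` is a hypothetical helper name. A malformed pattern in a string is only discovered when `new RegExp` runs, whereas the same pattern as a literal is rejected when the file is parsed (which is why it appears here only inside a comment).

```typescript
// Hypothetical helper: reports whether a pattern string compiles at runtime.
function compiles(pattern: string): boolean {
  try {
    new RegExp(pattern);
    return true;
  } catch (e) {
    // Malformed patterns surface as a SyntaxError, but only once this line runs.
    if (e instanceof SyntaxError) return false;
    throw e;
  }
}

const balanced = compiles("(\\d+)");
const unbalanced = compiles("(\\d+"); // missing closing paren

console.log(balanced, unbalanced);

// The literal equivalent cannot even be left in the file uncommented:
// const r = /(\d+/;   // rejected at parse time, before any code executes
```

With the string form, a typo on a rarely executed code path can sit unnoticed until that path finally runs.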

Many such languages also have a more human-readable way to construct the expressions. For example, Swift’s Regex type supports writing an expression like this:

Capture(as: someInteger) { OneOrMore(.digit) }

It’s more verbose than “(?<someInteger>\d+)”, but complicated expressions get so much easier to understand.

@bob_zim @cr1901 also lets languages do really cool stuff like "hey, let's turn that regex into actual parser code at build time, so it gets all the benefits of compiler optimisations!"
@bob_zim @cr1901 (which you can also do using other syntactic approaches like attributes)
@gsuberland hey guess what LLMs that the slop peddlers insist are just the greatest at regexps do in basically every single regexp they generate.
@gsuberland oh, man, I just ran into this in HTML with <input>’s pattern attribute. Nothing breaking, but even after having just read the MDN page on it, I still made this mistake and didn’t catch it until later.

@gsuberland

Are you saying it should be...

"^en\\.wikipedia\\.org$",

... Instead?

@xdydx @gsuberland Depends on the language, which is another reason this is such a thorny problem. In many languages, yes, the expression should be as you wrote it.

In some languages (PowerShell, for example), the escape character for strings is something other than backslash, so the expression as you wrote it would be incorrect. In most of these cases, the expression wouldn’t match a real domain, which would be noticed in an allowlist entry but probably not in a blocklist entry.

To write correct expressions, you need to know implementation details like that, and most vendors hate giving those out.

@xdydx if you put it in a string, yes. but you should use regex literals like /^en\.wikipedia\.org$/ if your language supports them so you don't need to double escape. so in TS/JS:

const reg = /^en\.wikipedia\.org$/;

@gsuberland
Ah. Ok. I follow.

Thanks for the clear example!

@gsuberland yeah escaping is a touchy matter

@gsuberland Storytime!

I was once working in this Ruby on Rails shop, meh product but overall great people to work with.

One day I was reviewing the Brakeman configuration shipped with the code base and noticed that they turned on the “use double quotes everywhere” option.

Because this happened way before my time there, I brought this up in the engineering Slack channel, explaining exactly this corner case and asking what the origin story for this choice was. Mostly because Brakeman by default is smart enough to request single quotes when there’s risk of interpolation and prefers double quotes everywhere else.

The Brakeman config author came in very hot with an explanation that boiled down to “I wrote a blog post about the importance of unifying the coding style for readability and you should really go through it”.

Luckily enough, regexps weren’t really used across the code base: the most impactful place was during the deployment process when the homegrown deployment service needed to figure out what to do on different hosts based on their hostname. So, anyway, limited blast radius and all under engineering control.

Because of all of this, I decided this wasn’t a hill worth dying on. I reiterated that it was a slightly dangerous choice given the current state of the code base and moved along to more interesting and burning problems.

Fast forward three months later, many code changes and, if memory serves right, even a Ruby/Rails version upgrade. More regexps in the code base.

Things are getting wonky, the SREs are having trouble with deployments and no one understands why some core components are not behaving as expected.

Luckily we had paid support so they open a ticket with a sample of the puzzling code. The answer comes in quick and dry: “you are using double quotes, the string gets interpolated before being sent to the regexp handler” 😬.

The incident and the root cause are posted in the engineering slack channel for awareness.

I’m laughing my arse off and resurrecting the old thread from a few months back.

The Brakeman configuration author is fuming.

We change the option back to the original default.

@gsuberland I already knew this because I read about it in the Enowikipedia.

@gsuberland similarly, use r"raw strings" instead of "normal strings" when using Python, to avoid similar bugs in regexes

(to be clear, in python, a string literal that starts with r" will not interpret any escape sequences but include each character exactly as typed into the string)