Mastodawn

Graham Sutherland / Polynomial

protip: ALWAYS use regular expression literals in JavaScript and TypeScript and any other language that supports them, instead of writing your regex out in a string. I cannot count how many critical security bugs I have found over the years from someone writing a regex like "^en\.wikipedia\.org$", which is incorrect because the \. is treated as a *string* escape sequence (an invalid one that just produces .) which then results in the regex being "^en.wikipedia.org$" which matches "enowikipedia.org".
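
A minimal sketch of the bug described above, in TypeScript. The variable names are illustrative and not taken from any real codebase; the two patterns are the ones quoted in the post.

```typescript
// In a plain string, "\." collapses to ".", so the regex engine
// receives an unescaped dot that matches any character.
const fromString = new RegExp("^en\.wikipedia\.org$");

// A regex literal keeps the backslash, so the dot is matched literally.
const fromLiteral = /^en\.wikipedia\.org$/;

const lookalike = "enowikipedia.org";
const stringMatchesLookalike = fromString.test(lookalike);       // true: "." matches "o"
const literalMatchesLookalike = fromLiteral.test(lookalike);     // false
const literalMatchesReal = fromLiteral.test("en.wikipedia.org"); // true

// Printing .source shows what the regex engine actually received.
console.log(fromString.source);  // ^en.wikipedia.org$
console.log(fromLiteral.source); // ^en\.wikipedia\.org$
```

Checking `.source` on a string-built regex is a quick way to see whether the escapes survived the string literal.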

Graham Sutherland / Polynomial 2d ago

this doesn't just come up in domain name allowlisting, it's eeeeverrryyywhere. the double escapes ALWAYS catch people out. use the regex literals, they'll save your ass. and if your language or toolchain or linter has a strict mode that can yell at you about bogus escape sequences in strings (or in regex literals too, for that matter) then turn it on and turn it up to 11.

Graham Sutherland / Polynomial 2d ago

imo languages should default to making invalid string escape sequences a compile error. so please go yell at (by which I mean "politely but firmly ask") your neighbourhood language standards committee to be strict on this. by default. so I don't keep finding this bug class for another 15 years.

Graham Sutherland / Polynomial 2d ago

also if you are unfortunate enough to have to PR slop code then this is a ridiculously common failure mode in LLM output and one you should look out for. the model can't reliably converge on tokens that are applicable to the context of a regex inside a string within the specific language, and instead tends to regress back to plain regex syntax or weird inconsistent mixtures of string/regex escape sequences.

Graham Sutherland / Polynomial 2d ago

C# doesn't support regex literals, but it does at least throw a compilation error on invalid escape sequences in strings, and throws an exception if a regular expression contains an invalid sequence.

also both Visual Studio and VSCode have really solid syntax highlighting for regular expression strings, and both can infer that a string is intended to be a regex by identifying the StringSyntax attribute on a method parameter. plus C# has build-time compiled regexes.

Graham Sutherland / Polynomial 2d ago

if a language isn't going to include regex literals then C# is probably as good of a model to follow as you can get.

Julien Goodwin 2d ago

@gsuberland it's a shame that junyer's REKT transpiler never really made it in the real world

Claudius 2d ago

@gsuberland double-escaping is probably at least partly responsible for people finding Regular Expressions confusing. I mean, I get it, but this particular part is really not RegEx's fault.

Pxl Phile 2d ago

@gsuberland same with Golang.

But there are some catches with different string literal types. Either way, it is a runtime error during the regex compilation OR a fundamentally wrongly formulated regex 😂 but not an invalid escape.

Niels Abildgaard 2d ago

@gsuberland JavaScript now has RegExp.escape for when you really really need a string-based regex (e.g. generating from data).
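
A hedged sketch of building a regex from data, in TypeScript. RegExp.escape is a recent addition, so this feature-detects it. The fallback, the `escapeForRegex` name and the host list are assumptions made for illustration, and the fallback is not guaranteed to produce the same output string as the built-in.

```typescript
type EscapeFn = (text: string) => string;

// Older TypeScript lib typings may not declare RegExp.escape, hence the cast.
const builtinEscape = (RegExp as unknown as { escape?: EscapeFn }).escape;

// Use the built-in when present; otherwise fall back to a hand-rolled
// escaper that backslash-escapes the usual regex metacharacters.
const escapeForRegex: EscapeFn =
  builtinEscape ?? ((text) => text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"));

// Hypothetical allowlist data that arrives as plain strings.
const allowedHosts = ["en.wikipedia.org", "example.com"];

// Escape each entry before joining, so "." only ever matches a literal dot.
const hostPattern = new RegExp(
  "^(?:" + allowedHosts.map(escapeForRegex).join("|") + ")$"
);

console.log(hostPattern.test("en.wikipedia.org")); // true
console.log(hostPattern.test("enowikipedia.org")); // false
```

The escaped text can differ between the built-in and the fallback, so it is safer to compare match behaviour than the generated pattern string.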

Good advice in this thread!

Richard "RichiH" Hartmann 2d ago

@gsuberland side note, allowlist and blocklist are better words

Graham Sutherland / Polynomial 2d ago

@RichiH adjusted

William D. Jones 2d ago

@gsuberland (Are regex literals different from raw strings?)

Graham Sutherland / Polynomial 2d ago

@cr1901 in JS and TS regex literals look like this:

const r = /^[a-z0-9]+\s[0-9]+$/i;

which is identical to writing:

const r = new RegExp("^[a-z0-9]+\\s[0-9]+$", "i");

it's a separate thing to raw strings.

Graham Sutherland / Polynomial 2d ago

@cr1901 note that I had to escape the backslash in the second one, which is error prone because if I mistakenly wrote \s then it would just be the letter s that got fed into the regex.
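
On the raw strings question: JavaScript and TypeScript have no raw string literal as such, but the String.raw template tag behaves much like one. A sketch comparing the four spellings, with illustrative variable names:

```typescript
// Regex literal: no string layer, so no double escaping.
const literal = /^[a-z0-9]+\s[0-9]+$/i;

// Ordinary string: the backslash has to be doubled.
const escaped = new RegExp("^[a-z0-9]+\\s[0-9]+$", "i");

// String.raw template: backslashes are passed through untouched.
const raw = new RegExp(String.raw`^[a-z0-9]+\s[0-9]+$`, "i");

// The mistake: "\s" in an ordinary string is just the letter "s".
const broken = new RegExp("^[a-z0-9]+\s[0-9]+$", "i");

console.log(literal.source === escaped.source); // true
console.log(literal.source === raw.source);     // true
console.log(broken.source);                     // ^[a-z0-9]+s[0-9]+$
console.log(literal.test("abc 123"));           // true
console.log(broken.test("abc 123"));            // false
console.log(broken.test("abcs123"));            // true
```

String.raw still performs `${...}` interpolation and cannot contain an unescaped backtick, and the result is not syntax-checked until the RegExp is constructed, so a regex literal remains the safer choice when the pattern is fixed.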

@cr1901 @gsuberland Specifically, in languages which support regex literals, these expressions *are not* free-text strings. The expressions can be checked at compile time, for example to ensure parens, brackets, and braces close properly.

Many such languages also have a more human-readable way to construct the expressions. For example, Swift’s Regex type supports writing an expression like this:

Capture(as: someInteger) { OneOrMore(.digit) }

It’s more verbose than “(?<someInteger>\d+)”, but complicated expressions get so much easier to understand.

Graham Sutherland / Polynomial 2d ago

@bob_zim @cr1901 also lets languages do really cool stuff like "hey, let's turn that regex into actual parser code at build time, so it gets all the benefits of compiler optimisations!"

Graham Sutherland / Polynomial 2d ago

@bob_zim @cr1901 (which you can also do using other syntactic approaches like attributes)

RootWyrm 🇺🇦

@gsuberland hey guess what LLMs that the slop peddlers insist are just the greatest at regexps do in basically every single regexp they generate.

Graham Sutherland / Polynomial 2d ago

@rootwyrm yes https://chaos.social/@gsuberland/116366592718228337

I’m 481 Phones 2d ago

@gsuberland oh, man, I just ran into this in HTML with <input>’s pattern attribute. Nothing breaking, but even after having just read the MDN page on it, I still made this mistake and didn’t catch it until later.

Death by Lambda 2d ago

Are you saying it should be...

"^en\\.wikipedia\\.org$",

... Instead?

@xdydx @gsuberland Depends on the language, which is another reason this is such a thorny problem. In many languages, yes, the expression should be as you wrote it.

In some languages (PowerShell, for example), the escape character for strings is something other than backslash, so the expression as you wrote it would be incorrect. In most of these cases, the expression wouldn’t match a real domain, which would be noticed in an allowlist entry but probably not in a blocklist entry.

To write correct expressions, you need to know implementation details like that, and most vendors hate giving those out.

Graham Sutherland / Polynomial 1d ago

@xdydx if you put it in a string, yes. but you should use regex literals like /^en\.wikipedia\.org$/ if your language supports them so you don't need to double escape. so in TS/JS:

const reg = /^en\.wikipedia\.org$/;

Death by Lambda 1d ago

@gsuberland
Ah. Ok. I follow.

Thanks for the clear example!

@gsuberland yeah escaping is a touchy matter

@gsuberland Storytime!

I was once working in this Ruby on Rails shop, meh product but overall great people to work with.

One day I was reviewing the Brakeman configuration shipped with the code base and noticed that they turned on the “use double quotes everywhere” option.

Because this happened way earlier than my time there, I brought this up in the engineering slack channel, explaining exactly this corner case, and asking what the origin story for this choice was. Mostly because Brakeman by default is smart enough to request single quotes when there’s risk of interpolation and preferring double quotes everywhere else.

The Brakeman config author came in very hot with an explanation that boiled down to “I wrote a blog post about the importance of unifying the coding style for readability and you should really go through it”.

Luckily enough, regexps weren’t really used across the code base: the most impactful place was during the deployment process when the homegrown deployment service needed to figure out what to do on different hosts based on their hostname. So, anyway, limited blast radius and all under engineering control.

Because of all of this, I decided this wasn’t a hill worth dying on. I reiterated it was a slightly dangerous choice in the current state of the code base and moved along to more interesting and burning problems.

Fast forward three months later, many code changes and, if memory serves right, even a Ruby/Rails version upgrade. More regexps in the code base.

Things are getting wonky, the SREs are having trouble with deployments and no one understands why some core components are not behaving as expected.

Luckily we had paid support so they open a ticket with a sample of the puzzling code. The answer comes in quick and dry: “you are using double quotes, the string gets interpolated before being sent to the regexp handler” 😬.

The incident and the root cause are posted in the engineering slack channel for awareness.

I’m laughing my arse off and resurrecting the old thread from a few months back.

The Brakeman configuration author is fuming.

We change the option back to the original default.

Howard Cohen 1d ago

@gsuberland I already knew this because I read about it in the Enowikipedia.

@gsuberland similarly, use r"raw strings" instead of "normal strings" when using Python, to avoid similar bugs in regexes

(to be clear, in python, a string literal that starts with r" will not interpret any escape sequences but include each character exactly as typed into the string)