C# doesn't support regex literals, but it does at least throw a compilation error on invalid escape sequences in strings, and throws an exception if a regular expression contains an invalid sequence.
also both Visual Studio and VSCode have really solid syntax highlighting for regular expression strings, both of which can infer that the string is intended to be a regex by identifying the StringSyntax attribute on a method parameter. plus it has build-time compiled regexes.
@gsuberland same with Golang.
But there are some catches with different string literal types. Either way, it is a runtime error during the regex compilation OR a fundamentally wrongly formulated regex, but not an invalid escape.
@gsuberland Javascript now has RegExp.escape for if you really really need a string-based regex (e.g. generating from data).
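A rough sketch of the "generating from data" case. `RegExp.escape` is not available in every runtime yet, so this assumes a hypothetical helper, `escapeForRegex`, that uses the built-in when present and otherwise falls back to a hand-rolled replace. The fallback covers the common metacharacters only and is not a full substitute for the built-in.

```typescript
// Hypothetical helper (not part of any library): prefer the built-in
// RegExp.escape when the runtime provides it, otherwise escape the usual
// regex metacharacters by hand.
function escapeForRegex(text: string): string {
  const builtin = (RegExp as unknown as { escape?: (s: string) => string }).escape;
  if (typeof builtin === "function") {
    return builtin.call(RegExp, text);
  }
  // Fallback: put a backslash in front of each metacharacter.
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hostname that arrives as data, for example from a config file.
const hostFromData = "en.wikipedia.org";
const hostPattern = new RegExp("^" + escapeForRegex(hostFromData) + "$");

console.log(hostPattern.test("en.wikipedia.org"));
console.log(hostPattern.test("enXwikipedia.org"));
```

The two calls should differ: the exact hostname matches, the lookalike does not, because the dots are treated as literal dots.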
Good advice in this thread!
@cr1901 in JS and TS regex literals look like this:
const r = /^[a-z0-9]+\s[0-9]+$/i;
which is identical to writing:
const r = new RegExp("^[a-z0-9]+\\s[0-9]+$", "i");
it's a separate thing to raw strings.
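A small sketch of why the string form needs the doubled backslash (variable names are just for illustration). In an ordinary JS/TS string literal, `\s` is not a recognized escape, so the backslash is dropped before the regex engine ever sees the pattern.

```typescript
// Regex literal: the backslash reaches the regex engine untouched.
const fromLiteral = /^[a-z0-9]+\s[0-9]+$/i;

// String form with the backslash doubled: same pattern as the literal.
const fromEscapedString = new RegExp("^[a-z0-9]+\\s[0-9]+$", "i");

// String form with a single backslash: "\s" collapses to "s" in the string,
// so the pattern now demands a literal letter "s" instead of whitespace.
const fromBrokenString = new RegExp("^[a-z0-9]+\s[0-9]+$", "i");

// The literal and the double-escaped string carry the same pattern text.
console.log(fromLiteral.source === fromEscapedString.source);

// The single-backslash version rejects the intended input...
console.log(fromLiteral.test("abc 123"), fromBrokenString.test("abc 123"));

// ...and accepts input with a literal "s" where the whitespace should be.
console.log(fromBrokenString.test("abcs123"));
```

Nothing throws in the broken case, which is what makes it easy to miss.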
@cr1901 @gsuberland Specifically, in languages which support regex literals, these expressions *are not* free-text strings. The expressions can be checked at compile time, for example to ensure parens, brackets, and braces close properly.
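To illustrate the other side of that in JS/TS: when a pattern is built from a string, a malformed expression only surfaces when `new RegExp` runs, whereas the same mistake in a regex literal is rejected when the file is parsed. A minimal sketch, assuming a hypothetical helper named `tryCompile`:

```typescript
// Hypothetical helper: returns the compiled regex, or null when the
// pattern is malformed.
function tryCompile(pattern: string): RegExp | null {
  try {
    return new RegExp(pattern);
  } catch (err) {
    if (err instanceof SyntaxError) {
      return null; // malformed pattern, only discovered at runtime
    }
    throw err; // anything else is unexpected, so pass it on
  }
}

// Balanced group: compiles.
console.log(tryCompile("(\\d+)") !== null);

// Unclosed group: fails, but only once this line actually executes.
console.log(tryCompile("(\\d+") === null);
```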
Many such languages also have a more human-readable way to construct the expressions. For example, Swift's Regex type supports writing an expression like this:
Capture(as: someInteger) { OneOrMore(.digit) }
It's more verbose than "(?<someInteger>\d+)", but complicated expressions get so much easier to understand.
@xdydx @gsuberland Depends on the language, which is another reason this is such a thorny problem. In many languages, yes, the expression should be as you wrote it.
In some languages (PowerShell, for example), the escape character for strings is something other than backslash, so the expression as you wrote it would be incorrect. In most of these cases, the expression wouldn't match a real domain, which would be noticed in an allowlist entry but probably not in a blocklist entry.
To write correct expressions, you need to know implementation details like that, and most vendors hate giving those out.
@xdydx if you put it in a string, yes. but you should use regex string literals like /^en\.wikipedia\.org$/ if your language supports them so you don't need to double escape. so in TS/JS:
const reg = /^en\.wikipedia\.org$/;
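A quick sketch of what goes wrong with the single-escaped string version of that same pattern. The lookalike hostname is made up for illustration.

```typescript
// Regex literal: dots are escaped, so only the exact hostname matches.
const strictHost = /^en\.wikipedia\.org$/;

// String with single backslashes: "\." collapses to "." in the string,
// and an unescaped dot in a regex matches any character.
const looseHost = new RegExp("^en\.wikipedia\.org$");

// Hypothetical lookalike hostname, used only to show the difference.
const lookalike = "en.wikipediaXorg";

// Both accept the real hostname, so a quick test looks fine.
console.log(strictHost.test("en.wikipedia.org"), looseHost.test("en.wikipedia.org"));

// Only the loose pattern also accepts the lookalike.
console.log(strictHost.test(lookalike), looseHost.test(lookalike));
```

Because the loose pattern still matches the intended hostname, a happy-path check will not catch the mistake.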
@gsuberland
Ah. Ok. I follow.
Thanks for the clear example!
@gsuberland Storytime!
I was once working in this Ruby on Rails shop, meh product but overall great people to work with.
One day I was reviewing the Brakeman configuration shipped with the code base and noticed that they turned on the "use double quotes everywhere" option.
Because this happened way earlier than my time there, I brought this up in the engineering slack channel, explaining exactly this corner case, and asking what the origin story for this choice was. Mostly because Brakeman by default is smart enough to request single quotes when there's risk of interpolation and preferring double quotes everywhere else.
The Brakeman config author came in very hot with an explanation that boiled down to "I wrote a blog post about the importance of unifying the coding style for readability and you should really go through it".
Luckily enough, regexps weren't really used across the code base: the most impactful place was during the deployment process when the homegrown deployment service needed to figure out what to do on different hosts based on their hostname. So, anyway, limited blast radius and all under engineering control.
Because of all of this, I decided this wasn't a hill worth dying on. I reiterated it was a slightly dangerous choice in the current state of the code base and moved along to more interesting and burning problems.
Fast forward three months later, many code changes and, if memory serves right, even a Ruby/Rails version upgrade. More regexps in the code base.
Things are getting wonky, the SREs are having trouble with deployments and no one understands why some core components are not behaving as expected.
Luckily we had paid support so they open a ticket with a sample of the puzzling code. The answer comes in quick and dry: "you are using double quotes, the string gets interpolated before being sent to the regexp handler".
The incident and the root cause are posted in the engineering slack channel for awareness.
I'm laughing my arse off and resurrecting the old thread from a few months back.
The Brakeman configuration author is fuming.
We change the option back to the original default.
@gsuberland similarly, use r"raw strings" instead of "normal strings" when using Python, to avoid similar bugs in regexes
(to be clear, in python, a string literal that starts with r" will not interpret any escape sequences but include each character exactly as typed into the string)
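For anyone on the JS/TS side of the thread, the nearest thing to Python's raw strings is the `String.raw` template tag. A minimal sketch, for when a regex literal is not an option and the pattern has to live in a string:

```typescript
// String.raw keeps backslashes exactly as typed, much like r"..." in Python.
const rawSource = String.raw`^en\.wikipedia\.org$`;

// Built from the raw string: no double escaping needed.
const viaRaw = new RegExp(rawSource);

// The regex literal from earlier in the thread, for comparison.
const viaLiteral = /^en\.wikipedia\.org$/;

// Both carry the same pattern text.
console.log(viaRaw.source === viaLiteral.source);
```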