Mastodawn

TIL: in JavaScript regexes, \b (word boundary) matches between an umlaut and a regular letter:

"aaäaaöaa".replaceAll(/\b\w/g, (x) => x.toUpperCase()) results in "AaäAaöAa"
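For comparison, here is a minimal sketch of a Unicode-aware variant. It assumes the goal is to uppercase only the first letter of each word; `capitalizeWords` is a hypothetical helper name, not anything built in. It swaps \b\w for a negative lookbehind plus Unicode property escapes, which need the `u` flag.

```javascript
// ASCII-only behaviour from the post: \w is [A-Za-z0-9_], so "ä" is a non-word character
// and \b sees a boundary on either side of it.
const input = "aaäaaöaa";
const asciiResult = input.replaceAll(/\b\w/g, (x) => x.toUpperCase());

// Hypothetical helper: uppercase a letter only when it is not preceded by
// a letter, combining mark, digit, or underscore (Unicode-aware "word start").
function capitalizeWords(text) {
  return text.replace(/(?<![\p{L}\p{M}\p{N}_])\p{L}/gu, (ch) => ch.toUpperCase());
}

const unicodeResult = capitalizeWords(input);

console.log(asciiResult);   // "AaäAaöAa"
console.log(unicodeResult); // "Aaäaaöaa"
```

The lookbehind stands in for \b because \b is defined in terms of \w, which stays ASCII-based in JavaScript even when the `u` flag is set.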

@timotimo It looks like that matches ASCII only, and in that sense it is broken in JS. What you think of as an umlaut is a non-ASCII character, so the regex does not DWIM (do what I mean).

ASCII-only operations like that should not exist in today's world, but this is JS: expect many broken parts in there, mostly silent ones, like this.

If there is some sort of REPL with a debug mode, it will likely show you that what the JS regex engine sees is something like this: "aaÃ¤aaÃ¶aa"
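A quick way to test that guess without a debugger is to dump the string's UTF-16 code units, since that is the representation JavaScript strings use. The sketch below assumes a precomposed ä (U+00E4); `codeUnits` is a hypothetical helper name. In this check the umlaut comes out as the single unit 00e4 rather than a two-byte sequence, which suggests the mismatch comes from \w being ASCII-only rather than from mis-decoded input.

```javascript
// Hypothetical helper: list the UTF-16 code units of a string as 4-digit hex.
function codeUnits(text) {
  return Array.from({ length: text.length }, (_, i) =>
    text.charCodeAt(i).toString(16).padStart(4, "0"));
}

const units = codeUnits("a\u00E4");        // "aä" written with an escape
const matchesAsciiWord = /\w/.test("\u00E4");      // \w is [A-Za-z0-9_]
const matchesUnicodeLetter = /\p{L}/u.test("\u00E4"); // Unicode "Letter" property

console.log(units);                // [ '0061', '00e4' ]
console.log(matchesAsciiWord);     // false
console.log(matchesUnicodeLetter); // true
```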

Timo the timo

@burak Let me interject for a moment.

What you guys refer to as "Umlauts" is in fact Unicode/Umlauts, or as I've recently taken to calling it, Unicode[C1 Controls and Latin-1 Supplement (Latin-1 Supplement)]. Umlauts are not a byte unto themselves, but rather a codepoint in a fully functioning system of encoding characters used by humans.

etc etc etc.

Burak Gürsoy 3d ago

@timotimo ASCII is not any Latin-X; it is a subset of them, that's for sure. Try to locate some documentation about it; it should say that those boundaries are defined for ASCII only.