Mastodawn

TIL: in javascript regexes, \b (word boundary) matches between an umlaut and a regular letter:

"aaäaaöaa".replaceAll(/\b\w/g, (x) => x.toUpperCase()) results in "AaäAaöAa"
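For anyone who wants to reproduce this, here is a small sketch. The first call is the one from the post. The second is only one possible Unicode-aware alternative (an assumption on my part, not something from the thread): it uses the `u` flag with `\p{L}`/`\p{N}` and a lookbehind in place of `\b\w`. The umlauts are written as escapes so the example does not depend on how the file is normalized.

```javascript
// "aaäaaöaa", written with escapes for the precomposed ä (U+00E4) and ö (U+00F6)
const input = "aa\u00E4aa\u00F6aa";

// \w is [A-Za-z0-9_] only, so ä and ö count as non-word characters
// and \b finds a boundary on each side of them.
const asciiResult = input.replaceAll(/\b\w/g, (x) => x.toUpperCase());
console.log(asciiResult); // "AaäAaöAa"

// Hypothetical alternative: treat any Unicode letter, digit or "_" as a
// word character, and match one that is not preceded by another.
const unicodeResult = input.replaceAll(
  /(?<![\p{L}\p{N}_])[\p{L}\p{N}_]/gu,
  (x) => x.toUpperCase()
);
console.log(unicodeResult); // "Aaäaaöaa"
```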

@timotimo it looks like \w matches ASCII only; in that sense, it is broken in JS. What you think of as an umlaut is non-ASCII, so it is not DWIM.

ASCII operations like that should not exist in today's world, but this is JS, expect many broken parts in there, mostly silent, like this.

If there is some sort of a REPL with debug mode, that will likely show you that what the JS regex engine sees is something like this: "aaÃ¤aaÃ¶aa"

Timo the timo 1d ago

@burak on a logical level, strings in javascript always operate on 16-bit code unit values (UTF-16), so such a debug output would possibly look more like 0061 0061 00E4 0061 0061 00F6 0061 0061 (or the first and second half of each flipped, depending on endianness), but realistically, there are optimizations in engines like v8 to store data more compactly where possible, so this could really be seen as aaÃ¤aaÃ¶aa in memory, but only if you pretend it's latin1 rather than decoding the utf8 in there
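A rough sketch of both views, using only standard built-ins (`charCodeAt`, `TextEncoder`, `String.fromCharCode`). It shows what the language exposes, not what any particular engine keeps in memory; the variable names are made up for the example.

```javascript
// "aaäaaöaa" with precomposed ä (U+00E4) and ö (U+00F6)
const text = "aa\u00E4aa\u00F6aa";

// The logical view: one 16-bit code unit per character here.
const codeUnits = Array.from({ length: text.length }, (_, i) =>
  text.charCodeAt(i).toString(16).toUpperCase().padStart(4, "0")
).join(" ");
console.log(codeUnits); // "0061 0061 00E4 0061 0061 00F6 0061 0061"

// The mojibake view: encode to UTF-8, then pretend every byte is a
// Latin-1 character instead of decoding the UTF-8.
const utf8Bytes = new TextEncoder().encode(text);
const mojibake = String.fromCharCode(...utf8Bytes);
console.log(mojibake); // "aaÃ¤aaÃ¶aa"
```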

sorry, I have a lot of painful experience on the lower-level end of this topic :D

Burak Gürsoy 1d ago

@timotimo you are talking about unrelated things. The subject is that regex method, not JS strings.

Timo the timo

@burak then i'm not sure what your claim actually is? because the regex engine does understand codepoints like ä, as it matches \W as a single unit. If it only did matching in ASCII, it would have to match the two bytes of ä individually
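This can be checked with a quick sketch (variable names are just for illustration): count how many times `\W` matches in "ä" itself versus in its two-character mojibake form "Ã¤".

```javascript
// ä as the engine actually sees it: a single code unit, U+00E4
const umlaut = "\u00E4";
const singleCount = umlaut.match(/\W/g).length;
console.log(singleCount); // 1

// "Ã¤": what you would get if the UTF-8 bytes C3 A4 were treated as
// two separate characters. \W then matches twice.
const mojibakeUmlaut = "\u00C3\u00A4";
const mojibakeCount = mojibakeUmlaut.match(/\W/g).length;
console.log(mojibakeCount); // 2
```

So the engine sees one non-word character, which fits \w being ASCII-only while the matching itself works on whole code units.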