Mastodawn

"The code points represent every letter of the US alphabet when fed to computers, but their output is completely invisible to humans. People reviewing code or using static analysis tools see only whitespace or blank lines. To a #JavaScript interpreter, the code points translate into executable code."

Sigh.

https://arstechnica.com/security/2026/03/supply-chain-attack-using-invisible-code-hits-github-and-other-repositories/

Supply-chain attack using invisible code hits GitHub and other repositories

Unicode that's invisible to the human eye was largely abandoned—until attackers took notice.

Ars Technica

Chip Butty Mar 15

@simon_brooke as this is from Ars, do we actually know it happened? (I think I need to filter them after their Gen AI disinfo issues)

Simon Brooke Mar 15

@otfrom I don't know, but it is entirely plausible to me that JavaScript, at least, would behave in this way.

And, indeed, the language I've been working on in odd moments would also be somewhat vulnerable to this attack, since it allows symbols to contain arbitrary Unicode characters.

Chip Butty Mar 15

@simon_brooke it sounds plausible, just that the source article there has burned their credibility

JdeBP Mar 15

@otfrom @simon_brooke

Given that the article provides two distinct bogus expansions of the PUA initialism when it comes to Unicode, there's certainly a whiff of verified-by-bullshit-generator about the article.

Plus, of course, there are the facts that (a) PUA glyphs are not zero-width, (b) programmers like to use 'nerd' fonts, and (c) even without that they'll show up as replacement characters in things like terminal emulators.

#Unicode #PrivateUseArea #ArsTechnica #AIslop #journalism

Simon Brooke Mar 15

@JdeBP @otfrom I have a lot of more useful things I ought to be doing today, but I am tempted to see if I can write a proof-of-concept attack, simply so that we can find out what languages (and thus what projects) are vulnerable.

There are something like 135,000 code points in these 'private use areas', but I have a hunch that the problem is much wider; UTF-32 has a potential four billion code points, and very many of those are unassigned. What happens if you use them?

SnoopJ

@simon_brooke @otfrom err, there may be 4 billion arrangements of 4 bytes, but there are many fewer valid UTF-32 sequences, because there are many fewer Unicode codepoints. 0x10FFFF ~= 1 million is not a small number, but 3 orders of magnitude is more than a rounding error here.

though I can unfortunately imagine that failing to distinguish between UTF-32 and UCS-4 (as originally published) is not an uncommon error when it comes to handling encoded text.
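For anyone who wants to check the numbers being traded above, here is a rough sketch in Python. The range boundaries are the ones fixed by the Unicode Standard; the helper name `decodes_as_utf32` is made up for illustration, and the decoding behaviour shown is CPython's, so other runtimes may be more permissive.

```python
# Code space: U+0000 through U+10FFFF inclusive (this count includes surrogates).
UNICODE_MAX = 0x10FFFF
total_code_points = UNICODE_MAX + 1
four_byte_patterns = 2 ** 32

# The three private use areas, as inclusive (first, last) ranges.
PRIVATE_USE_AREAS = {
    "BMP private use area": (0xE000, 0xF8FF),
    "Plane 15 private use": (0xF0000, 0xFFFFD),
    "Plane 16 private use": (0x100000, 0x10FFFD),
}
pua_sizes = {name: last - first + 1 for name, (first, last) in PRIVATE_USE_AREAS.items()}
pua_total = sum(pua_sizes.values())


def decodes_as_utf32(value: int) -> bool:
    """Hypothetical helper: does CPython accept this 32-bit value as UTF-32?"""
    try:
        value.to_bytes(4, "little").decode("utf-32-le")
        return True
    except UnicodeDecodeError:
        return False


print(total_code_points)                         # 1114112
print(pua_total)                                 # 137468
print(four_byte_patterns // total_code_points)   # 3855
print(decodes_as_utf32(0x10FFFF))                # True
print(decodes_as_utf32(0x110000))                # False
```

So "something like 135,000" is about right for the private use areas, and the gap between four-byte patterns and actual code points is a factor of a few thousand. A strict decoder rejects anything past U+10FFFF outright; a decoder that does not is the UTF-32 versus UCS-4 confusion described above.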

I agree with @JdeBP that much of the blame for this vulnerability lies in user interfaces making the amazingly unwise decision to hide this text from the user (not that assigning blame is particularly meaningful: if an attack exists, it exists, and PUA abuse of all sorts is already out there, so…)
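If the problem is interfaces hiding this text, one mitigation is a tool that refuses to hide it. The following is a minimal sketch, not a vetted scanner: it assumes that flagging the general categories Co (private use), Cf (format) and Cn (unassigned) is a useful first pass, and the names `SUSPICIOUS_CATEGORIES`, `reveal_hidden` and `find_hidden` are invented here. It says nothing about which code points the reported attack actually used.

```python
import unicodedata

# Assumption: these categories are worth flagging in source code.
# Co = private use, Cf = format (often invisible), Cn = unassigned.
SUSPICIOUS_CATEGORIES = {"Co", "Cf", "Cn"}


def reveal_hidden(text: str) -> str:
    """Replace each suspicious code point with a visible <U+XXXX> marker."""
    out = []
    for ch in text:
        if unicodedata.category(ch) in SUSPICIOUS_CATEGORIES:
            out.append(f"<U+{ord(ch):04X}>")
        else:
            out.append(ch)
    return "".join(out)


def find_hidden(text: str):
    """Yield (line, column, code point, category) for each suspicious character."""
    for lineno, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            category = unicodedata.category(ch)
            if category in SUSPICIOUS_CATEGORIES:
                yield lineno, col, ord(ch), category


# A line of JavaScript with two private use code points appended.
sample = "const x = 1;\uE000\U000F0041\n"
print(reveal_hidden(sample), end="")   # const x = 1;<U+E000><U+F0041>
for hit in find_hidden(sample):
    print(hit)
```

Legitimate text does use some Cf characters (zero-width joiners in emoji sequences, for example), so a real tool would need an allowlist rather than a blanket rejection. Which code points count as unassigned also depends on the Unicode version bundled with the Python build.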