"The code points represent every letter of the US alphabet when fed to computers, but their output is completely invisible to humans. People reviewing code or using static analysis tools see only whitespace or blank lines. To a #JavaScript interpreter, the code points translate into executable code."

Sigh.

https://arstechnica.com/security/2026/03/supply-chain-attack-using-invisible-code-hits-github-and-other-repositories/

Supply-chain attack using invisible code hits GitHub and other repositories

Unicode that's invisible to the human eye was largely abandoned—until attackers took notice.

Ars Technica
@simon_brooke as this is from Ars, do we actually know it happened? (I think I need to filter them after their Gen AI disinfo issues)

@otfrom I don't know, but it is entirely plausible to me that JavaScript, at least, would behave in this way.

And, indeed, the language I've been working on in odd moments would also be somewhat vulnerable to this attack, since it allows UTF characters in symbols.

@simon_brooke it sounds plausible, just that the source article there has burned their credibility
@otfrom @simon_brooke you mean the Aikido one?
Editor’s Note: Retraction of article containing fabricated quotations

We are reinforcing our editorial standards following this incident.

Ars Technica

@otfrom @simon_brooke ah, that one. I think it’s irrelevant now.

I checked the source of the article being discussed: it’s a security company that sells tools to catch the kind of attacks described here, and it seems sound to me. But I haven’t ever worked with buffers in JavaScript and I hope I never will, so I was wondering if there was something fishy there.

@otfrom @simon_brooke

Given that the article provides two distinct bogus expansions of the PUA initialism when it comes to Unicode, there's certainly a whiff of verified-by-bullshit-generator about the article.

Plus, of course, there are the facts that (a) PUA glyphs are not zero-width, (b) programmers like to use 'nerd' fonts, and (c) even without that they'll show up as replacement characters in things like terminal emulators.

#Unicode #PrivateUseArea #ArsTechnica #AIslop #journalism

@JdeBP @otfrom I have a lot of more useful things I ought to be doing today, but I am tempted to see if I can write a proof-of-concept attack, simply so that we can find out what languages (and thus what projects) are vulnerable.

There are something like 135,000 code points in these 'private use areas', but I have a hunch that the problem is much wider; UTF-32 has a potential four billion code points, and very many of those are unassigned. What happens if you use them?
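
For reference, the size of the private use areas can be checked directly. This is a minimal sketch, assuming the three ranges the Unicode Standard defines for private use; `pua_size` and `count_category` are names made up for this illustration. The count of unassigned code points depends on the Unicode version bundled with the Python build, so no figure is claimed for it here.

```python
import unicodedata

# The three Private Use Area ranges defined by Unicode (inclusive bounds).
PUA_RANGES = [
    (0xE000, 0xF8FF),      # Basic Multilingual Plane
    (0xF0000, 0xFFFFD),    # Plane 15
    (0x100000, 0x10FFFD),  # Plane 16
]


def pua_size() -> int:
    """Total number of private-use code points across the three ranges."""
    return sum(hi - lo + 1 for lo, hi in PUA_RANGES)


def count_category(category: str) -> int:
    """Count code points whose general category matches, e.g. 'Co' or 'Cn'."""
    return sum(
        1 for cp in range(0x110000)
        if unicodedata.category(chr(cp)) == category
    )


if __name__ == "__main__":
    print("private use (Co):", pua_size())
    # 'Cn' covers unassigned code points and noncharacters; the number
    # varies with the Unicode version this Python was built against.
    print("unassigned (Cn):", count_category("Cn"))
```

The private-use total comes to 137,468, which is in line with the rough figure quoted above.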

@simon_brooke

It's not the languages that have the interaction with #Unicode. The vulnerability in the languages is their ability to take code that is constructed at runtime in string form and interpret and execute it.

It's the text editors, IDEs, and pagers (that display the commit diffs) hiding the #PrivateUseArea (and, yes, unassigned code point) characters by rendering them as zero width.

But quite a lot of them don't. The code snippet in the article doesn't actually look like the screenshot given.

In text editors such as Neovim and Vim, and pagers such as less, more, most, and console-tty37-viewer, these characters are either emitted as narrow-width glyphs, which at minimum display as mystery strings of replacement characters, or turned into reverse-video hexadecimal code point values.
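
A tool that does not hide these characters can be approximated in a few lines. The following is only a sketch, not any editor's or pager's actual behaviour: `find_hidden` and `make_visible` are hypothetical names, and treating format characters (`Cf`, which includes zero-width and bidi controls) as suspicious alongside private-use (`Co`) and unassigned (`Cn`) code points is an assumption of this example.

```python
import unicodedata

# General categories treated as suspicious here (an assumption of this sketch):
# Co = private use, Cn = unassigned, Cf = format (zero-width and bidi controls).
SUSPICIOUS = {"Co", "Cn", "Cf"}


def find_hidden(text: str):
    """Yield (line, column, 'U+XXXX', category) for each suspicious character."""
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            cat = unicodedata.category(ch)
            if cat in SUSPICIOUS:
                yield line_no, col, f"U+{ord(ch):04X}", cat


def make_visible(text: str) -> str:
    """Replace each suspicious character with a visible <U+XXXX> marker."""
    return "".join(
        f"<U+{ord(ch):04X}>" if unicodedata.category(ch) in SUSPICIOUS else ch
        for ch in text
    )


if __name__ == "__main__":
    sample = 'x = ""\ny = "\ue000\U000f0041"'
    for hit in find_hidden(sample):
        print(hit)
    print(make_visible(sample))
```

Run over a diff or a source file before review, something like this turns the "blank" string into an obvious run of `<U+...>` markers. Note that it would also flag legitimate uses such as a byte order mark or a soft hyphen, so the category set would need tuning for real use.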

@otfrom

@JdeBP @simon_brooke @otfrom Also I suspect one compounding factor is that many web-editors will do some processing on ASCII letters for code formatting and syntax highlighting, but will basically forward all the “weird stuff” to the web-browser.

The same issue appears with bidi injections:

https://wiesmann.codiferes.net/wordpress/archives/20031

Understanding Bidi Injections

🔁 Most injection attacks follow the same pattern: a character or a sequence with a special meaning is not properly handled in user provided data, and ends up being interpreted the wrong way…

Thias の blog

@simon_brooke @otfrom err, there may be 4 billion arrangements of 4 bytes, but there are many fewer valid UTF-32 sequences, because there are many fewer Unicode codepoints. 0x10FFFF ~= 1 million is not a small number, but 3 orders of magnitude is more than a rounding error here.

though I can unfortunately imagine that failing to distinguish between UTF-32 and UCS-4 (as originally published) is not an uncommon error when it comes to handling encoded text.
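
The difference is easy to make concrete. This is a minimal sketch, assuming little-endian 4-byte units with no byte order mark; `is_valid_utf32_unit` and `invalid_units` are names invented for the illustration. A 4-byte unit is valid UTF-32 only if its value is at most 0x10FFFF and is not a surrogate, whereas a decoder that accepts any 32-bit value is working with a far larger space.

```python
import struct

MAX_CODE_POINT = 0x10FFFF           # upper limit of the Unicode code space
SURROGATES = range(0xD800, 0xE000)  # reserved for UTF-16, never valid in UTF-32


def is_valid_utf32_unit(value: int) -> bool:
    """True if a 32-bit value is a legal UTF-32 code unit (a Unicode scalar value)."""
    return 0 <= value <= MAX_CODE_POINT and value not in SURROGATES


def invalid_units(data: bytes):
    """Return (byte offset, value) for each little-endian unit that is not valid UTF-32."""
    if len(data) % 4:
        raise ValueError("length is not a multiple of 4 bytes")
    return [
        (index * 4, value)
        for index, (value,) in enumerate(struct.iter_unpack("<I", data))
        if not is_valid_utf32_unit(value)
    ]


# How many 4-byte patterns exist per Unicode code point: a bit under 4,000.
RATIO = 2**32 / (MAX_CODE_POINT + 1)

if __name__ == "__main__":
    data = struct.pack("<3I", 0x41, 0x110000, 0xD800)
    print(invalid_units(data))
    print(f"4-byte patterns per code point: {RATIO:.1f}")
```

A decoder that skips this check and passes out-of-range values through is exactly the UTF-32 versus UCS-4 confusion described above.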

I agree with @JdeBP that much of the blame for this vulnerability lies in user interfaces making the amazingly unwise decision to hide this text from the user (not that assigning blame is particularly meaningful: if an attack exists, it exists, and PUA abuse of all sorts is already out there, so…)