Colin Gordon

494 Followers
1.5K Following
2.3K Posts

Programming languages professor, kernel hacker, aspiring linguist (syntax & compositional semantics).

Currently figuring out how to combine all of my interests by mechanically translating English into formal specifications of a formally verified OS kernel for RISC-V.


pronouns: he/him, er/ihn
languages: english (native), deutsch (~B2; "Ich spreche ein bisschen Deutsch, aber nicht genug, um es für alles zu benutzen": "I speak a little German, but not enough to use it for everything"), français (<A1; "je parle seulement un peu le français": "I only speak a little French"), linguae latinae (relearning bit-rotted ~B1)
alts: @csgordon and @csgordon
homepage: https://csgordon.github.io/
“They Would Never Use the Death Star on Us”: Alderaan Residents Reflect on Their Support for the Empire as a Large Imperial Installation Enters the System

“For this focus group, conducted shortly before a sustained green glow began emanating from the station’s superlaser array, we spoke with residents who said ...

McSweeney's Internet Tendency

It's also the sort of thing that makes me wonder about those reports that "X% of new code is from AI." Measuring that always seemed slightly suspicious to me as someone who spent years reading research papers that mined software engineering data --- it's basically impossible to get really accurate data without instrumenting developer workflows (tools), as anything else depends on careful heuristics.

But counting auto-inserted LLM tags in commit messages, regardless of whether any such tool was actually involved, is both easy and convenient for executives. And however they actually measured it was probably even more imprecise than that.

I was quite keen on VSCode for a long time and used it heavily (mostly for LaTeX and C#), but gradually moved away as the telemetry got harder to turn off years ago. I tried VSCodium, but a surprising number of extensions broke without the Microsoft components 🤔. So I was mostly off it by the time Copilot showed up. Having not paid attention to it for years... I'm really glad I switched away. I don't know how VSCodium is going to keep their fork going if Microsoft flips on auto-signing *all* commits as co-authored by Copilot on a whim. (And I entirely missed them inserting ads into commit messages!)
https://mastodon.social/@danluu/116509588615166940
There are so many random IT systems at work emailing me that it's increasingly hard to tell a legitimate university system from a predatory conference, what with all these emails from things with names like unisci and scisys and catalysis and ....
We've released a demo of the project @CanLehmann and I have been working on: @verijit is a meta-tracing Verilog emulator that specializes the simulated core to the machine code being executed on top of it. It was lots of fun to work on! I'm quite happy with the results, too: we see up to two orders of magnitude improvement over Verilator for a simple RISC-V core.
https://www.youtube.com/watch?v=PXgUsEjvAOY
Verijit – Up to 100x faster Verilog simulation

YouTube

A few notes about the massive hype surrounding Claude Mythos:

The old hype strategy of 'we made a thing and it's too dangerous to release' has been done since GPT-2. Anyone who still falls for it should not be trusted to have sensible opinions on any subject.

Even their public (cherry picked to look impressive) numbers for the cost per vulnerability are high. The problem with static analysis of any kind is that the false positive rates are high. Dynamic analysis can be sound but not complete, static analysis can be complete but not sound. That's the tradeoff. Coverity is free for open source projects and finds large numbers of things that might be bugs, including a lot that really are. Very few projects have the resources to triage all of these. If the money spent on Mythos had been invested in triaging the reports from existing tools, it would have done a lot more good for the ecosystem.

I recently received a 'comprehensive code audit' on one of my projects from an Anthropic user. Of the top ten bugs it reported, only one was important to fix (and it should have been caught in code review, but it was 15-year-old code from back when I was the only contributor and so there was no code review). Of the rest, a small number were technically bugs but were almost impossible to trigger (even deliberately), half were false positives, and two were not bugs at all and came with proposed 'fixes' that would have introduced performance regressions on performance-critical paths. But all of them looked plausible. And, unless you understood very well the environment in which the code runs and the things for which it's optimised, I can well imagine you'd just deploy those 'fixes' and wonder why performance got worse. Possibly Mythos is orders of magnitude better, but I doubt it.

This mirrors what we've seen with the public Mythos disclosures. One, for example, was complaining about a missing bounds check, yet every caller of the function did the bounds check and so introducing it just cost performance and didn't fix a bug. And, once again, remember that this is from the cherry-picked list that Anthropic chose to make their tool look good.

I don't doubt that LLMs can find some bugs other tools don't find, but that isn't new in the industry. Coverity, when it launched, found a lot of bugs nothing else found. When fuzzing became cheap and easy, it found a load of bugs. Valgrind and address sanitiser both caused spikes in bug discovery when they were released and deployed for the first time.

The one thing where Mythos is better than existing static analysers is that it can (if you burn enough money) generate test cases that trigger the bug. This is possible, and cheaper, with guided fuzzing, but no one does it because even 10% of what Mythos would cost is too expensive for most projects.

The source code for Claude Code was leaked a couple of weeks ago. It is staggeringly bad. I have never seen such low-quality code in production before. It contained things I'd have failed a first-year undergrad for writing. And, apparently, most of this is written with Claude Code itself.

But the most relevant part is that it contained three critical command-injection vulnerabilities.

These are the kinds of things that static analysis should be catching. And apparently, at least one of the following is true:

  • Mythos didn't catch them.
  • Mythos doesn't work well enough for Anthropic to bother using it on their own code.
  • Mythos did catch them but the false-positive rate is so high that no one was able to find the important bugs in the flood of useless ones.

TL;DR: If you're willing to spend half as much money as Mythos costs to operate, you can probably do a lot better with existing tools.

Anthropic Claude Code Leak Reveals Critical Command Injection Vulnerabilities

Anthropic's Claude Code CLI contains three critical command injection vulnerabilities that allow attackers to execute arbitrary code and exfiltrate cloud credentials via environment variables, file paths, and authentication helpers. These flaws bypass the tool's internal sandbox and are particularly dangerous in CI/CD environments where trust dialogs are disabled.

BeyondMachines

While there are many cases made for coding LLMs, one that keeps coming up that I literally don't understand is "it will save us so much time typing." I've seen a lot of people push back on this with very sensible points, like that most of software development is not the process of entering code into files, but planning, designing, coordinating with other teams, and so on.

But it occurs to me that maybe a lot of the folks hung up on the "less typing" point (specifically those repeating it) just never learned to touch-type, so text entry really is a bottleneck for them? Is that a plausible source of some of this?

Do they teach typing in schools anymore? I still occasionally encounter an upper-level CS or SE student who is doing hunt-and-peck. I don't think it's taught at my kid's (middle) school, despite there being a "computers" class. I was required to take a "keyboarding" class (i.e., typing) in middle school, though I actually learned earlier via Mario Teaches Typing (the only educational software I'm fully convinced works).

This is NOT a formal job posting, just testing the waters. I have a year of post-doc money. Especially interested in formal methods + applied cogsci + diagramming. If you do work tied to my research, reach out (see my page). Must have US work authorization, sorry. Please feel free to share/boost!

We are going to put some of the MoreVMs'26 talks online.

First up:
How Many Compilers Is Too Many? A Look at V8’s History, Tradeoffs, and Architectural Choices

By Leszek Swirski from Google's V8 team

https://youtu.be/LpJ79MxplUk

How Many Compilers Is Too Many? A Look at V8’s History, Tradeoffs, and Architectural Choices

YouTube

Do we sometimes ask computers today to do things that are genuinely beyond what those older machines could do? Sure, even focusing on typical consumer applications. Games are a big one.

But I'm not talking about asking these machines to do anything new; I'm talking about using them for things *we already know they're very capable of doing*, not just in principle but because we've already used them that way!

While I'm eager for the next AI winter, I have to wonder if prolonged pressure on RAM prices might finally push folks (companies, really) building consumer software to think "hmm, we should make sure this still runs on some machines with limited specs."
7/7(?)