Mythos finds a curl vulnerability

yes, as in singular one. Back in April 2026 Anthropic caused a lot of media noise when they concluded that their new AI model Mythos is dangerously good at finding security flaws in source code. Apparently Mythos was so good at this that Anthropic would not release this model to the public yet but instead … Continue reading Mythos finds a curl vulnerability →

daniel.haxx.se
My personal conclusion, however, can be nothing other than that the big hype around this model so far was primarily marketing. I see no evidence that this setup finds issues to any notably higher or more advanced degree than other tools have done before Mythos. Maybe this model is a little bit better, but even if it is, it is not better to a degree that makes a significant dent in code analysis.
@bagder from my talks with people who had been given access to mythos in their org, they say it does find things which current tools miss, but also overlooks cases which current tools catch. so, yeah, to me it is "mostly marketing" combined with general FUD
@bagder but i have not asked them about exploit capabilities, though. there i cannot comment, and there its capabilities could be significantly better

@km As far as I can tell:

  • No one who has worked with raw Mythos output has ever written about it.
  • No one who has written about it has ever used it.

They would much rather have @bagder write about it because his opinion carries weight. That means he can’t have direct access: to give him access, they’d demand to gag him with an NDA, like everyone else who has it.

This technique of making readers mentally fill in the gaps between what is verifiable and what is claimed is genius marketing and really dishonest. But we have come to expect systematic and casual dishonesty from these companies.

@paco @bagder yeah, let me clarify: i talked with people who had not themselves used mythos, but whose org was given access. so yeah, they just passed along what they were told

@km Yeah. I didn’t mean it personally. I wasn’t criticising what you said, I’m sorry if I sounded that way.

I was just pointing out this constant theme. The only thing that ever is made public is the fully-polished, human-vetted final result. They carefully hide all other details and the press don’t care.

@bagder

@bagder Yes. While I can't prove it, it tracks with A stealing the playbook of O, who already said they will likely pivot from B2C to B2B. One last fear-mongering push, tons of directed compute aimed at reputable projects, and suddenly your marketing far surpasses any benchmark.

@bagder
In terms of evidence to the contrary:
Check out
https://social.security.plumbing/@freddy/116549451049357174 / the blog post:
https://hacks.mozilla.org/2026/05/behind-the-scenes-hardening-firefox/

>270 vulnerabilities found by Mythos fixed in a single Firefox release.

That's just one data point, but interestingly far off from yours.

Frederik Braun (@[email protected])

Where do the people hang that read our hacks blog post and then went through all of the bugs that we opened up? Really eager for the deeper, informed takes now :) https://hacks.mozilla.org/2026/05/behind-the-scenes-hardening-firefox/

security.plumbing
@oots @bagder Firefox is a wildly more complex piece of software though (I assume), and they also fixed a lot of bugs found by other models in addition to those from Mythos. They don't really go into how much of the volume of bugs is due to Mythos itself, versus their experience and the harness they had built around the models by the time they got access to Mythos

@das_robin @bagder
Yes, #Firefox is probably a few orders of magnitude more complex than #curl and definitely much bigger.

Still, the blog post explicitly mentions "In addition to fixing the 271 bugs identified by Claude Mythos Preview in the 150 release, we’ve shipped more of these fixes in 149.0.2, 150.0.1, and 150.0.2.", so >270 attributed to #Mythos *alone*.

@bagder How do you explain that Mythos found 271 bugs in Firefox, and counting, and only 1 in cURL. Is the Firefox code base 271 times larger?
@gnirre I do not explain that at all because I don't have enough knowledge to do so.
@bagder Did Anthropic know that you finally had gotten access to Mythos?
@gnirre no idea, probably not
@bagder Maybe my question should have been if Alpha Omega knew? Your access was "unofficial"?
@gnirre I don't know how much they asked or told A about when this was done. It's not "my" access, someone else has the access and ran the analysis

@gnirre @bagder with the most glancing of looks at the Firefox 150 source (and some rounding),
curl: 200k lines of c
firefox:

  • 5M lines of rust
  • 9M lines of C and C++
  • 200k lines of assembly
  • 2M lines of python

so like, without looking at anything else, firefox is significantly bigger
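A quick back-of-the-envelope from those rounded counts, treating raw line count as a crude proxy for how much there is to analyze (which it isn't really):

```python
# Rounded line counts quoted above for Firefox 150 and curl.
firefox_lines = 5_000_000 + 9_000_000 + 200_000 + 2_000_000  # Rust + C/C++ + asm + Python
curl_lines = 200_000

size_ratio = firefox_lines / curl_lines   # larger by raw line count
bug_ratio = 271 / 1                       # reported Mythos findings: Firefox vs curl

print(f"size ratio: ~{size_ratio:.0f}x, bug ratio: {bug_ratio:.0f}x")
# → size ratio: ~81x, bug ratio: 271x
```

So by this crude measure Firefox is roughly 80x bigger, not 271x; the rest of the gap would have to come from code quality, prior tooling, or how the analysis runs were configured.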

@4censord @gnirre @bagder Also, didn't they intentionally disable all mitigations, sandboxing etc. in Firefox *and* include every teeny tiny bug it found (without mentioning the false-positives, which were probably a metric shit ton) to bolster those numbers?

There were lots of shenanigans afaik.

@bagder This suggests a fun exercise for someone interested in messing around with LLMs:

1. Put back all the curl security issues previously found by LLM tools by dropping the fix commits from history or otherwise obfuscating the revert.

2. Feed the re-vulnerabilized repo to a selection of models and see what are the cheapest ones (by memory, time and/or monetary cost) that can find, say, 50%/75%/100% of the issues found by the warehouse-scale "foundation models".

Feels like a large part of the current results should be doable with significantly smaller resources, because being trained on every tweet and reddit post and libgen book ever is not obviously related to the task.
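Step 1 of that exercise can be sketched with `git revert`. A minimal toy illustration, where the repo, file contents, and commits are invented stand-ins rather than actual curl history:

```python
# Toy demo of "re-vulnerabilizing" a repo by reverting a fix commit.
import pathlib
import subprocess
import tempfile

repo = pathlib.Path(tempfile.mkdtemp())
lib = repo / "lib.c"

def git(*args):
    subprocess.run(["git", *args], cwd=repo, check=True, capture_output=True)

git("init", "-q")
git("config", "user.email", "[email protected]")  # hypothetical identity
git("config", "user.name", "Demo")

# Vulnerable version, then the fix, as two commits.
lib.write_text("strcpy(dst, src); /* unbounded copy */\n")
git("add", "lib.c"); git("commit", "-qm", "initial (vulnerable)")
lib.write_text("strlcpy(dst, src, dstlen);\n")
git("add", "lib.c"); git("commit", "-qm", "fix: bound the copy")

# Drop the fix: the tree is vulnerable again, ready to hand to a model.
git("revert", "--no-edit", "HEAD")
print(lib.read_text().strip())  # → strcpy(dst, src); /* unbounded copy */
```

In practice you would script this over the real fix commits for each previously reported issue, then run each candidate model against the resulting tree and tally cost per rediscovered bug.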

@redsakana @bagder

llm tools found security issues in curl? doubt

@bagder 💯☝️this
@peteriskrisjanis @bagder this says more about how well maintained curl is than about anything else. there is a finite number of security issues to be found in stuff
@normis Normis, surely you know that this is the curl author?
@bagder it's all marketing. And any improvements are completely moot, as the actual *costs* to find that single bug were in the tens of thousands of dollars minimum. That's the MINIMUM known cost.
It would not surprise me if finding that one bug cost $75k, $100k, $200k of compute time. It's a pile of shit, hilariously inefficient slop that sometimes behaves as a fuzzer that occasionally finds a crumb.
"Zero memory-safety vulnerabilities found." 💚
@bagder b-b-b-but curl is not in Rust!
@synlogic4242 @bagder Yes, someone really needs to get on to that rewriting thing. Just a pity there hasn't been a weekend in *years* so nobody had the chance!
@bagder thanks for this. It was really helpful to understand the hype around Mythos and also to see that high code quality matters a lot, especially if human-driven
@bagder spectacular result! Huge congratulations to the entire team! Made my day :)
@bagder That reinforces my suspicions that there was a breakthrough for security at the start of the year, and that the rest of the year will be more quiet.
@bagder that "the size of curl" section is honestly incredible numbers.
@bagder hah! i was right!
@bagder
At least it works. It would have been quite a disaster if it found zero.
@alterelefant @bagder Are you a machine?
Classifying finding a single vulnerability (1) as success and 0 as failure sure seems like it
😁
The world is not black and white and the usefulness of LLMs for finding vulnerabilities IMO isn't either

@bagder Would it be a good idea to take an older version, where you already know you (as humans) found (and fixed) a certain number of vulnerabilities, and see if the AI can spot those correctly?

The idea being to really have a quality test? ("For Science" ;) ).

Or are they all trained on your latest version already, which would invalidate that test?

@johnnythan I agree that would be an interesting challenge for someone with time and tokens to burn

@bagder LOL!

The report concluded it found five “confirmed security vulnerabilities”. I find the term “confirmed” a little amusing when it's the AI itself that confidently says so. Yes, the AI thinks they are confirmed, but the curl security team has a slightly different take.

@bagder yessssssssss. we guessed right on the poll :D
@bagder I suspect the question is, will it still be a worthwhile tool when the actual price to use the tool, not subsidized by anyone's war chest or VC, is revealed?
@quinn my current opinion: for security scans and reviews, AI tools are and will be useful, but not to generate code. @bagder
@kleisli @bagder
if it's something like 10,000 euros a pop, it might not be worth security scans and reviews, except for governmental clients.

@quinn

Especially if it's subscription-based, as these models seem to be good at finding only specific sets of problems before drying up, but even 10k per use is really gov or big corpo territory.

@kleisli @bagder

@0x0 @kleisli @bagder to be clear i picked that number out of my butt, but it is clear to me that it's going to be very hard to make up their investment in it, much less the minimum 10x return (which would probably be a couple trillion dollars)
@bagder the power of rigorous software engineering :D

@bagder not trying to buy into Anthropic's hype machine, but I wonder if curl is just a nonrepresentative code base. The average closed-source / internal code base is probably orders of magnitude worse when it comes to static checks, engineering principles, you name it.

I suspect Mythos will be useful in making poor software a bit more secure. That could have been done without AI of course.

@eskett I do emphasize that it is good at finding flaws. And so are many other models. So yes, they will certainly find many flaws in source code going forward. Mythos and the others.
@eskett @bagder Put another way, curl is the model to follow.

@bagder

AI powered code analyzers are significantly better at finding security flaws and mistakes in source code than any traditional code analyzers did in the past

I’m not sure this follows from what you’ve said in the rest of the post. Static analysers and fuzzers also made it very easy for people to find vulnerabilities and typically found a lot when they were deployed for the first time. And both were a lot cheaper to run than something like Mythos.

They aren’t finding as many vulnerabilities now because projects that are critical for security are integrating them into their CI flows.

And this is what always happens with some new technique: valgrind, Coverity, sanitisers, fuzzers, and so on: they’re released, they find a load of bugs that existing techniques failed to find, people fix them, they get integrated into regular CI runs, and the kinds of bugs that those tools find never make it into the tree.

syzkaller, for example, has found a lot more bugs in the Linux kernel than any Anthropic tool has. And that’s just one fuzzing tool.

@david_chisnall i think it makes sense for everyone to run the "easy" and cheap tools first, and once they all find no more problems, then you bring out the bigger cannons like AI analyzers. So yeah, which is "best"? It probably depends.
@bagder @david_chisnall I'm not going to advocate actually doing this because it's expensive and I'm not a fan of the environmental impacts, but I am curious what it would find if you pointed it at the codebase from a time before the other precursor tools like fuzzers were in use. How many bugs can it find that you know with hindsight are there to be found?
@http_error_418 I agree, this would be a very interesting experiment - and potentially informative for other teams deciding where to spend limited developer time. @bagder @david_chisnall

@http_error_418 @bagder

The original Coverity paper claimed, as I recall, 300 CVEs. I'm not sure what the severity distribution was, but that seems a lot more than Mythos, and they probably used less compute than a single Mythos query.

The problem with any static analyser, whether it's based on formal reasoning or pattern recognition, is that it will be unsound (i.e. it will have false positives, in contrast with dynamic analyses that are incomplete and have false negatives). The LLM-based tools are no different in this respect. From a Claude 'comprehensive code review' of one of my projects, the only serious bug in the top ten that it found was one that already had an open PR to fix, and two were not only not bugs, they were intentional design choices and doing it the other way would have caused serious performance regressions (and not fixed bugs).

The thing that does make Mythos different is that it tries to build a PoC exploit. This will reduce the false positive rate, at the expense of creating false negatives (if it can't produce a PoC, you ignore it).

When I've used Coverity on a large project, it's found tens of thousands of bugs, and most of them are false positives, so it requires a lot of effort to find the ones that are actually important bugs. Something that produces PoCs automatically would help this a lot.

The baseline data point I'd really like to see is something that integrates the clang analyser with libFuzzer. For each report the analyser finds, insert profiling points at the branches on the control flow chain that it recommends, then automatically drive the fuzzer to try to trigger the code paths that the analyser reported as potential issues.

The default settings for the clang analyser are compilation-unit-at-a-time and with reduced bounds on loop iteration counts to avoid using enormous amounts of memory. If you're willing to spend as much money as it costs to operate the LLM-based tools, you can use the cross-compilation-unit approaches and bump the state up a lot. Running it configured to use a comparable amount of RAM to the GPUs that the Anthropic models run on would let you do a lot of symbolic execution.

I love it:

"The AI reviews are used in addition to the human reviews. They help us, they don’t replace us."

@bagder

Very cool writeup, as someone who has had occasion to dive into curl's source code recently. Range requests and content-encoding are among the weirdest rough edges of the HTTP protocols, even after they gave up on transfer encoding entirely. I keep hoping someone's found a clever way around it.
@bagder In line with what this blog post stated shortly after it was announced: the model is nothing special and much cheaper models can find the same bugs. Marketing BS turned to 11. https://www.flyingpenguin.com/the-boy-that-cried-mythos-verification-is-collapsing-trust-in-anthropic/