This account is a replica from Hacker News. Its author can't see your replies. If you find this service useful, please consider supporting us via our Patreon.
| Official | https:// |
| Support this service | https://www.patreon.com/birddotmakeup |
| Official | https:// |
| Support this service | https://www.patreon.com/birddotmakeup |
I know you're right that there's a saturation point for context size, but it's not just context size that the larger models have, it's better grounding within that as a result of stronger, more discriminative attention patterns.
I'm not saying you're not going to drive confusion by overloading context, but the number of tokens required to trigger that failure mode in opus is going to be a lot higher than the number for gpt-oss-20b.
I'm pretty sure a model that can run on a cellphone is going to cap out it's context window long before opus or mythos would hit the point of diminishing returns on context overload. I think using a lower quality model with far fewer / noisier weights and less precise attention is going to drive false positives way before adding context to a SOTA model will.
You can even see here, AISLE had to print a retraction because someone checked their work and found that just pointing gpt-oss-20b at the patched version generated FP consistently: https://x.com/ChaseBrowe32432/status/2041953028027379806

@stanislavfort I am still extremely skeptical of this approach. I took the *patched* code snippet of the svc_rpc_gss_validate function, fed it to gpt-oss-20b, and it... incorrectly suggested that the vulnerability was still present (even though the code does a correct bounds check now).
newer models have larger context windows, and more stable reasoning across larger context windows.
If you point your model directly at the thing you want it to assess, and it doesn't have to gather any additional context you're not really testing those things at all.
Say you point kimi and opus at some code and give them an agentic looping harness with code review tools. They're going to start digging into the code gathering context by mapping out references and following leads.
If the bug is really shallow, the model is going to get everything it needs to find it right away, neither of them will have any advantage.
If the bug is deeper, requires a lot more code context, Opus is going to be able to hold onto a lot more information, and it's going to be a lot better at reasoning across all that information. That's a test that would actually compare the models directly.
Mythos is just a bigger model with a larger context window and, presumably, better prioritization and stronger attention mechanisms.
I get what you're saying, but I think this is still missing something pretty critical.
The smaller models can recognize the bug when they're looking right at it, that seems to be verified. And with AISLE's approach you can iteratively feed the models one segment at a time cheaply. But if a bug spans multiple segments, the small model doesn't have the breadth of context to understand those segments in composite.
The advantage of the larger model is that it can retain more context and potentially find bugs that require more code context than one segment at a time.
That said, the bugs showcased in the mythos paper all seemed to be shallow bugs that start and end in a single input segment, which is why AISLE was able to find them. But having more context in the window theoretically puts less shallow bugs within range for the model.
I think the point they are making, that the model doesn't matter as much as the harness, stands for shallow bugs but not for vulnerability discovery in general.