Mastodawn

Steve Troughton-Smith Mar 1

⭐️ New blog post: A Month With OpenAI's Codex

https://highcaffeinecontent.com/blog/20260301-A-Month-With-OpenAIs-Codex

It's been literal *years* since I last posted anything, so you know this is a big deal for me 😜

Anders Hovmöller Mar 1

@stroughtonsmith Codex and Gemini are also seen as inferior to Claude for programming by many people I trust to know. For this use case of porting they are probably as good. But in situations with more ambiguity, or where the user gives bad advice, Claude is far superior from what I've seen.

Steve Troughton-Smith

@boxed that might have been true up to the release of 5.3 last month, but I'm not convinced that's still true. These things have a lot of subjectivity, though

Anders Hovmöller Mar 1

@stroughtonsmith https://x.com/petergostev/status/2026396167345459292?s=46&t=rFvA0C-h5tnMppUfb0s9JQ

It looks pretty bad for codex 5.3 imo.

Peter Gostev (@petergostev) on X

Link to the Repo: https://t.co/SkkvC6jcuf Link to the data viewer: https://t.co/b4q9uuJUhI

Steve Troughton-Smith Mar 1

@boxed asking a programming model the load-bearing capacity of a vegetable garden isn't the kind of metric that matters to me. Give me a benchmark that tests codex 5.3 and opus 4.6 across a variety of codebases and project types for various platforms, and I would be interested, but I expect either model already vastly outclasses what I could ever need in my lifetime

Anders Hovmöller Mar 2

@stroughtonsmith The point is to measure sycophantic acceptance of nonsense or incorrect data. The test here is silly, to be sure, but the practical effect is real. You want a system that will push back on incorrect assumptions.

Žymantas Mar 2

@boxed @stroughtonsmith benchmarks of AI models are not everything (I'm not saying they're unimportant). How agents manage context, system prompts, errors, etc. is just as important.

_random_agent_ + Opus 4.6 can be much worse than _great_agent_ + Opus 4.6

Anders Hovmöller Mar 2

@cleanbit @stroughtonsmith for sure. But this one I think shows a real issue for this generation of models and might be the reason why programmers prefer Claude while Gemini beats it in benchmarks. Getting to the answer when an answer exists is nice and all, but how you respond to a crisis is imo more important. This goes for people, nations, and models :)