⭐️ New blog post: A Month With OpenAI's Codex

https://highcaffeinecontent.com/blog/20260301-A-Month-With-OpenAIs-Codex

It's been literal *years* since I last posted anything, so you know this is a big deal for me 😜

@stroughtonsmith Codex and Gemini are also seen as inferior to Claude for programming by many people I trust to know. For this use case of porting, they're probably just as good. But in situations with more ambiguity, or where the user gives bad advice, Claude is far superior from what I've seen.
@boxed that might have been true up to the release of 5.3 last month; I'm not convinced it still is. But these things have a lot of subjectivity
Peter Gostev (@petergostev) on X

Link to the Repo: https://t.co/SkkvC6jcuf Link to the data viewer: https://t.co/b4q9uuJUhI


@boxed @stroughtonsmith benchmarks of AI models are not everything (not saying they're unimportant). How an agent manages context, system prompts, errors etc. matters just as much.

_random_agent_ + Opus 4.6 can be much worse than _great_agent_ + Opus 4.6

@cleanbit @stroughtonsmith for sure. But I think this one shows a real issue with this generation of models, and it might be why programmers prefer Claude even while Gemini beats it in benchmarks. Getting to the answer when an answer exists is nice and all, but how you respond to a crisis is imo more important. This goes for people, nations, and models :)