⭐️ New blog post: A Month With OpenAI's Codex

https://highcaffeinecontent.com/blog/20260301-A-Month-With-OpenAIs-Codex

It's been literal *years* since I last posted anything, so you know this is a big deal for me 😜

Also, since this blog post touches on many different axes, from iOS to Android to Windows to design, and I only maintain a presence here, reposts into other communities are greatly appreciated! 😄

@stroughtonsmith Great post. I'm not at all surprised that you get it.

“It didn’t just blow away my expectations, it showed me the world has changed: we’ve just undergone a permanent, irreversible abstraction level shift.”

“This story is unfinished; this feels like a first foray into what software development will look like for the rest of my life.”

Totally agree on both counts. 🚀

@stroughtonsmith > something like Codex can chew through and rewrite a thousand lines of code in a second. Eventually, I just trusted it.

Jia Tan’s mistake was being too careful and wasting too much time on the social engineering. The next attacker will be far lazier than that: all they need to do is poison the datasets (which is trivial, even by the vendors’ own admission), and soon thousands of developers will be happily shipping unvetted malicious code that compromises everyone beyond repair.

https://www.anthropic.com/research/small-samples-poison

https://www.bbc.com/future/article/20260218-i-hacked-chatgpt-and-googles-ai-and-it-only-took-20-minutes


@stroughtonsmith does this level of “trust” work with applications which have actual, genuine real-world use, not pet projects?

Entire banking or financial systems?
Intercontinental missile systems?

@SeanMacGabhann of course not, I wouldn't even trust it with a spreadsheet. That would be silly.

I also don't work on entire banking or financial systems, or ICBMs.

I trust it to write my code, not everybody else's.

@stroughtonsmith

Thanks for the reply

But to me that’s where the disconnect/confusion lies

The sheer enthusiasm/belief in what you are doing versus what you wouldn’t trust it with

I think the general public/politicians and the media don’t get the nuance

@stroughtonsmith Codex and Gemini are also seen as inferior for programming compared to Claude by many people whose judgment I trust. For this use case of porting they are probably just as good. But in situations with more ambiguity, or where the user gives bad advice, Claude is far superior from what I've seen.
@boxed that might have been true up to the release of 5.3 last month; I'm not convinced it still is. But these things have a lot of subjectivity.
Peter Gostev (@petergostev) on X

Link to the Repo: https://t.co/SkkvC6jcuf Link to the data viewer: https://t.co/b4q9uuJUhI

@boxed asking a programming model for the load-bearing capacity of a vegetable garden isn't the kind of metric that matters to me. Give me a benchmark that tests Codex 5.3 and Opus 4.6 across a variety of codebases and project types for various platforms, and I would be interested; but I expect either model already vastly outclasses what I could ever need in my lifetime.
@stroughtonsmith The point is to measure sycophantic acceptance of nonsense or incorrect data. The test here is silly to be sure, but the practical effect is real. You want a system that will push back on incorrect assumptions.

@boxed @stroughtonsmith benchmarks of AI models are not everything (not saying they're unimportant). How agents manage context, system prompts, errors etc. matters just as much.

_random_agent_ + Opus 4.6 can be much worse than _great_agent_ + Opus 4.6

@cleanbit @stroughtonsmith for sure. But this one I think shows a real issue for this generation of models and might be the reason why programmers prefer Claude while Gemini beats it in benchmarks. Getting to the answer when an answer exists is nice and all, but how you respond to a crisis is imo more important. This goes for people, nations, and models :)