When you tell AI models what specifically to look out for in a coding task…

…they repeatedly, consistently, just won't care. At all. Ever.

That's your "vibe coding" for y'all.

Btw, I'm working on a benchmark for #a11y #accessibility stuff for "AI".

Well, except for the flagship models from OpenAI and Anthropic, GPT 5.2 and Claude Opus 4.6: they perform SIGNIFICANTLY WORSE when expert guidance on how to build something is present.

What's at play here is synthetic training data, and that synthetic data is *bad*.

AI is such a joke.

One more big "oof", or perhaps laugh, for tonight:

gpt-3.5-turbo, the model that ChatGPT launched with over three years ago, scored 68/100 points on that benchmark (higher means better). That's the highest score of any model tested. The current gpt-5.2 scores 22/100.

Remarkable. I didn't expect models to have regressed *this* much.

This was due to scoring on the raw count of accessibility errors that appear in the output. While that's a good approach in general, newer models output significantly more tokens, which inflates their error counts. I'm experimenting with changing the scoring to an "errors per 1000 output tokens" approach, but that'll have to wait a few days.
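The core of that change is trivial. A minimal sketch in Python (the function name and the numbers are made up for illustration; this is not my actual benchmark code):

```python
# Minimal sketch of the "errors per 1000 output tokens" idea.
# Illustrative only; names and numbers are invented, not my benchmark code.

def errors_per_kilotoken(error_count: int, output_tokens: int) -> float:
    """Accessibility errors per 1000 output tokens, so a model isn't
    penalized merely for producing more output."""
    if output_tokens <= 0:
        raise ValueError("output_tokens must be positive")
    return error_count / (output_tokens / 1000)

# A terse model and a verbose model with the same raw error count:
print(errors_per_kilotoken(12, 800))   # 15.0 errors per 1000 tokens
print(errors_per_kilotoken(12, 3000))  # 4.0 errors per 1000 tokens
```

Normalized like that, a chatty model with the same error *density* no longer looks worse than a terse one just because it wrote more.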

I've spent the night building a campaign site, benchmarking even more models, and tinkering with the score calculation. And I've been trying to understand what's happening here.

I have to be somewhere in 5 hours though, and need sleep desperately.

Good night, fedi.

Also, one last time: benchmarking these models in a useful manner has cost me several hundred euros, and of the big, most expensive models I've only tested GPT 5.2, Opus 4.6, and Kimi K2.5 so far. Gemini 3.1 Pro, Claude Sonnet, and gpt-5.3-codex should also be tested before I take this to media outlets, but I can't afford that right now.

If you can, I’d really appreciate your financial support: https://steady.page/de/bye-bye-barrieren/about

"Bye bye, Barrieren"

Critical digital infrastructure is full of barriers. I document, report, and follow up on these problems as a civil-society effort.


Why I've posted about this today: after writing prompts for the last couple of days, I finalized the plan today. One commenter here said that you gotta tell AI to make stuff accessible, and I remembered the bullshit AI study by Aktion Mensch that I busted a few weeks ago.

I've started the model runs today, and I'm only a single, independent, private researcher.

So please bear with me; this project will evolve further, like everything I do.

@kc Is there any way to get access to the tasks you're giving, how you're evaluating, or any other details? Thanks!

@shriramk I’ll have a write-up ready soon.

However, because of the risk of overfitting (the prompts ending up in training data), I cannot release the benchmark prompts or the complete methodology.

@kc That's what I feared. Understood. Thanks.

(I'm scheduled to teach students about a11y in a class that's sort of about programming with agents, so these would have made for great examples. Anything you can share would be lovely.)