When you tell AI models what specifically to look out for in a coding task…

…they repeatedly, consistently, just won't care. At all. Ever.

That's your "vibe coding" for y'all.

Btw, I'm working on a benchmark for #a11y #accessibility stuff for "AI".

Well, except for the flagship models of OpenAI and Anthropic, GPT 5.2 and Claude Opus 4.6. They perform SIGNIFICANTLY WORSE when expert guidance on how to build something is present.

What we're seeing at play here is synthetic data, and that synthetic data is *bad*.

AI is such a joke.

One more big "oof", or perhaps laugh, for tonight:

gpt-3.5-turbo, the model ChatGPT launched with over three years ago, scored 68/100 points on that benchmark (higher is better). That's the highest score of any model tested. The current gpt-5.2 scores 22/100.

Remarkable. I didn't expect that models regressed *this* much.

This was due to scoring the overall count of accessibility errors in the output. While that is a reasonable approach in general, newer models output significantly more tokens, so they accumulate more absolute errors. I'm experimenting with changing the scoring to an "errors per 1000 output tokens" approach, but that'll have to wait a few days.
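The normalization described above could be sketched roughly like this. This is a hypothetical helper, not the benchmark's actual scoring code; the function name and example numbers are made up for illustration.

```python
def errors_per_kilotoken(error_count: int, output_tokens: int) -> float:
    """Normalize an accessibility-error count by output length.

    Raw error counts penalize verbose models, so we scale the count
    to errors per 1000 output tokens instead.
    """
    if output_tokens <= 0:
        raise ValueError("output_tokens must be positive")
    return error_count / output_tokens * 1000


# Two models with the same raw error count, different verbosity:
terse = errors_per_kilotoken(12, 800)     # 15.0 errors per 1000 tokens
verbose = errors_per_kilotoken(12, 4000)  # 3.0 errors per 1000 tokens
```

Under this scheme the verbose model scores better per token despite producing the same number of raw errors, which is exactly the trade-off such a change would have to weigh.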
@kc Is there any way to get access to the tasks you're giving, how you're evaluating, or any other details? Thanks!

@shriramk I’ll have a write-up ready soon.

However, because of overfitting, I cannot release the benchmark prompts and the complete methodology.

@kc That's what I feared. Understood. Thanks.

(I'm scheduled to teach students about a11y in a class that is sort of about programming with agents, so these would make for great examples. Anything you can share would be lovely.)