When you tell AI models what specifically to look out for in a coding task…

…they repeatedly, consistently, just won't care. At all. Ever.

That's your "vibe coding" for y'all.

Btw, I'm working on a benchmark for #a11y #accessibility stuff for "AI".

Well, except for the flagship models of OpenAI and Anthropic, GPT 5.2 and Claude Opus 4.6. They perform SIGNIFICANTLY WORSE when expert guidance on how to build something is present.

What we're seeing at play here is synthetic data, and that synthetic data is *bad*.

AI is such a joke.

One more big "oof", or perhaps laugh, for tonight:

gpt-3.5-turbo, the model ChatGPT launched with over three years ago, scored 68/100 points on that benchmark (higher is better). That's the highest score of any model tested. The current gpt-5.2 scores 22/100.

Remarkable. I didn't expect that models regressed *this* much.

This was due to scoring the overall count of accessibility errors in the output. While that is a reasonable approach in general, newer models output significantly more tokens, so they accumulate more absolute errors. I'm experimenting with changing the scoring to an "errors per 1000 output tokens" approach, but that'll have to wait a few days.
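The normalization described above could be sketched roughly like this. This is a hypothetical helper, not the benchmark's actual scoring code; the function name and example numbers are made up for illustration.

```python
def errors_per_kilotoken(error_count: int, output_tokens: int) -> float:
    """Normalize an accessibility-error count by output length.

    Raw error counts penalize verbose models, so we scale the count
    to errors per 1000 output tokens instead.
    """
    if output_tokens <= 0:
        raise ValueError("output_tokens must be positive")
    return error_count / output_tokens * 1000


# Two models with the same raw error count, different verbosity:
terse = errors_per_kilotoken(12, 800)     # 15.0 errors per 1000 tokens
verbose = errors_per_kilotoken(12, 4000)  # 3.0 errors per 1000 tokens
```

Under this scheme the verbose model scores better per token despite producing the same number of raw errors, which is exactly the trade-off such a change would have to weigh.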
@kc Is there any way to get access to the tasks you're giving, how you're evaluating, or any other details? Thanks!

@shriramk I’ll have a write-up ready soon.

However, because of overfitting, I cannot release the benchmark prompts and the complete methodology.

@kc That's what I feared. Understood. Thanks.

(I'm scheduled to teach students about a11y in a class that is sort of about programming with agents, so these would make for great examples. Anything you can share would be lovely.)