When you tell AI models what specifically to look out for in a coding task…

…they repeatedly, consistently, just won't care. At all. Ever.

That's your "vibe coding" for y'all.

Btw, I’m working on a benchmark for #a11y #accessibility stuff for „AI“.

Well, except for the flagship models of OpenAI and Anthropic, GPT 5.2 and Claude Opus 4.6. They perform SIGNIFICANTLY WORSE when expert guidance on how to build something is present.

What we're seeing at play here is synthetic data, and that synthetic data is *bad*.

AI is such a joke.

@kc

WTF
Sadly the joke isn't funny at all

Do you have an explanation for this?

The regression could be caused by accessibility being generally underrepresented.
I would assume that representation drops as projects become less visible, meaning large, well-known projects contain more accessibility work than obscure code snippets in the dark corners of the internet.

If this is the case, growing the training data by scraping the last bit of code would lead to a statistically worse representation of accessibility.
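To make the dilution argument concrete, here is a minimal arithmetic sketch. The corpus sizes and accessibility rates are made-up numbers chosen purely for illustration, not measurements, and `accessible_share` is a hypothetical helper name.

```python
def accessible_share(corpora):
    """Return the overall fraction of samples that follow accessibility practices.

    corpora: list of (sample_count, accessible_fraction) pairs.
    """
    total = sum(count for count, _ in corpora)
    accessible = sum(count * fraction for count, fraction in corpora)
    return accessible / total


# Hypothetical numbers, for illustration only.
well_known = (1_000_000, 0.30)  # large, visible projects
obscure = (9_000_000, 0.05)     # snippets from the dark corners

before = accessible_share([well_known])
after = accessible_share([well_known, obscure])

print(f"share before scraping more: {before:.3f}")
print(f"share after scraping more:  {after:.3f}")
```

Under these assumed numbers, the absolute amount of accessible code goes up, but its share of the whole corpus goes down, which is the effect described above.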

The worse performance with expert guidance is "interesting". It shows again the core problem of LLMs, or any existing AI: they don't, and can't, reason.
Nevertheless, I would expect that providing the expert guidance would increase the statistical correlation to the intended outcome.
But I could also imagine that there is a threshold of underrepresentation, below which the expert guidance correlates more strongly with random outcomes than with the intended outcome.

Tongue in cheek, there is a simple solution

The AI competitors could "solve" this by increasing the representation of accessibility in the training data by financing a massive push for accessibility.

That would be money well spent even when AI fails in the end. But I sadly don't expect it to happen

@realn2s I have a broad idea of what's going on here, but I haven't verified it yet. My assumption is that the models are "overthinking" the described guidelines, which leads to more complex outputs. However, the data shows that the outputs of these guided prompts, after reasoning, are generally shorter than the outputs of unguided ones. To verify this, I'll need a way to judge the complexity of the result, but that might be a stretch for a project like this.
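One rough way to judge complexity independently of output length might be to count structure in the generated markup. The following is only a sketch under assumptions: it assumes the model outputs are HTML, and the three metrics (element count, maximum nesting depth, number of `role`/`aria-*` attributes) are my own illustrative picks, not what the benchmark actually measures. `ComplexityCounter` and `complexity` are hypothetical names.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not affect depth tracking.
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


class ComplexityCounter(HTMLParser):
    """Collects crude structural metrics from an HTML string."""

    def __init__(self):
        super().__init__()
        self.elements = 0
        self.aria_and_role_attributes = 0
        self.depth = 0
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        self.elements += 1
        # Attribute names arrive lowercased from HTMLParser.
        self.aria_and_role_attributes += sum(
            1 for name, _ in attrs if name == "role" or name.startswith("aria-")
        )
        if tag not in VOID_ELEMENTS:
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag not in VOID_ELEMENTS and self.depth > 0:
            self.depth -= 1


def complexity(html):
    """Return element count, max nesting depth of non-void elements, and ARIA usage."""
    counter = ComplexityCounter()
    counter.feed(html)
    counter.close()
    return {
        "elements": counter.elements,
        "max_depth": counter.max_depth,
        "aria_and_role_attributes": counter.aria_and_role_attributes,
    }


# Two hand-written snippets, not real model outputs.
plain = "<button>Save</button>"
guided = '<div role="button" aria-label="Save" tabindex="0"><span>Save</span></div>'

print(complexity(plain))
print(complexity(guided))
```

A metric like this says nothing about whether the result is actually accessible; it would only help check whether guided outputs are structurally heavier even when they are shorter in characters.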