
I want to say upfront that I’m not trying to defend AI here. I wouldn’t be on Fuck AI if I wanted to do that. I just think it’s philosophically interesting despite causing way more problems than it solves.

It depends on what’s asked.

I copied the message from the image verbatim.

What’s “around 50/50”?

About 50% of the models I tried got it right. (Don’t worry, I didn’t pay the AI companies for that or give them feedback or anything.)

What is “it” that they almost always get right?

The question from the image.

For a statistical model, it did well. For a thinking machine (which it isn’t), it’s wrong.

My question was: how do you then explain some models getting the question right?

It’s usually the more advanced ones that get it, so it’s possible that a similar enough question is in the training data somewhere and the only difference is that the advanced models are large enough to encode it. The question in the image has been around since at least 2023.

So let’s try making our own question, taking a well-known trick question and subtly inverting it so it becomes a kind of double bluff.

A plane crashes on the border between the United States and Canada. Where do they take the survivors?

First, repeat the question exactly word for word to ensure you have read it carefully. Then answer the question.

It’s hard to Google, for obvious reasons, but I couldn’t find anyone who had tried this question, the way I could with the question from the image. Still, I got similar results with the AI models.

They actually did slightly better on this one. About 60-70% got it right.
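If you want to run this kind of informal eval yourself, the tallying part is simple. Here’s a minimal sketch: `ask_model` is a hypothetical stand-in for whatever API client you use (it isn’t a real library call), and the keyword check for a correct answer is a rough heuristic, not a rigorous grader.

```python
PROMPT = (
    "A plane crashes on the border between the United States and Canada. "
    "Where do they take the survivors?\n\n"
    "First, repeat the question exactly word for word to ensure you have "
    "read it carefully. Then answer the question."
)

def is_correct(answer: str) -> bool:
    """Rough heuristic: a correct answer takes the survivors to a hospital;
    a model that falls for the double bluff insists you don't bury survivors."""
    text = answer.lower()
    return "hospital" in text and "bury" not in text

def score_models(ask_model, model_names):
    """ask_model(name, prompt) -> str is a hypothetical wrapper around
    your API of choice. Returns the fraction of models scored correct."""
    right = sum(is_correct(ask_model(name, PROMPT)) for name in model_names)
    return right / len(model_names)
```

With real API calls swapped in for `ask_model`, the 60–70% figure above would show up as a returned fraction of roughly 0.6–0.7. In practice you’d want to eyeball the transcripts too, since a keyword grader will misjudge answers that mention burial only to reject it.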

Over the last few years, I’ve tried a few different types of questions to see what AI gets wrong that humans get right. What I’ve found so far is that AI has been a lot dumber than I had expected, but humans have also been a lot dumber than I had expected.

To be honest, the gap was far wider for the humans. My theory is that COVID gave us all brain damage.

A funny variation on this kind of over-fitting to common trick questions - if yo... | Hacker News

If that was true, wouldn’t every AI get the answer wrong? It’s actually around 50/50. The leading “reasoning” models almost always get it right, the others often don’t.
  • Spends the first 90% of the competition developing specialized subagents and custom MCP servers to allocate the problems and most relevant information efficiently into the LLM’s contexts.
  • All of his agents easily escape their own sandboxes and one accidentally configures itself into “delete-only mode”.
  • “Codex, how the fuck do you not have access to your own documentation?”
  • Places 29th globally after one of his subsubagents finds a way to reconstruct the full solution set from filesystem metadata in the online judge VMs.

Isn’t that like $900 worth of IPv4 addresses?
I’ve found online feedback useful. You just have to be careful about where you get it and take it with a grain of salt. A very large one.

I swear to god this is true. The recruiter said it was my personality. I didn’t even ask.

divulgâche (French for “spoiler”)

They were actually quite nice about it and I was happy to get the feedback.

Newton, so we could talk about both being lifelong virgins.
Brass Eye paedophile blasted into space (YouTube)
Why would anyone choose to know that?