I've never been opposed to the word "hallucinating" for describing how AI makes mistakes ... until now.

I just talked to someone who thought AI hallucinations would be obvious because it would be obvious if you talked to a *person* who was hallucinating.

In other words, they equated "hallucination" with "sounds wacko" and accepted AI output as true because it sounded level-headed.

The word "hallucination" isn't going away — it's a widely used industry term — but we need to explain it better for beginners:

"Hallucination" is just a fancy word for "confidently makes mistakes":

"Remember: AI hallucinates, and you need to confirm all facts" should be something like "Remember: AI confidently makes mistakes, and you need to confirm all facts" or "AI tells you things that are wrong in a way that sounds completely believable. Confirm all facts!"

@grammargirl This is a good example of why that term is so dangerous. Thank you for posting it.

That said, while I have zero hope of making that term go away, we also have the word "slop" as a counter.

"Ugh. It had a hallucination..."

"Yup. And the results are now slop."

That said, I don't use "hallucination" in the "AI" context myself. I refer to the error rate, which, last I checked, hovered around 40%.

@orionkidder Good point.

Also, the error rate now highly depends on which model you're talking about, but I think that's the rate for those that are most widely used -- e.g., the free models.

@grammargirl I'm seeing people claim the error rate is lower with other models, and I'm not sure I believe that, since this industry just piles lies on top of lies. The only plausible explanation of a lowered error rate I've seen is for Claude Code.

@orionkidder @grammargirl

I'm obliged to use LLMs at work.

In my limited experience, the error rate depends on whether the question you ask is covered by the model's training data. If so, the error rate will be fairly low (though not so low that the model becomes trustworthy). Otherwise, the error rate will approach 100% as the model just makes something up.

Of course, you never know what was in the training data, so you don't even know how reliable you can expect the model to be. In my experience, asking an LLM about material you can't find with a careful Web search is a good way to produce a screenful of friendly, grammatical, plausible rubbish.

@CppGuy @orionkidder @grammargirl the error rate also depends on what is in the training data.
That's no doubt part of the problem with Grok, as its training data contains many unreliable statements garnered from X, as well as deliberately added falsifications.
If the training data was just Wikipedia you would get more reliable results.
For other AI vendors, adding random chats from Facebook or Instagram, or AI-generated websites, will also lower the accuracy.
Claude Code may be slightly better, for now, because it is just plagiarising code. This won't last as the code repositories fill up with AI slop and these are flagged up as such and excluded by the crawlers.
The other reason is that GenAI doesn't simply reproduce single sources, whatever their accuracy. It acts as a stochastic mixer: if you see an AI-generated legal case reference, some of it may come from one citation and some from another, and the legal inference drawn may be from something entirely different.
Likewise if you ask GenAI to summarise a document it may well incorporate words from its training data as well as the text you supply.

@marjolica @orionkidder @grammargirl

I can't comment on Grok: I've never had an X account.

Claude Code has its problems. I use it not to generate code but to explore ways of working with unfamiliar libraries and languages when I can't find answers on the Web. (A library is a body of code packaged up by one developer or organisation for others to use.) I find it's wrong more often than it's right.

As an experiment, I once used Jira's AI to summarise a detailed comment that I'd written myself. The result was shorter, sure, but it was meaningless and unusable. After that experience, I never use an AI to summarise or rewrite anything.

@marjolica FWIW, I kind of suspect that a big reason why the YouTube recommender has been so problematic for years now is that it treats its output as its input. Autoplay amplifies this effect.

@CppGuy @orionkidder @grammargirl