Mastodawn

I've never been opposed to the word "hallucinating" for describing how AI makes mistakes ... until now.

I just talked to someone who thought AI hallucinations would be obvious because it would be obvious if you talked to a *person* who was hallucinating.

In other words, they equated "hallucination" with "sounds wacko" and accepted AI output as true because it sounded level-headed.

1/2

Mignon Fogarty 4d ago

The word "hallucination" isn't going away — it's a widely used industry term — but we need to explain it better for beginners:

"Hallucination" is just a fancy word for "confidently makes mistakes":

"Remember: AI hallucinates, and you need to confirm all facts" should be something like "Remember: AI confidently makes mistakes, and you need to confirm all facts" or "AI tells you things that are wrong in a way that sounds completely believable. Confirm all facts!"

Orion (he/him) 4d ago

@grammargirl This is a good example of why that term is so dangerous. Thank you for posting it.

That said, while I have zero hope of making that term go away, we also have the word "slop" as a counter.

"Ugh. It had a hallucination..."

"Yup. And the results are now slop."

That said, I don't use "hallucination" myself in the "AI" context. I refer to the error rate, which, last I checked, hovered around 40%.

Mignon Fogarty 4d ago

@orionkidder Good point.

Also, the error rate now highly depends on which model you're talking about, but I think that's the rate for those that are most widely used -- e.g., the free models.

Orion (he/him)

@grammargirl I'm seeing people claim the error rate is lower with other models, and I'm not sure I believe that, since this industry just piles lies on top of lies, but the only plausible explanation of the lowered error rate I've seen is for Claude Code.

Orion (he/him) 4d ago

@grammargirl If I understand correctly, it shoves every query through the "AI" multiple times and tests whether it does the thing it's asked to do, but of course, it hides all of that from the user.
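
For anyone who wants to picture the loop described above, here is a minimal sketch of a generic generate-and-test pattern. It is only an illustration of the idea as stated in the post, not a description of how Claude Code actually works. The names `generate_and_test`, `fake_model`, and `passes_check` are hypothetical, and `fake_model` is a scripted stand-in rather than a real model call.

```python
from typing import Callable, Optional


def generate_and_test(
    generate: Callable[[str, int], str],
    check: Callable[[str], bool],
    prompt: str,
    max_attempts: int = 5,
) -> Optional[str]:
    """Return the first candidate that passes the check, or None if none do."""
    for attempt in range(max_attempts):
        candidate = generate(prompt, attempt)
        if check(candidate):
            return candidate
    return None


def fake_model(prompt: str, attempt: int) -> str:
    # Hypothetical stand-in for a model: the first two answers are wrong.
    answers = [
        "def add(a, b): return a - b",
        "def add(a, b): return a * b",
        "def add(a, b): return a + b",
    ]
    return answers[min(attempt, len(answers) - 1)]


def passes_check(source: str) -> bool:
    # Run the candidate code and test whether it does what was asked.
    namespace: dict = {}
    try:
        exec(source, namespace)
        return namespace["add"](2, 3) == 5
    except Exception:
        return False


result = generate_and_test(fake_model, passes_check, "write add(a, b)")
print(result)
```

The user only sees the final `result`; the two rejected attempts stay hidden, which is the behaviour the post is describing. Note that the loop is only as good as its check: a candidate that passes a weak test can still be wrong.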

Orion (he/him) 4d ago

@grammargirl To me, that feels like a brute-force workaround, a kludge, not an improvement in the tech itself. It's like saying, my car is too slow, so I'll attach a second engine to the hood.

Riley S. Faelan 3d ago

@orionkidder No, that's probably how human brains do it. The genAI loop is wacky in other ways, but testing its results is not a wacky part of it.

@grammargirl

C++ Wage Slave 3d ago

@orionkidder @grammargirl

I'm obliged to use LLMs at work.

In my limited experience, the error rate depends on whether the question you ask is covered by the model's training data. If so, the error rate will be fairly low (though not so low that the model becomes trustworthy). Otherwise, the error rate will approach 100% as the model just makes something up.

Of course, you never know what was in the training data, so you don't even know how reliable you can expect the model to be. In my experience, asking an LLM about material you can't find with a careful Web search is a good way to produce a screenful of friendly, grammatical, plausible rubbish.

MarjorieR 3d ago

@CppGuy @orionkidder @grammargirl the error rate also depends on what is in the training data.
That is no doubt part of the problem with Grok, as its training data contains many unreliable statements garnered from X, as well as deliberate falsifications that were added.
If the training data were just Wikipedia, you would get more reliable results.
For other AI vendors, adding random chats from Facebook or Instagram, or AI-generated websites, will also lower the accuracy.
Claude Code may be slightly better, for now, because it is just plagiarising code. This won't last as the code repositories fill up with AI slop and these are flagged up as such and excluded by the crawlers.
The other reason is that GenAI doesn't simply reproduce single sources, whatever their accuracy. It acts as a stochastic mixer: if you see an AI-generated legal case reference, some of it may come from one citation and some from another, and the legal inference drawn may be from something entirely different.
Likewise if you ask GenAI to summarise a document it may well incorporate words from its training data as well as the text you supply.

C++ Wage Slave 3d ago

@marjolica @orionkidder @grammargirl

I can't comment on Grok: I've never had an X account.

Claude Code has its problems. I use it not to generate code but to explore ways of working with unfamiliar libraries and languages when I can't find answers on the Web. (A library is a body of code packaged up by one developer or organisation for others to use.) I find it's wrong more often than it's right.

As an experiment, I once used Jira's AI to summarise a detailed comment that I'd written myself. The result was shorter, sure, but it was meaningless and unusable. After that experience, I never use an AI to summarise or rewrite anything.

Riley S. Faelan 29m ago

@marjolica FWIW, I kind of suspect that a big reason why the YouTube recommender has been so problematic for years now is that it treats its output as its input. Autoplay amplifies this effect.

@CppGuy @orionkidder @grammargirl