LLMs have no model of correctness, only typicality. So:

“How much does it matter if it’s wrong?”

It’s astonishing how frequently both providers and users of LLM-based services fail to ask this basic question — which I think has a fairly obvious answer in this case, one that the research bears out.

(Repliers, NB: Research that confirms the seemingly obvious is useful and important, and “I already knew that” is not information that anyone is interested in except you.)

1/ https://www.404media.co/chatbots-health-medical-advice-study/

Chatbots Make Terrible Doctors, New Study Finds

Chatbots provided incorrect, conflicting medical advice, researchers found: “Despite all the hype, AI just isn't ready to take on the role of the physician.”

404 Media

Despite the obviousness of the larger conclusion (“LLMs don’t give accurate medical advice”), this passage is…if not surprising, exactly, then at least really, really interesting.

2/

There’s a lesson here, perhaps, about the tangled relationship between what is •typical• and what is •correct•, and what it is that LLMs actually do:

When medical professionals ask medical questions in technical medical language, the answers they get are typically correct.

When non-professionals ask medical questions in a perhaps medically ill-formed vernacular mode, the answers they get are typically wrong.

The LLM readily models both of these things. Despite having no notion of correctness in either case, correctness is more statistically typical in one than the other.

3/

RE: https://girlcock.club/@miss_rodent/116041738842160668

This is a different, crisper way of saying what I meant by the previous post: if it sounds like a medical textbook, you’re more likely to get a diagnosis; if it sounds like a tweet, you’re more likely to get a shitpost.

The tone, vocabulary, and style of the question change the likelihood that the answer is correct.
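
(For the code-inclined, a minimal sketch of the point. The OpenAI client and model name below are stand-ins for whatever chat-completion API you’d use, not anything from the study; the model never knows which reply is correct, only which continuation is typical for a prompt that sounds like that.)

# Sketch only: the same underlying complaint, asked in two registers.
from openai import OpenAI  # any chat-completion client would do

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

clinical = (
    "45-year-old male, BMI 31, presenting with exertional dyspnea, "
    "bilateral ankle edema, and orthopnea. Differential diagnosis?"
)
vernacular = (
    "why do i get so out of breath on the stairs and why are my "
    "ankles all puffy lol, should i be worried"
)

for prompt in (clinical, vernacular):
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    print(reply.choices[0].message.content)
    # The clinical prompt keeps statistical company with textbooks and
    # case reports; the vernacular one keeps company with tweets and
    # forum threads. Typicality, not correctness, sets the register of
    # what comes back.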

4/

V (@[email protected])

@[email protected] This result makes sense - they generate *statistically likely* text based on a prompt, and the stolen words of basically the entire internet and several libraries worth of books. If the prompt is such that the text it generates is statistically-likely to be correct - the language used closely aligns with a medical textbook, diagnostic manual, etc. - it's more likely to generate text based on sources like that. If it sounds like a tweet, you're more likely to get a shitpost.

Girlcock.club
@inthehands Interesting results. The credo of many specialized chatbot firms is that "the human in the loop" is still needed, meaning an expert in their field can make the best use of support from an AI. I thought that was mainly because the expert can spot hallucinations. But the expert's more expert-sounding inputs being what produce more expert-ish outputs is a new aspect of this.
@inthehands ai helps the pros 10x, the novice not so much
@inthehands I continue to be well-served by treating LLMs as fancy autocomplete and not anthropomorphizing them. I feel like the chat interface is where things went sideways, making it too easy to believe that they "think"
@inthehands Worth noting, however, that when the training set captures a lot of outdated or irrelevant information, because the field has advanced rapidly since the model was trained, "typical" can start to diverge again. This can be mitigated if the practitioner knows to consult the latest information (either by reading it or by feeding it to the model as a part of the query) but of course they have to be aware of that. This is I suppose no worse than relying on the practitioner's knowledge.
@inthehands OTOH, as practitioners come to rely on stochastic information retrieval for more and more diagnoses, as it confirms what they already know, it may cause them to assign more weight to the information in the model than is justified, overruling their own second thoughts. ("Computer says...")
@inthehands One of the factors in this mess is the heavily-boosted notion that LLMs contain facts or knowledge. Coincidentally, sort of, but not really. A safer mental model is to think of them as a fuzzy virtual machine of sorts, not unlike a vibe-y JVM but programmed in something dressed as plain language. Garbage-in-garbage-out. Often anything-in-garbage-out.

@inthehands

I use Claude in my IDE every day. The LLM can only return what it identifies as Appropriate.

And the LLM will be the first to tell you so.

Particularly good:
>Despite having no notion of correctness in either case, correctness is more statistically typical in one than the other.

@inthehands Obvious to me. Having the same family doctor who knows you all for 20 years really is important and an immense privilege.
@inthehands This is why experienced developers can make use of LLMs, and why LLMs won't replace them.

@troed @inthehands

I see the high end #LLM experience like riding a good horse — exceptionally skilled in horsey things, moving fast, etc — an augmentation tool that’s exceptionally easy to use to augment your own abilities, not an #AI.

Ref 🧵https://federate.social/@Roundtrip/115549029949917075

Greg Lloyd (@[email protected])

@[email protected] Nice essay! I’ve been experimenting with Claude Sonnet 4.5’s extended thinking and web search as a research tool. I see the high end #LLM experience like working with a good horse—exceptionally skilled in horsey things, moving fast, etc.—a tool that’s exceptionally easy to use to augment your own abilities, not an #AI. Simile warning: I rode a horse—only when I was led around on one in grade school. But I’m a horse fan from Hopalong Cassidy through today.

federate.social
@inthehands An aside. When people used to ask Dawn wasn’t it hard to treat animals because “they can’t tell you what’s wrong,” she’d answer that they also can’t lie about it. She thought the latter probably outweighed the former.

@marick
That’s profound.

(Though also: I know that guinea pigs can be notoriously difficult to diagnose because, as prey animals, they’re very good at hiding that they have a problem!)

@inthehands Cattle are the same way. They’re very stoic. Llamas too, I think. I think it was a llama that was brought in because it seemed “a little off.” When they ran tests, it was hard to believe it could still be alive.

As compared to horses, who are complete wimps.