I saw a post recently in which someone used LLM tools to analyze someone else’s software, which eventually led them to a conclusion that was flat-out wrong. Not only that, the LLM drew conclusions about the *authors* behind the code that bordered on character assassination. Yet this person posted the output as though it were some kind of deep insight.

These LLM outputs are not independent thoughts. The LLM probably picked up hints of the user’s (maybe unconscious) biases from the prompts in its context window, and regurgitated something that confirmed those biases. The user was pleased that their biases were confirmed (Independently! By an impartial LLM!), and they posted the output, maybe as vindication of their insight.

These models’ sycophancy can be subtle. They don’t have to state “You’re absolutely right!” to blow smoke up your ass. Sometimes they confirm your preconceived notions only after appearing to “evaluate” the information “independently”.

#ai

Remember, LLMs are trained by humans who reward the models for producing output that “meets their expectations”. This kind of training cannot help but reward output that pleases the user, regardless of accuracy. Even if the most blatant sycophancy is explicitly addressed during training, *subtle* sycophancy is likely impossible to avoid, because it is indistinguishable from “meeting expectations” to human trainers.

#ai

@drahardja …. Yes, and the reward function is pass/fail, so it gets rewarded for sounding confident and bluffing. Instead, you’d prefer a model that says “I’m 50% certain, given this information.”
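To make that incentive concrete, here’s a toy sketch of the expected-reward math (purely illustrative; the grading policy and numbers are assumptions, not any real training setup):

```python
# Toy illustration: expected reward under a binary pass/fail grader,
# assuming a hedged "I'm 50% certain" answer is scored as not-helpful (0).
p_correct = 0.5  # the model's actual confidence in its best guess

# Confident bluff: graded correct half the time, wrong half the time.
reward_confident_bluff = p_correct * 1 + (1 - p_correct) * 0   # = 0.5

# Honest hedge: a pass/fail check rarely counts "I'm 50% sure" as a pass.
reward_honest_hedge = 0.0

print(reward_confident_bluff)  # 0.5 -> bluffing earns more reward on average
print(reward_honest_hedge)     # 0.0 -> so training reinforces the bluff
```

Under those assumptions, the confident guess wins on expected reward even though it’s no more accurate, which is exactly the pressure toward confident-sounding bluffing.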