A cool test of how readily different #AI models #hallucinate: the #BullshitBenchmark

The #Claude and #Qwen models seem to push back more when confronted with nonsensical questions. The #OpenAI models do not fare well.

Blog post: https://adam.holter.com/bullshitbench-v2-claude-and-qwen-are-the-only-models-that-push-back/
Results: https://petergpt.github.io/bullshit-benchmark/viewer/index.v2.html

#LLM

BullshitBench v2: Claude and Qwen Are the Only Models That Push Back - Adam Holter

BullshitBench v2 is out. Peter Gostev tested 70+ model variants across 100 questions spanning coding, medical, legal, finance, and physics. The benchmark measures one specific thing: whether a model will push back against a plausible-sounding but factually wrong statement, or just go along with it. Only two model families score meaningfully above 60% on bullshit […]
