Mastodawn

Carl T. Bergstrom Feb 13, 2023

Meta. OpenAI. Google.

Your AI chatbot is not *hallucinating*.

It's bullshitting.

It's bullshitting, because that's what you designed it to do. You designed it to generate seemingly authoritative text "with a blatant disregard for truth and logical coherence," i.e., to bullshit.

Ryan Moulton Feb 13, 2023

@ct_bergstrom Disagree. They're designed to mimic what a human would write. If they end up bullshitting it's because the models aren't good enough, not because that's what they're designed to do.

Carl T. Bergstrom Feb 13, 2023

@moultano Humans have an underlying knowledge model. They have beliefs about the world, and choose whether to represent those beliefs accurately or inaccurately using language.

LLMs do not have an underlying knowledge model, they don't have a concept of what is true or false in the world. They just string together words they don't "understand" in ways that are likely to seem credible.

It's not a matter of making better LLMs; it'll take a fundamentally different type of model.

Ryan Moulton Feb 13, 2023

@ct_bergstrom LLMs represent whether they "believe something to be true" in a way that you can extract unsupervised. Not disagreeing that their world model isn't good enough to be used without auxiliary retrieval, but there's some evidence they have one. https://arxiv.org/abs/2212.03827

Discovering Latent Knowledge in Language Models Without Supervision

Existing techniques for training language models can be misaligned with the truth: if we train models with imitation learning, they may reproduce errors that humans make; if we train them to generate text that humans rate highly, they may output errors that human evaluators can't detect. We propose circumventing this issue by directly finding latent knowledge inside the internal activations of a language model in a purely unsupervised way. Specifically, we introduce a method for accurately answering yes-no questions given only unlabeled model activations. It works by finding a direction in activation space that satisfies logical consistency properties, such as that a statement and its negation have opposite truth values. We show that despite using no supervision and no model outputs, our method can recover diverse knowledge represented in large language models: across 6 models and 10 question-answering datasets, it outperforms zero-shot accuracy by 4% on average. We also find that it cuts prompt sensitivity in half and continues to maintain high accuracy even when models are prompted to generate incorrect answers. Our results provide an initial step toward discovering what language models know, distinct from what they say, even when we don't have access to explicit ground truth labels.

Joe ❌👑 Feb 14, 2023

@moultano @ct_bergstrom They consume language and then produce language. Their "beliefs" can be about the structure of the English language (when generating text in English), like that adjectives that describe color always go after adjectives that describe size: "the little red hen", not "the red little hen". But they don't have a model of the external world.

drg40

@not2b @moultano @ct_bergstrom Except there is no single English, and you need to have a back story of your life and education and produce the relevant text. As far as I can see, the current AI-generated text reads like a machine trying to mimic a pompous ass deliberately trying to cause offence.
The media have discovered AI. The BS brigade have moved in.