Meta. OpenAI. Google.

Your AI chatbot is not *hallucinating*.

It's bullshitting.

It's bullshitting, because that's what you designed it to do. You designed it to generate seemingly authoritative text "with a blatant disregard for truth and logical coherence," i.e., to bullshit.

@ct_bergstrom Disagree. They're designed to mimic what a human would write. If they end up bullshitting, it's because the models aren't good enough, not because that's what they're designed to do.

@moultano Humans have an underlying knowledge model. They have beliefs about the world, and choose whether to represent those beliefs accurately or inaccurately using language.

LLMs do not have an underlying knowledge model; they don't have a concept of what is true or false in the world. They just string together words they don't "understand" in ways that are likely to seem credible.

It's not a matter of making better LLMs; it'll take a fundamentally different type of model.

@ct_bergstrom LLMs represent whether they "believe something to be true" in a way that you can extract unsupervised. Not disagreeing that their world model isn't good enough to be used without auxiliary retrieval, but there's some evidence they have one. https://arxiv.org/abs/2212.03827
Discovering Latent Knowledge in Language Models Without Supervision

Existing techniques for training language models can be misaligned with the truth: if we train models with imitation learning, they may reproduce errors that humans make; if we train them to generate text that humans rate highly, they may output errors that human evaluators can't detect. We propose circumventing this issue by directly finding latent knowledge inside the internal activations of a language model in a purely unsupervised way. Specifically, we introduce a method for accurately answering yes-no questions given only unlabeled model activations. It works by finding a direction in activation space that satisfies logical consistency properties, such as that a statement and its negation have opposite truth values. We show that despite using no supervision and no model outputs, our method can recover diverse knowledge represented in large language models: across 6 models and 10 question-answering datasets, it outperforms zero-shot accuracy by 4% on average. We also find that it cuts prompt sensitivity in half and continues to maintain high accuracy even when models are prompted to generate incorrect answers. Our results provide an initial step toward discovering what language models know, distinct from what they say, even when we don't have access to explicit ground truth labels.

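The idea in that abstract (Contrast-Consistent Search) can be sketched in a few lines. The following is a minimal, illustrative toy, not the authors' code: the "activations" are random stand-ins with a planted truth coordinate, and every name here (`phi_pos`, `phi_neg`, `probe`, ...) is hypothetical.

```python
# Toy sketch of unsupervised truth probing (CCS-style). Assumption: in the real
# method, phi_pos/phi_neg would be LLM hidden states for a question phrased as
# "Q? Yes." and "Q? No."; here we fabricate them so the script runs standalone.
import torch

torch.manual_seed(0)
n_statements, hidden_dim = 256, 64

truth = torch.randint(0, 2, (n_statements, 1)).float()   # never shown to the probe
phi_pos = torch.randn(n_statements, hidden_dim)
phi_neg = torch.randn(n_statements, hidden_dim)
# Plant the (unknown) truth value in coordinate 0 of the synthetic activations.
phi_pos[:, 0] += 4.0 * truth.squeeze()
phi_neg[:, 0] += 4.0 * (1.0 - truth.squeeze())

# Linear probe p(phi) = sigmoid(w . phi + b), trained with no labels at all.
w = (0.1 * torch.randn(hidden_dim, 1)).requires_grad_()
b = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([w, b], lr=0.05)

def probe(phi):
    return torch.sigmoid(phi @ w + b)

for _ in range(500):
    p_pos, p_neg = probe(phi_pos), probe(phi_neg)
    # Logical consistency: a statement and its negation should get opposite truth values.
    consistency = ((p_pos - (1.0 - p_neg)) ** 2).mean()
    # Confidence: rule out the degenerate answer p_pos = p_neg = 0.5 everywhere.
    confidence = (torch.minimum(p_pos, p_neg) ** 2).mean()
    loss = consistency + confidence
    opt.zero_grad()
    loss.backward()
    opt.step()

# Read off answers; the learned direction's sign is arbitrary, so allow a global flip.
pred = ((probe(phi_pos) + (1.0 - probe(phi_neg))) / 2 > 0.5).float()
acc = (pred == truth).float().mean().item()
print(f"unsupervised probe accuracy on synthetic data: {max(acc, 1.0 - acc):.2f}")
```

The consistency term alone has a trivial solution (predict 0.5 for everything), which is why it is paired with the confidence term; this toy also skips the activation normalization and repeated random restarts the paper uses to avoid other degenerate directions.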
@ct_bergstrom Another way of disentangling things: for the question you're asking, does the answer exist on the web? If it does, then the problem can't be with the "design" (i.e., the training regime) but rather with the power of the model.
@moultano @ct_bergstrom The answer may exist on the web along with its negation.