Super frustrated with all the cheerleading over chatbots for search, so here's a thread of presentations of my work with Chirag Shah on why this is a bad idea. Follow threaded replies for:

op-ed
media coverage
original paper
conference presentation

Please boost whichever (if any) speak to you.

Chatbots are not a good replacement for search engines

https://iai.tv/articles/all-knowing-machines-are-a-fantasy-auid-2334

All-knowing machines are a fantasy | Emily M. Bender and Chirag Shah

The idea of an all-knowing computer program comes from science fiction and should stay there. Despite the seductive fluency of ChatGPT and other language models, they remain unsuitable as sources of knowledge. We must fight against the instinct to trust a human-sounding machine, argue Emily M. Bender & Chirag Shah.

IAI TV - Changing how the world thinks
Chatbots could one day replace search engines. Here’s why that’s a terrible idea.

Language models are mindless mimics that do not understand what they are saying—so why do we pretend they’re experts?

MIT Technology Review

Chatbots-as-search is an idea based on optimizing for convenience. But convenience is often at odds with what we need to be doing as we access and assess information.

https://www.washington.edu/news/2022/03/14/qa-preserving-context-and-user-intent-in-the-future-of-web-search/

Q&A: Preserving context and user intent in the future of web search

In a new perspective paper, University of Washington professors Emily M. Bender and Chirag Shah respond to proposals that reimagine web search as an application for large language model-driven...

UW News

Chatbots/large language models for search was a bad idea when Google proposed it, and it's still a bad idea coming from Meta, OpenAI, or You.com

https://dl.acm.org/doi/10.1145/3498366.3505816

Situating Search | Proceedings of the 2022 Conference on Human Information Interaction and Retrieval

ACM Conferences

Language models/automated BS generators only have information about word distributions. If they happen to create sentences that make sense, it's because we make sense of them. But disconnected "information" inhibits the broader project of sense-making.

https://www.youtube.com/watch?v=VY1GHbU_FYs&list=PLn0nrSd4xjjY3E1qxXpWDoF7q-Q3d6g_A&index=17

Situating Search

YouTube
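To make the "only word distributions" point concrete, here's a toy sketch. The vocabulary and probabilities below are entirely made up and stand in for what a real model estimates from huge corpora; the point is only that generation is sampling from a conditional distribution over next words, and nothing in that process checks whether the output is true.

```python
import random

# Toy conditional distribution P(next word | previous word).
# Invented numbers -- a stand-in for what a real model estimates from text corpora.
toy_distribution = {
    "the":  {"cat": 0.5, "moon": 0.3, "search": 0.2},
    "cat":  {"sat": 0.6, "is": 0.4},
    "moon": {"landing": 0.7, "is": 0.3},
    "is":   {"made": 0.5, "flat": 0.5},
}

def generate(start: str, length: int = 5) -> str:
    word, output = start, [start]
    for _ in range(length):
        dist = toy_distribution.get(word)
        if not dist:
            break
        # Sample purely by likelihood; truth never enters the computation.
        word = random.choices(list(dist), weights=list(dist.values()))[0]
        output.append(word)
    return " ".join(output)

print(generate("the"))  # e.g. "the moon is flat" -- fluent-looking, never fact-checked
```

The outputs only "make sense" because a reader supplies the sense; the model has nothing but numbers like these.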

We must not mistake a convenient plot device — a means to ensure that characters always have the information the writer needs them to have — for a roadmap to how technology could and should be created in the real world.

https://mindmatters.ai/2022/12/why-we-should-not-trust-chatbots-as-sources-of-information/

Why We Should Not Trust Chatbots As Sources of Information

On a deeper note, they say, the pursuit of absolutely certain Correct Information suffers from a fundamental flaw — it doesn’t exist.

Mind Matters
@emilymbender thanks for sharing your insight. An anecdote from my toying around with #chatGPT: I asked it to show me an example of a program written in an imaginary combination of the best from the programming languages #Python, #Julia, #golang, and #Rust.
It wrote me a nice piece of pseudo-code that made sense. Furthermore, it could explain to me which traits represented characteristics from each language. Although it probably isn't creative, it gave me an impression of creativity.

@arildsen @emilymbender Well the thing with chatbots like ChatGPT is that they are very good at exactly that: giving you an IMPRESSION that they are good at something.

But they will absolutely lie through their teeth to do it, and it will be believable lies.

@WAHa_06x36 @emilymbender that sounds like a good point, but are you actually lying if you don't KNOW that you are lying?
@arildsen @emilymbender It doesn't really matter, the end result is the same: You get fed believable bullshit, and you either come away from the interaction less informed than you were before, or you spend a long time combing through the result trying to carefully separate the truth from the fiction.
@WAHa_06x36 @emilymbender @arildsen The problem begins with the linguistically fuzzy insinuation of lying. A Chatbot can only produce results in response to a prompt and cannot lie because reflection or morality or any intention as a consciously acting agent is missing. The result can seem like a lie to us because nonsense is possibly presented like facts.
@cognisize @emilymbender @arildsen Not the point being discussed, though, is it? The question isn’t if it is moral for an AI to lie. It is that an AI will act in a manner indistinguishable from a human lying, which means it is less than useless, and actively harmful.

@WAHa_06x36 @arildsen @emilymbender This is kinda a category error, isn't it. As well-argued here, language models are incapable of producing factual statements, correct or incorrect. They can only produce poetry.

Unfortunately, we lack the language and metaphor to talk about statistical text generators and the human tendency to see peopleness everywhere doesn't help.

Language models can only write poetry

But only a person can write a poem

Allison Posts
@RAOF @WAHa_06x36 @arildsen You're referring to **Gwern**? They openly promote eugenics. Please stay out of my feed with any pointers to them.

@emilymbender @WAHa_06x36 @arildsen Urgh, sorry.

Thanks for the heads up. It's sad that some people have made AI a gateway to that cluster of terrible thinking.

That blog post only refers to a piece of Gwern's work in the opening couple of paragraphs, as framing.

The author doesn't seem to be in the SSC/Rationalist/scientific racism orbit. Maybe they don't know? (I'll try to contact them)

Thanks again for the heads up.

@RAOF @emilymbender @arildsen Those circles are absolutely packed with eugenicists and scientific racists, there’s no way they will care.
@WAHa_06x36 there has to be someone in AI research who isn't marinated in longtermism 😬
@WAHa_06x36 @RAOF @arildsen AI is rife with it, true, but also lots of folks come across Gwern's stuff and cite it while being unaware of the rest and do appreciate the heads up.
@emilymbender @RAOF @arildsen Oh, I slightly misread the comment I was responding to anyway. Entirely agreed.
@RAOF That is an entirely uninteresting distinction, isn't it. Language models speak to you like a person, and they act like a person that is lying. The fact that this isn't a conscious choice is irrelevant to the actual outcome.

@WAHa_06x36 I think it's quite an important distinction? It's fundamental to how you should interpret text generated by a language model.

If you paint two dots and a downward-facing semicircle on a rock, people immediately interpret the rock as being sad :-(

But we all know rocks can't be sad.

Similarly, language models are a really complicated pattern painted on a rock. The text they generate isn't true or false statements; it's randomly generated truthy strings. Many of the texts they generate will be interpreted as true statements, because lots of truthy strings are representations of true statements.

But saying GPT-3 lies suggests that you could make a language model that doesn't lie, or that isn't cavalier with the truth, and that's the wrong way to think about them.

Everyone knows rocks can't be sad; people don't know that language models can't tell the truth, but it's the same human cognitive failing that generates both.

@WAHa_06x36 I guess a simpler, but incorrectly anthropomorphic, way of saying that is that language models don't lie, they bullshit.
@RAOF I definitely never say that "GPT-3 lies", I say that language models lie. All of them, without exceptions.
@WAHa_06x36 @RAOF I do think it’s pretty important because “I am interacting with a person who lies to me and I may have to cajole the truth out of them” and “I am generating text with a model, but the model may generate things that are not true” leave me with very different conclusions as to how to interact with the model. For example, there’s no real point in trying to get the model to “slip up” like a suspect in a criminal investigation might. You can certainly shape the interaction like that, but then you’re just kind of hamstringing yourself.
@WAHa_06x36 @RAOF If it has to be anthropomorphic, the best I heard when I asked a while back was "It generates an answer that is very much like what a random person on the internet might answer". That is both true and useful insofar as it, as whoever wrote that pointed out, elicits about the right level of source criticism.

@arildsen @emilymbender

The positive/useful ChatGPT examples I’ve seen have mostly been coding examples.

@emilymbender The points you make about Trust are really critical. These language models are “often wrong, never in doubt.”

I have found that when asked to cite sources, OpenAI's models will generate plausible-but-false URLs.

@emilymbender a bag-of-words is still a bag-of-words, even if it’s a fancy bag.

@emilymbender Hah. That's some... loaded wording, while we're speaking of trusting interactions by default.

Raw language models are a terrible way to acquire information, but they have lots of potential as an interface, given that a lot of quick info search already happens via language recognition in "dumb" assistant software.

The right place for a chatbot in the UX isn't as the search engine; it's as a parser of the query/result in some applications, like text-to-voice.
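A minimal sketch of that division of labour (all function names below are hypothetical placeholders, not any real API): the model only reshapes the spoken query into a structured search, retrieval stays with a conventional index, and the user is still pointed at actual sources to assess.

```python
def parse_query_with_lm(spoken_text: str) -> dict:
    """Placeholder: a language model (or a far simpler parser) turns free-form
    speech into structured search parameters. It never answers the question."""
    return {"keywords": spoken_text.lower().split(), "filters": {}}

def search_index(params: dict) -> list[str]:
    """Placeholder for a conventional search engine returning source documents."""
    return ["https://example.org/result-for-" + "+".join(params["keywords"])]

def speak(text: str) -> None:
    """Placeholder text-to-voice output."""
    print(f"[voice] {text}")

def assistant(spoken_text: str) -> None:
    params = parse_query_with_lm(spoken_text)   # model reshapes the query
    results = search_index(params)              # retrieval stays with the index
    speak("Top result: " + results[0])          # user still gets a source to assess

assistant("what did Bender and Shah argue about chatbots and search")
```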

@emilymbender And even then I'd suggest that for certain applications, such as accessibility tools, the downsides in information accuracy may be tolerable.

I'm also curious about the circularity of the argument that search engines share some of the same problems. Yeah, they do, and the pushback against them in the 90s looked a lot like this, too. But that ship has sailed, hit an iceberg and sunk in the middle of the ocean. We are starting from a search engine-saturated world already.

@emilymbender Which is to say, dumb algorithms are already pushing bias, we are already giving them too much credit and they already can be compromised by hostile techniques like SEO.

I don't think AI chat is inherently more believable because it sounds more human; that's just the uncanny valley of seeing new tech. We should hope to design these to do better than the old tech, but surely the bar for usability is to not do worse, which is much easier.

@emilymbender Alright, I'll stop threading and ranting, but just one more warning. A lot of both reasonable and unreasonable observations and criticism of generative AI, chatbots and the like is starting to degenerate into straight bias against ML, which is dangerous. ML is already ubiquitous and crucial to lots of fields, from astrophysics to game development.

Let's be careful to not let reasonable warnings about big data devolve into technophobia against the research field in general.

@emilymbender I am no linguist, but I have been toying around with chatGPT to get an impression of its capabilities.
Some replies impressed me and some made me facepalm repeatedly. I remain very skeptical of their actual usefulness.
When you say that LLMs just know word distributions, what do you think of findings like this: Emergent Analogical Reasoning in Large Language Models, https://arxiv.org/abs/2212.09196?
Emergent Analogical Reasoning in Large Language Models

The recent advent of large language models has reinvigorated debate over whether human cognitive capacities might emerge in such generic models given sufficient training data. Of particular interest is the ability of these models to reason about novel problems zero-shot, without any direct training. In human cognition, this capacity is closely tied to an ability to reason by analogy. Here, we performed a direct comparison between human reasoners and a large language model (the text-davinci-003 variant of GPT-3) on a range of analogical tasks, including a novel text-based matrix reasoning task closely modeled on Raven's Progressive Matrices. We found that GPT-3 displayed a surprisingly strong capacity for abstract pattern induction, matching or even surpassing human capabilities in most settings. Our results indicate that large language models such as GPT-3 have acquired an emergent ability to find zero-shot solutions to a broad range of analogy problems.

arXiv.org
Really interesting thread for a layman like me trying to make sense of all the hype surrounding this subject, not least in my field.
@emilymbender
To add to the paper @arildsen linked, the following paper made me raise an eyebrow:
https://arxiv.org/abs/2212.03827
Discovering Latent Knowledge in Language Models Without Supervision

Existing techniques for training language models can be misaligned with the truth: if we train models with imitation learning, they may reproduce errors that humans make; if we train them to generate text that humans rate highly, they may output errors that human evaluators can't detect. We propose circumventing this issue by directly finding latent knowledge inside the internal activations of a language model in a purely unsupervised way. Specifically, we introduce a method for accurately answering yes-no questions given only unlabeled model activations. It works by finding a direction in activation space that satisfies logical consistency properties, such as that a statement and its negation have opposite truth values. We show that despite using no supervision and no model outputs, our method can recover diverse knowledge represented in large language models: across 6 models and 10 question-answering datasets, it outperforms zero-shot accuracy by 4% on average. We also find that it cuts prompt sensitivity in half and continues to maintain high accuracy even when models are prompted to generate incorrect answers. Our results provide an initial step toward discovering what language models know, distinct from what they say, even when we don't have access to explicit ground truth labels.

arXiv.org

@ar_lt I try not to spend too much time with preprints, i.e. work that has not been vetted by experts.

But: From the abstract, it sounds like they are trying to inject some external information (the meaning of negation).

Also: "latent knowledge" is an unfortunate overstatement.

@arildsen I try not to spend much time on preprints, i.e. work that has not been vetted by experts. On a quick skim of the abstract, this one seems off the deep end: "reinvigorated debate over whether human cognitive capacities might emerge in such generic models given sufficient training data." -- Human learning doesn't involve that scale of data. So what even is this research question?
@emilymbender just speculating here, but couldn't it be an interesting thought if human-like cognitive abilities could emerge from textual training alone, provided the training material is vast enough? I mean, could the sheer amount of training material somewhat make up for the lack of diversity in modalities?
@arildsen There are interesting questions to be asked about what can be learned from distributional data alone. But calling that "human-like" is a) an overreach and b) seems motivated by a drive to build AGI which ... isn't science.
@emilymbender the authors cite this paper on the "reinvigorated debate": M. Mitchell, “Abstraction and analogy-making in artificial intelligence,” Annals of the New York Academy of Sciences, vol. 1505, no. 1, pp. 79–101, 2021. https://doi.org/10.1111/nyas.14619
I do not know it and haven't looked at it yet. I am just following the paper trail out of curiosity.
@emilymbender getting a worthless answer (and being forced into a crappy "conversation" that you know will end there) is the opposite of convenience.
@emilymbender I'm on the fence about this but I can accept that it's a very tall fence