Wow... while we were all making fun of Google's Bard demo for making some small mistakes about the James Webb Space Telescope, it turns out the Bing demo was wildly hallucinating, making up financial comparisons between Gap and Lululemon! https://dkb.blog/p/bing-ai-cant-be-trusted
Bing AI Can't Be Trusted

Microsoft knowingly released a broken product for short-term hype.

These are some seriously misleading errors!

> Lululemon’s gross margin is given as “58.7%”, which is a hallucinated value that doesn’t appear in their financial document. The real value is 55.9%.
>
> Lululemon’s operating margin is 19%, not 20.7%.
>
> Lululemon’s diluted earnings per share is $2.00 not $1.65.
>
> Cash and cash equivalents is wrong for Gap (should be $679 million) but correct for Lululemon.
>
> Inventory is wrong for Gap (should be $3.04 billion) but correct for Lululemon.

@simon Let's test AI in production, the best kind of testing!
@simon so you mean they are all crap? 😆
@simon can't wait for this whole situation to be written off as a collective hallucination
@simon @mattjhodgkinson It isn’t a small mistake. It’s how these work. There is no verification of anything they produce, breaking expectations of users everywhere.
@SloanLA @simon There needs to be anchoring in verifiable information built in to make these tools of any use.

@mattjhodgkinson @SloanLA the wild thing here is that's supposed to be how the Bing one works!

It runs regular searches and, according to the leaked prompts at least, instructs the language model to use only those facts in its output and to provide citations.

Problem is, you can't actually tell a language model to do that - it's still going to predict plausible-sounding but made-up next tokens, because that's how language models work
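A minimal sketch of the pattern being described here. The function name and prompt wording are my own invention, not Microsoft's actual system prompt; the point is that the "use only these facts" instruction is just more text in the context window, with nothing mechanically enforcing it:

```python
def build_grounded_prompt(question: str, snippets: list[str]) -> str:
    """Assemble a retrieval-augmented prompt: numbered search snippets
    plus an instruction to answer from those facts alone and cite them.
    The instruction is plain text -- the model is still free to predict
    made-up numbers that never appeared in any source."""
    sources = "\n".join(
        f"[{i + 1}] {snippet}" for i, snippet in enumerate(snippets)
    )
    return (
        "Answer using ONLY the numbered sources below, and cite them "
        "like [1].\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Hypothetical snippet, echoing the correct figure from the quoted post.
prompt = build_grounded_prompt(
    "What was Lululemon's gross margin?",
    ["Lululemon reported a gross margin of 55.9%."],
)
print(prompt)
```

Whatever comes back from the model would then need to be checked against those sources - which is exactly the verification step that's missing.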

@simon @mattjhodgkinson @SloanLA yeah, this will go down in history as a (totally predictable) BS use case. But it does look like LLMs can be used on top of proper search. Check out this recent paper by FAIR: https://arxiv.org/abs/2302.04761
Toolformer: Language Models Can Teach Themselves to Use Tools

Language models (LMs) exhibit remarkable abilities to solve new tasks from just a few examples or textual instructions, especially at scale. They also, paradoxically, struggle with basic functionality, such as arithmetic or factual lookup, where much simpler and smaller models excel. In this paper, we show that LMs can teach themselves to use external tools via simple APIs and achieve the best of both worlds. We introduce Toolformer, a model trained to decide which APIs to call, when to call them, what arguments to pass, and how to best incorporate the results into future token prediction. This is done in a self-supervised way, requiring nothing more than a handful of demonstrations for each API. We incorporate a range of tools, including a calculator, a Q&A system, two different search engines, a translation system, and a calendar. Toolformer achieves substantially improved zero-shot performance across a variety of downstream tasks, often competitive with much larger models, without sacrificing its core language modeling abilities.
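The mechanism the abstract describes - the model emitting API-call markers in its text, which get executed and spliced back in - can be sketched like this. The `[Calculator(...)]` marker syntax follows the paper's examples, but this executor is a toy illustration, not the authors' code, and the numbers are made up:

```python
import re

def run_calculator_calls(text: str) -> str:
    """Find Toolformer-style [Calculator(...)] markers in generated
    text, execute the arithmetic, and splice the result back in."""
    def execute(match: re.Match) -> str:
        expr = match.group(1)
        # Only allow simple arithmetic so eval() is safe here; leave
        # anything else untouched rather than executing arbitrary code.
        if not re.fullmatch(r"[0-9+\-*/(). ]+", expr):
            return match.group(0)
        return f"{eval(expr):g}"
    return re.sub(r"\[Calculator\(([^)]*)\)\]", execute, text)

# Illustrative only - these figures are invented for the example.
print(run_calculator_calls("Margin is [Calculator(100 * 559 / 1000)]%."))
```

The key difference from the Bing setup: the number in the final text comes from running the tool, not from the model's next-token guess.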

@simon @SloanLA You can lead an LLM to sources, but you can’t make it think.
@simon Anyone who has ever used Bing as a search engine is completely unsurprised. It seems to have a built-in randomiser. There is a reason that it’s allowed in China; like, good luck finding anything on Bing. Bing AI was always going to be psychedelic babbling.
@simon GPTs have no episodic memory, so I guess they'll keep hallucinating. The transformer predicts a vector that is mostly a general idea, and the final step is basically the decoder of a VAE, so it will generate plausible-sounding stuff from any general idea. The way to improve would be to remember training data, which search engines are already kind of doing, and transformers are query/key/value based, so it should not take too long.
@simon The memes about this from the inside have been Rather Good.