One of the things I'm finding so interesting about large language models like GPT-3 and ChatGPT is that they're pretty much the world's most impressive party trick

All they do is predict the next word based on previous context. It turns out when you scale the model above a certain size it can give the false impression of "intelligence", but that's a total fraud

It's all smoke and mirrors! The intriguing challenge is finding useful tasks you can apply them to in spite of the many, many footguns

And in case this post wasn't clear: I'm all-in on large language models. They confidently pass my personal test for whether a piece of technology is worth learning:

"Does this let me build things that I could not have built without it?"

What I find interesting is that - on the surface - they look like they solve a lot more problems than they actually do, partly thanks to the confidence with which they present themselves

Figuring out what they're genuinely good for is a very interesting challenge

@simon Ha! Super true. People incorrectly put faith in ChatGPT for the same reason they incorrectly put faith in me when I'm on pub quiz teams. We just can't help but speak with undue confidence. I'd imagine that ChatGPT would also be a frustrating quiz team mate.
@wichitalineman Hah, I really love that model of ChatGPT where it's a pub quiz team member who's often right, sometimes wrong but expresses complete confidence in their answer either way!
@simon @wichitalineman I work in ML/AI for my full-time job, and the authoritative tone these LLMs communicate in, coupled with the lack of proper sourcing, is truly worrying for many people on my team. It will be interesting to see how Google goes about injecting LaMDA into search results, given the lessons they have learned over the course of the company's life.

@arthur @simon

Based on Google's and other big tech firms' previous behaviour, my guess would be:

1) LaMDA will have hard-to-predict consequences
2) Some of those consequences will be harmful
3) Google will drag their feet when acknowledging/taking accountability for the harm

@simon The applicable problem space is even fuzzier because there are some things they do well, but not consistently. Which to me means the problem space has to be limited to areas where the cost of a false positive/negative is low.

@dbreunig right - there are so many potential applications where you might get good results 90% of the time and utter garbage 10% of the time. That should be completely unacceptable for things like loan application evaluation, but might be fine for things like bulk sentiment analysis to identify general trends

There's a lot of depth just to learning how to identify places that the technology is a good fit
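As a toy illustration of why that trade-off can work for aggregate questions (entirely made-up numbers, not from the thread): random per-item errors mostly wash out in a trend, even though any single prediction stays unreliable.

```python
import random

random.seed(0)
true_labels = [1] * 700 + [0] * 300  # ground truth: 70% positive sentiment

def noisy_classify(label, accuracy=0.9):
    # Simulate a classifier that is right 90% of the time
    # and flips the label the other 10%.
    return label if random.random() < accuracy else 1 - label

predicted = [noisy_classify(label) for label in true_labels]

true_rate = sum(true_labels) / len(true_labels)
measured_rate = sum(predicted) / len(predicted)
# With symmetric 10% noise the expected measured rate is
# 0.9 * 0.70 + 0.1 * 0.30 = 0.66: a predictable, correctable bias
# for the aggregate, while any single prediction is 10% likely wrong.
```

That 0.70 → 0.66 shift is systematic, so a trend line survives it; a loan decision made on one noisy prediction does not.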

@simon Agreed. So far all appropriate apps fit into a) toys, b) selector interfaces (letting the user 'approve' final output) or c) fully mediated output (an artist in Photoshop tweaking the bad parts of infill that audiences never see). I'm sure there are more modes.

@dbreunig one area I'm particularly excited about is data extraction: given a huge jumble of badly OCR'd documents, can a language model be used to ask questions of each scanned page to extract relevant facts from them?

If you applied data entry people to a task like that you'd also get a portion of errors
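A hypothetical sketch of how that might look in code; the `complete` callable and the prompt wording are stand-ins for whichever LLM API you'd actually use:

```python
# Sketch: ask a question of one OCR'd page and treat the model's
# reply as an extracted fact. Nothing here is tied to a specific API.

def build_extraction_prompt(page_text: str, question: str) -> str:
    """Wrap one scanned page and a question in an extraction prompt."""
    return (
        "The following text was OCR'd from a scanned document and may "
        "contain errors.\n\n"
        f"---\n{page_text}\n---\n\n"
        f"Question: {question}\n"
        "Reply with just the fact, or NOT_FOUND if it is not on the page."
    )

def extract_fact(page_text, question, complete):
    # `complete` is any callable that takes a prompt string and
    # returns the model's text response.
    answer = complete(build_extraction_prompt(page_text, question)).strip()
    return None if answer == "NOT_FOUND" else answer
```

Running this over every page with a handful of questions gives you a rough structured dataset, which you'd then spot-check the same way you would human data entry.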

@simon Also agree. Bureaucracy, where forms and formulas have tried to interface with humans as if they were APIs, and its artifacts are where the potential energy lies. Anywhere there are rote tasks and boredom is the X marking the spot.

I call it the ‘Brazil Antidote’ model. Doesn’t get enough attention because it’s not sexy. I’m going to found the Boring AI Working Group.

@dbreunig I find it pretty interesting that a while ago web scraping was rebranded "robotic process automation" and quietly became a multi-billion dollar market https://en.m.wikipedia.org/wiki/Robotic_process_automation

@simon Over a decade ago I built a consumer surveying tool inside a big media company. The trick was that all survey questions were open-ended, which dramatically cut down on spam responses, because writing how you actually felt was easier than lying. (Unlike, say, blindly clicking a radio button.) I then fed the responses through MTurk and similar services to cluster them into multiple-choice answers after the fact. Would be perfect for this stuff.
@simon @dbreunig The use case you describe is commonly done with BERT, which is another language model but I think it works in a different way. I would be slightly curious to see how ChatGPT compares, since BERT was more directly designed for that but ChatGPT is newer and larger.
@plznuzz @dbreunig I'd love to read more about using BERT for this kind of project - I wouldn't know where to even start with that right now
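For what it's worth, a minimal extractive-QA sketch using the Hugging Face transformers `question-answering` pipeline might look like this; the chunking helper is an assumption added to work around BERT-style models' fixed input length, and the character-window sizes are arbitrary:

```python
def chunk_text(text: str, size: int = 400, overlap: int = 50):
    """Split long text into overlapping character windows, since
    BERT-style models have a fixed maximum input length (~512 tokens)."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break
        start += size - overlap
    return chunks

def answer_from_document(question: str, text: str):
    """Run an extractive-QA pipeline over each chunk and keep the
    highest-confidence span. Requires `pip install transformers`."""
    from transformers import pipeline
    qa = pipeline("question-answering")  # defaults to a BERT-family model
    results = (qa(question=question, context=chunk)
               for chunk in chunk_text(text))
    # Each result is a dict with "answer" and "score" keys.
    return max(results, key=lambda r: r["score"])
```

Unlike ChatGPT, this extracts a literal span from the source text, so it can't invent a fact that isn't on the page, though it can still pick the wrong span.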

@simon

The way I look at it, machine learning in general (including these large language models) is great when you have the following problem criteria:

#1: You need to build a pattern matcher.
#2: You don't know what to look for.
#3: When the pattern matcher is finally built, you don't care to know what it actually looks for.
#4: The results are allowed to be hilariously, insanely wrong some % of the time.

And there are actually a lot of things that match those criteria

@ncweaver @simon I'm totally on board with ML if we acknowledge #4, which is also why I'm uncomfortable with ML being used for 1:1 therapy situations at this point.

@pamelafox @ncweaver the 1-1 therapy thing is terrifying to me - imagine trying to get therapy from your iPhone keyboard!

But that said, I do use ChatGPT as an alternative to rubber duck programming sometimes: if I'm stuck on something I'll kick off a conversation purely as a thinking aid, and it's often effective

That feels pretty different to me from the therapy thing, but not a million miles away from it

@simon @ncweaver yeah to be fair, I'm totally gonna try chatgpt for therapy-light, like social situation advice, but I'm emotionally stable enough and am aware that chatgpt3 is just an LLM. my concern is for folks who may be close to harming themselves or others, and an overly humanized chatbot tells them something that guides them awry.
@pamelafox @simon
And I'm a security person, so most of my applications can't stand #4...

@ncweaver I really like that criteria list

It would be really useful if there were a solid, easy-to-understand list of use cases and anti-use-cases to point people to

@simon I think for the anti-use cases, #4 itself captures it. "Is it OK if you are hilariously, outrageously, gobsmackingly wrong and you don't know it?"

Which is why I find Tesla's "AI first" development model for autonomy frightening, beyond just the fact that they are training it based on how Tesla drivers drive...

@ncweaver @simon I think #3 is usual but not always. Sometimes, for instance, generative AI will bring out patterns that you can observe but that weren't obvious before. "Oh, I see the model associates X with Y."
@ncweaver @simon I think there's a lot of gaming and entertainment applications specifically for experiences that are hard to script. Imagine an offline game where you can bargain or reason with NPCs in natural language, and sometimes they say something dead stupid, but that's part of the charm.
@ncweaver @simon is the list of “when delegating to another human is called for” different?
@hans @simon
Humans will tell you what they are matching on, and the hilariously wrong failure modes are often different.
@ncweaver @simon Essentially, the best use case is: Writing comedy.
@simon Dunning-Kruger as a service.
@simon

A missing 🧩 here is that making large language models has a huge, destructive climate impact, and we should quit it until we've got the climate sitch under control.

Using them is the same as any other app, so end-users and API-users, don't feel guilt-tripped. Making them, on the other hand, wrecks the world 💔

@Sandra I've not found that argument very convincing yet

Sure, there's a HUGE energy cost in training a model... but that model can then be put to use for many years into the future

text-davinci-003 was trained once, at great expense... but has since run inference millions (probably billions) of times for millions of people

This looks even better for openly released models like Stable Diffusion: trained once, then distributed to anyone who wants to use it

@Sandra I remember being amused when I saw one model, trained by a university in France, whose makers boasted that 90% of the power used to train it came from a nearby nuclear reactor!

Wish I could find that reference now

@simon

Seems to me like new models are popping up all the time these days 💔

@Sandra @simon

I'm not sure "years into the future" even matters.

My thinking is this: it doesn't matter if my shovel lasts "years into the future" or not. What matters is "how much use do I get out of the shovel before it stops being a useful tool?" That same amount of use may be spread out over years or months.

I think the question is mostly just about amortization.

Well, that + moral effects / consequences of its use.

@masukomi

That doesn't address my complaint, since externalities (in this case, that the planet will be wrecked) aren't fully factored into the cost of making the shovel. That means I want to optimize for as few new such shovels as possible, regardless of how much they're used.

The fact that language model use is going up, making new models more "cost effective" in this leaky abstraction, is more a part of the problem than part of the solution.

The value the atmosphere needs us to optimize down isn't "fossils burnt divided by utility", it's "total fossils burnt". So increasing the amount of use cases harms more than it helps.

@simon
@Sandra @masukomi @simon I fully agree with your arguments, and think that the only justification for a model on this basis would be if it brought down energy use elsewhere. I'm not a big fan of smart devices, but I could imagine smart meters regulating heating; in that direction there would be more benefit than invested energy.
@simon @Sandra
Many of these models will need to be retrained continuously because the users will expect them to be up-to-date with news events, celebs etc. So it is not train-once. And also, this totally ignores the cost of using these things at scale, which dwarfs the cost of training them.

@wim_v12e @Sandra I've been exploring alternatives to re-training the entire model to bake in new facts through mixing in results from other systems directly into the prompt - it's a really promising avenue: https://simonwillison.net/2023/Jan/13/semantic-search-answers/
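A stripped-down sketch of that pattern (the `embed` callable is a stand-in for a real embeddings API, and the prompt wording is invented): rank your stored passages by similarity to the question, then paste the best ones into the prompt so the model answers from fresh facts without retraining.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def build_prompt(question, passages, embed, top_k=2):
    # `embed` is any callable mapping text to a vector; in practice
    # this would be an embeddings API endpoint.
    q = embed(question)
    ranked = sorted(passages, key=lambda p: cosine(embed(p), q),
                    reverse=True)
    context = "\n\n".join(ranked[:top_k])
    return (
        "Answer the question using only this context:\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

The model never needs to have seen the passages during training, which is what makes this a workaround for stale knowledge.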

The cost of using them is definitely enormous - I've seen reports of ChatGPT costing $3,000,000/day or more - but again, that's spread across many users

At least it's not Bitcoin mining!


@wim_v12e @Sandra I remain hopeful that some day in the future it will become possible to run a large language model on a personal device - for both energy and privacy reasons

I can run Stable Diffusion on my iPhone already, but that's a MUCH smaller model than the various LLMs

@simon

Talking about making the models, not running them
@Sandra I was replying to @wim_v12e who said "And also, this totally ignores the cost of using these things at scale, which dwarfs the cost of training them."
@simon @Sandra
Even if they can be scaled down then that will lead to more instances of them, and very likely disproportionately more. I fear this technology might lead to quite a dramatic increase in emissions from computing.
@simon This is exactly what I was thinking today in the car when my wife asked: ”What’s on your mind?”

@simon

Human readers are what supply a rational interpretation of prompt responses that were only modeled to look plausible.

That's useful when the responses happen to be factual and accurate. Otherwise the interpretation is misleading and requires critical filtering.

This is a critical failing of current models, one to be improved in future models.

@simon Research. It knows how to give properly referenced citations, so you ask it to write an article summarizing the research on the pros and cons of xyz topic, listing your concerns. And then you use that as a well-structured summary for digging into the articles -- and authors -- that it found. Maybe you should list it as a co-author? Or just mention using it as a research technique. Rather like using a reference librarian, though. Or an old-fashioned card-catalogue.