Whenever I wonder what cool application I would personally build with #GPT models, I keep coming back to the problem that there's zero guarantee of "worst-case" performance.

When I put on my software engineering hat, my first thought is always to ask: what could go wrong, and how? With #LLM|s, the answer is that things can go wrong in completely unpredictable ways; we just try to make that statistically unlikely.

1/3
#nlproc #gpt4 #chatgpt

Using #GPT to build a product is like programming a calculator that gives the right answer 90% of the time, but in 8% of cases fails in subtle and hard-to-notice ways, and in the remaining 2% it claims to be a potato farmer, insults the user, or deletes your hard drive.

Yet somehow people are okay with that, because when it's in the 90%, it's a really really awesome calculator?

2/3

Of course that's not a fair analogy (none is) — #GPT models can do the most impressive things that no other software could do before. But the problem of worst-case behavior remains, and I personally am totally put off by that.

I love the potential of #AI models for creative uses, I just don't see myself wanting to build any other kind of serious application with them at this point. And I'm surprised that so many people don't seem to care.

3/3

@mbollmann Yeah that's my problem too. I've been prototyping various programs and the edge case failure is too big of a problem.

Seems like the only real option is "assistant"-type programs, which is still a huge niche, though a very different one from standard programs. I do think this could be improved with really good validation tools, e.g. GPT writes a script for a very specific niche and then it's rigorously tested, though for now that's more effort than it's worth.
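A minimal sketch of the validation idea mentioned above: only accept model-generated code after it passes a suite of known input/output checks. Everything here (the function name, the sample generated sources) is hypothetical, and `exec` would need real sandboxing in any serious use.

```python
# Hypothetical sketch: gate LLM-generated code behind a test suite
# before it is ever used. All names here are illustrative.

def validate_generated_code(source: str, func_name: str, test_cases: list) -> bool:
    """Exec the generated source in a fresh namespace and run the named
    function against known (args, expected) pairs; reject on any failure."""
    namespace = {}
    try:
        exec(source, namespace)  # NOTE: needs proper sandboxing in production
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in test_cases)
    except Exception:
        return False

# Example: suppose the model was asked for a Celsius-to-Fahrenheit converter.
good = "def c_to_f(c):\n    return c * 9 / 5 + 32\n"
bad = "def c_to_f(c):\n    return c * 5 / 9 + 32\n"  # subtly wrong

cases = [((0,), 32), ((100,), 212), ((-40,), -40)]
print(validate_generated_code(good, "c_to_f", cases))  # True
print(validate_generated_code(bad, "c_to_f", cases))   # False
```

The catch, as noted, is that writing the test cases is often as much work as writing the niche script yourself, which is why this only pays off when the checks are reusable.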

@wraptile @mbollmann I'm actually very thankful that LLMs tend to work better as assistants. It leads to outcomes where humans are augmented through their collaboration with the tool rather than replaced. Hopefully this limitation of LLMs continues for a long time!

I think we'll find that this niche turns out to be huge.

@mbollmann Agreed. Especially if you're working with paying customers or data prone to bias, not being able to control the black box is problematic. You can put a Labs or Alpha label on it, or position it as a suggestion, but some risk of brand-damaging hallucinations remains. Will users come to accept the weirdness without blame? Unsure #GPT4
@mbollmann Well said. While I'm skeptical by nature, I also see value in "soft reliability" tasks: brainstorming, creativity, surfacing findings in huge data sets, etc. But hard decisions should not be made or blindly accepted. Even in the support chat bot use case, I'd fear that they'd give a wrong answer in 1% or 0.1% of cases. A human-curated FAQ that covers 90%, and human chat for the rest, is still better, imho. But I'm also old, not a capitalist, and underestimate people's carelessness 🤷

@fabian I've seen several reports of researchers being contacted about papers they didn't write, because ChatGPT claimed that they did.

That's a relatively harmless failure case, but it's easy to imagine bigger problems resulting from people putting too much blind faith in a model's output. Especially if it's so convincingly presented.

@mbollmann Just as an assistant, helping write pieces of code or suggesting them, it works quite well. That is how many people use it, I think?
@ErikJonker Oh, absolutely. But even there you have to be alert to carefully check the suggestions, and they can be wrong in very subtle ways. I feel many people have too much blind faith in the output.

@mbollmann

Think like a business clown, not like an engineer.

You need something that seems to work on the surface and is cheap. Then you sell it for a profit, and if people begin to cry you say: sorry, people! Software's always been shitty, everybody knows that. We might fix this in the next release if you're lucky.

Meanwhile your competitors, stupid enough to put actual work and effort into their product, silently leave the market because they are prohibitively expensive.

@mbollmann I mean, we live in a world where the single most important cryptography software is written in C. And it shows. And everyone is fine with that because that's how the world works.
Good thing AI wasn't around when they were inventing calculators! Because then we'd have calculators today that go like "how much is 7 by 23? LOL a lot."
@mbollmann I don't know about the potato farmer scenario, but failing in rare cases in subtle and hard to notice ways is what happens with the vast majority of software that I've ever used.