Whenever I wonder what cool application I would personally build with #GPT models, I keep coming back to the problem that there's zero guarantee of "worst-case" performance.

When I put on my software engineering hat, my first thought is always to ask "what could go wrong and how?" With #LLMs, the answer is that things could go wrong in completely unpredictable ways; we just try to make that statistically unlikely.

1/3
#nlproc #gpt4 #chatgpt

Using #GPT to build a product is like programming a calculator that gives the right answer 90% of the time, but in 8% of cases fails in subtle and hard-to-notice ways, and in the remaining 2% it claims to be a potato farmer, insults the user, or deletes your hard drive.

Yet somehow people are okay with that, because when it's in the 90%, it's a really really awesome calculator?

2/3

Of course that's not a fair analogy (none is) — #GPT models can do the most impressive things that no other software could do before. But the problem of worst-case behavior remains, and I personally am totally put off by that.

I love the potential of #AI models for creative uses, I just don't see myself wanting to build any other kind of serious application with them at this point. And I'm surprised that so many people don't seem to care.

3/3

@mbollmann Yeah, that's my problem too. I've been prototyping various programs, and the edge-case failures are too big of a problem.

Seems like the only real option is "assistant"-type programs, which is still a huge niche, though a very different one from standard programs. I do think this could be improved with really good validation tools, e.g. GPT writes a script for a very specific niche and then it's rigorously tested, though for now that's more effort than it's worth.
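To make the idea concrete, here's a minimal sketch of that validate-before-accept loop. Everything here is illustrative: `generate_code` is a hypothetical stand-in for a real model API call, and `add` is just a toy target function.

```python
# Sketch: only accept model-generated code if it passes a battery of tests.
# `generate_code` is a hypothetical placeholder for an LLM call; here it
# returns a hand-written candidate so the sketch is runnable.

def generate_code(prompt: str) -> str:
    return "def add(a, b):\n    return a + b\n"

def validate(source: str, fn_name: str, tests: list[tuple]) -> bool:
    """Run the generated source and check it against known test cases.
    (In practice you'd execute untrusted code in a sandbox, not with exec.)"""
    namespace: dict = {}
    try:
        exec(source, namespace)
    except Exception:
        return False
    fn = namespace.get(fn_name)
    if not callable(fn):
        return False
    try:
        return all(fn(*args) == expected for args, expected in tests)
    except Exception:
        return False

tests = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
candidate = generate_code("Write a Python function add(a, b).")
accepted = validate(candidate, "add", tests)
print(accepted)  # True for this hand-written candidate
```

The catch, as noted above, is that writing tests rigorous enough to catch subtle failures is often more work than writing the script yourself.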

@wraptile @mbollmann I'm actually very thankful that LLMs tend to work better as assistants. It leads to outcomes where humans are augmented through their collaboration with the tool rather than replaced. Hopefully this limitation of LLMs continues for a long time!

I think we'll find that this niche turns out to be huge.

@mbollmann Agreed. Especially if you're working with paying customers or data prone to bias, not being able to control the black box is problematic. You can put a Labs or Alpha label on it, or position it as a suggestion, but some risk of brand-damaging hallucinations remains. Will users come to accept the weirdness without blame? Unsure #GPT4
@mbollmann Well said. While I'm skeptical by nature, I also see value in "soft reliability" tasks: brainstorming, creativity, finding things in huge data sets, etc. But hard decisions should not be made or blindly accepted. Even in the support chatbot use case, I'd fear that they'd give a wrong answer in 1% or 0.1% of cases. A human-curated FAQ that covers 90%, and human chat for the rest, is still better, imho. But I'm also old, not a capitalist, and underestimate people's carelessness 🤷

@fabian I've seen several reports of researchers being contacted about papers they didn't write, because ChatGPT claimed that they did.

That's a relatively harmless failure case, but it's easy to imagine bigger problems resulting from people putting too much blind faith in a model's output. Especially if it's so convincingly presented.

@mbollmann Just as an assistant, helping to write pieces of code or suggesting them, it works quite well. That is how many people use it, I think?
@ErikJonker Oh, absolutely. But even there you have to be alert to carefully check the suggestions, and they can be wrong in very subtle ways. I feel many people have too much blind faith in the output.