If you try to use ChatGPT and similar language models as search engines they're going to lie to you, a lot, and you're at risk of writing off the whole space as hype

The trick is to learn what they're useful for and how to take advantage of them, which is actually quite a lot of work

@simon Every time someone says "you've just got to learn what these glorified Markov chains are good for" it sounds a lot like "but not ALL cryptocurrencies are Ponzi schemes, some are nice"

@jwz it's weird, I'm very firmly in the "cryptocurrencies are a waste of everyone's time" camp, but the more time I spend with large language models the more convinced I am that they are going to let me solve all kinds of problems that I couldn't solve before

Nailing down exactly what those problems are is a lot more involved than I think most people expect though

@simon What kind of problems?

Even if there are upsides, the downsides seem pretty severe. These systems are optimized for bullshit and lies, like, that's their core competency to which all use degrades. See also "facial recognition is the plutonium of AI". https://jwz.org/b/yjMP

@jwz Extracting structured data from unstructured text is one promising angle - I'm interested in the potential for investigative data journalism, for problems like turning 20,000 ad-hoc poor quality scanned police complaint reports into actionable information, without spending six months on human-powered data entry first
@simon @jwz without spending six months on human-powered data entry first, you will never be able to know if the GPT result is a series of hallucinations, like so much of its output

@amyhoy @jwz I'm talking about prompts like "Here is a copy and pasted police report. Return JSON with the names of the mentioned officers and the date of the incident"

My hunch here is that spot checks on the results could help tell if it's working well enough, and that the end results would reach the same level of accuracy as asking human data entry people (who are also fallible) to do the same task

If there's a better way to do this than using a language model I'm interested to hear it
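
(A minimal sketch of the kind of extraction call being described here, assuming the OpenAI Python client — the model name, the JSON-only instruction, and the spot-check helper are illustrative, not anything from the thread:)

```python
import json
import random

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = """Here is a copy and pasted police report. Return JSON with
the names of the mentioned officers and the date of the incident.
Reply with JSON only.

{report}"""

def extract(report_text):
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice; any chat model would do
        messages=[{"role": "user", "content": PROMPT.format(report=report_text)}],
    )
    # json.loads will raise if the model wraps the JSON in prose,
    # which is itself a useful signal that the prompt needs work
    return json.loads(response.choices[0].message.content)

def spot_check(records, n=20):
    # pull a random sample for a human to verify against the original
    # scans, per the spot-checking idea above
    return random.sample(records, min(n, len(records)))
```

Whether the error rate in that sample is acceptable is exactly the judgment call argued over in the rest of the thread.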

@simon @amyhoy
But how could you ever trust that data? You are asking for *facts* but the system is optimized to produce *believable answers* which are not at all the same thing.

Suppose the system optimizes its march to the goal by just making up some numbers that subtly (or not so subtly) tilt the data one way or another. Now you've built a black box to confirm your biases.

And the black box, by its nature, cannot "show its work" without lying.

@jwz @amyhoy The black box thing is why I'm finding this whole space so utterly beguiling

I hate that it's a black box. But I've spent my entire career working with computers that do exactly what you tell them... and now I'm faced with one that very much does not do that

It's like someone's given me a spell that raises actual dragons from another dimension and challenged me to try and tame them!

@jwz @amyhoy I can't see this tech being un-invented, so the interesting question to me now is what I can build with it now that I couldn't build before - and what are the new, genuinely valuable problems I can solve for people
@simon @amyhoy I've got no time for that attitude, whether applied to dangerous software or chemical weapons. We regulate things that cause harm.

@jwz @simon @amyhoy Actually, we generally don't do a good job of regulating things that hurt working-class people.

@ian @jwz @simon @amyhoy It is actually really good at syntactic problems. That seems fine to me. If you go looking for the truth, that might be misguided.
@ian @simon @sayrer @jwz there are so many cases of it doing wrong (very basic!!) math and “explaining” why its wrong stuff is right. not to mention anything more complex than 2+2=4, like word problems. so i would say no, it isn’t very good at syntax problems.

@amyhoy @ian @sayrer @jwz It's a next-token-predicting language model, so using it for math is very much the wrong application of it - that's one of the many reasons I keep trying to convince people that these things are deceptively difficult to use effectively

The idea that a computer can be bad at math is very counter-intuitive!

@simon @amyhoy @ian @sayrer @jwz I think a key part here is making sure those of us who are critical are specific and fluent in the systems we’re criticizing, rather than blanket dismissal that sounds glib because we’re fast-forwarding to the conclusion instead of showing our work.
@simon @amyhoy @ian @sayrer @jwz What Molly White (and to some degree, Moxie) did in breaking down the faults and flaws of crypto assertions did far more to hasten good regulation than any amount of “it a bunch of dumb scams!” ranting did. Simon’s path here seems more likely to yield effective harm reduction.

@anildash @simon @amyhoy @ian @sayrer @jwz I just wish it were easier to separate the wheat from the chaff.

As someone who is generally positive toward this technology, I can still benefit from thoughtful critiques on its efficacy informed by experts who understand its limits.

Even amongst those who see value here, it’s important to understand what we’re dealing with and how far it can be taken safely.

@jeff @anildash @simon @amyhoy @ian @sayrer @jwz But much of the discourse seems like motivated reasoning, as it’s perceived as posing a risk to certain professions.

For many it seems less about the technology’s limits and more that they don’t _want_ it to be/get good.

@jeff @anildash @simon @amyhoy @ian @sayrer @jwz what would it mean for this tech to “be good”? Its entire *purpose* is to be a bullshit fountain. Like, at minimum, to be able to provide reliable outputs, the training data would need to be editorially flagged as “true” or “false”, not an undifferentiated slurry of Internet Words, which is such a monumental undertaking as to make the whole effort no longer cost-effective.
@jeff @anildash @simon @amyhoy @ian @sayrer @jwz as specified, an LLM’s job is always to repeat common misconceptions or likely errors, not to produce accurate results. By construction its erroneous outputs will always be maximally unsurprising so as to subvert spot-checking; its stipulated goal is just to make the same mistakes the median human would make, just… faster

@glyph @jeff @anildash @simon @amyhoy @ian @jwz

Is this right? I think it seems ok. I don't use Twisted, but it seems about right and I could fix anything I don't like or that is in fact incorrect.

@sayrer @glyph @anildash @simon @amyhoy @ian @jwz I don’t use Python so I can’t say, but I have had it write significantly more complex command line tools that do something similar in Swift, and they worked.

Granted, it took a few follow-up replies to get it right, and I benefit from being a Swift dev.

Still faster than writing from scratch, and I have no reason to assume it won’t improve in time.

@jeff @glyph @anildash @simon @amyhoy @ian @jwz Right, so you can get something like "create-react-app", but more versatile. I don't really object to project templates, but they don't write programs either.
@sayrer @jeff @anildash @simon @amyhoy @ian @jwz reproducing small, uncontroversial examples is something that it undoubtedly excels at. In this case it’s reproduced a bad, legacy way of accomplishing this, but I can’t fault it for that; the overwhelming majority of historical training data would present it that way. (In fact this is so short it’s nearly plagiarized from historical documentation)
@sayrer @jeff @anildash @simon @amyhoy @ian @jwz FWIW I do strongly agree with Anil here — my minor gripes here are not going to lead to substantive policy outcomes, for that we will definitely need an “AI is going just great”; and, no disrespect to molly, but AI is a less target-rich environment than blockchain (it’s hard to imagine a *more* target-rich environment than that) so this is going to be a bigger lift
@glyph @jeff @anildash @simon @amyhoy @ian @jwz so, the key here is that you can have it elaborate. I wrote "Can you write one..." and it understood that. That is a good advance, and I think it's a mistake to focus on the answers, which I agree are mostly precooked.
@sayrer @jeff @anildash @simon @amyhoy @ian @jwz and this is definitely wrong, in kind of a funny way
@glyph @jeff @anildash @simon @amyhoy @ian @jwz I'm cool with that, I've never written a program in Twisted, but I knew you wrote it. How is it wrong?

@sayrer (removing the large CC list here, because I don't think this is of quite so broad an interest)

1. it never sets a content type, so it's not controlling the interpretation of the response
2. it's manually doing quoting rather than using the built-in twisted.web.template
3. it's assuming it's emitting HTML but it doesn't enclose anything in an HTML document
4. the request isn't necessarily in UTF-8, so there's maybe some wiggle room for an encoding-confusion attack here

@sayrer it's also doing some stylistic stuff wrong that other examples do wrong: it's using listenTCP rather than endpoints, and it's not encapsulating its main function in an `if __name__ == '__main__':` block, just executing it at the top level of the script, so it can't be imported as a module. And since it's doing listenTCP it can't do HTTPS (which is, I should say, the commonest security problem)
@sayrer it's just the sort of thing that I would expect a cut-rate doesn't-really-know-Twisted consultant to come in and do
@sayrer it also should probably be in an .rpy file so you can use `twist web` and not run it directly with python but that's really nitpicking :)
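
(For contrast, a minimal sketch of the shape glyph is describing above — endpoints instead of listenTCP, an explicit content type, and a `__main__` guard. The resource and port are invented for illustration, not taken from the generated example:)

```python
from twisted.internet import endpoints, reactor
from twisted.web import resource, server

class Hello(resource.Resource):
    # a leaf resource: no child lookup, just render
    isLeaf = True

    def render_GET(self, request):
        # point #1 above: declare the content type explicitly
        request.setHeader(b"content-type", b"text/plain; charset=utf-8")
        return b"hello\n"

def main():
    # an endpoint description instead of reactor.listenTCP, so serving
    # TLS is just a different string (e.g. "ssl:8443:...") rather than
    # a rewrite
    endpoint = endpoints.serverFromString(reactor, "tcp:8080")
    endpoint.listen(server.Site(Hello()))
    reactor.run()

if __name__ == "__main__":
    main()
```
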
@glyph Right, it's actually interesting! I knew it was doing the Python wrong (it's always wrong, and they seem to just make a habit of changing the rules...), and all of the HTTP stuff was a little too simple (I'm in the HTTP 1.1 acks), but I thought it was funny that it got so close.

@glyph Your #1 and #4 are not required, since I think you would hit the chardet stuff, and #3 seems pretty picky (the HTML5 algorithm would automatically insert the needed elements). #2 I can't speak to, but I believe you that it's not idiomatic.

So you have a thing that would actually totally work, but has been arrived at in a strange way.

@glyph Now, of course the next step is to put this generated code on the internet and see what happens.
@glyph @sayrer @jeff @anildash @jwz @amyhoy @simon @ian but it didn’t understand, it just computed what a likely response would have been, if one had been included in its training set
@ShadSterling @glyph @jeff @anildash @jwz @amyhoy @simon @ian I agree with what you say about the response, but it did understand that it was to follow up on the previous effort, without any explicit nouns from me. If you try that kind of thing with the various talking cylinders, you will not get that (maybe they're better now, I don't use them daily).
@glyph @sayrer @jeff @anildash @simon @amyhoy @ian @jwz If the premise is "this is a tool that produces close, but wrong, answers," I think that could still be useful. Basically, it's useful in any space where verification/fixing is cheaper than authoring. I could probably use it to answer beginner questions, since most of my time with those is spent typing the answer.
@agocke @glyph @jeff @anildash @simon @amyhoy @ian @jwz oh, precisely. I’m surprised that some others are so hostile here, when they could automate a large variety of repetitive computer questions. these aren’t necessarily stupid questions, but you tend to deal with a lot of the same ones if you are nice enough to answer at all.
@anildash @jeff @jwz @sayrer @glyph @ian @simon @amyhoy I think it’s still fair to be critical. That use case is far narrower than the marketing. And importantly, it never allows you to remove expert oversight.
@glyph most evolving frameworks suffer from this. You google, and find the recommendations from 5 years ago, and only during a code review does someone call you out for being a dinosaur
@Migueldeicaza yes, hence I can’t fault it; this isn’t a *controversial* choice, it’s just the one that most people would choose with some light research. (Heck, probably a bunch of up-to-date docs still describe things this way, it’s hard to do comprehensive updates on a shoestring volunteer budget). Just an example of how LLMs are idea popularity-contest collages and not reasoning beings you can ask for correct answers

@glyph @jeff @anildash @simon @amyhoy @ian @sayrer @jwz
One thing we’ve tested is its ability to generate a large batch of MCQs on a topic that we (as domain experts) can then rapidly prune down to an effective set of questions. In that context the ‘bullshit fountain’ is very useful: writing effective distractors that are viable but _wrong_ is hard to do, while the results are relatively easy to check.

Perfect accuracy isn’t necessarily required for every useful activity; it just has to be more effective than doing the work by hand.
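
(A sketch of the batch-generation prompt that workflow implies — the topic, counts, and wording are invented for illustration, not taken from the post:)

```python
MCQ_PROMPT = """Write {n} multiple-choice questions about {topic}.
Each question needs one correct answer and three distractors that
are plausible but definitely wrong. Mark the correct answer."""

# Generate a surplus and let the domain experts prune: ask for 50,
# keep the 20 whose distractors actually survive expert review.
print(MCQ_PROMPT.format(n=50, topic="acid-base chemistry"))
```
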

@glyph @anildash @simon @amyhoy @ian @sayrer @jwz I’ve used that “bullshit fountain” to write a reasonably complex Swift command line tool to accept input from the user, interact with data sets on AWS S3, and spit back results.

I verified the code and its output.

I find this “it doesn’t work” stuff to be a bit overly dismissive. It’s useful for SOME things, clearly.

@jeff @anildash @simon @amyhoy @ian @sayrer @jwz I didn’t say it doesn’t “work”, I said its definition for “good” is unclear. it’s currently useful to produce outputs that correspond to the median internet user writing on a particular topic, regardless of accuracy. Perhaps the median swift programmer can write an AWS CLI with no particular common security errors, in which case I’m sure your code works great.
@jeff @anildash @simon @amyhoy @ian @sayrer @jwz like the dials on radium watches really did glow! That sliver of utility was not in dispute. But the overall cost-benefit was not worth it. Here, the cost is that once you scale up past trivial examples and convince yourself that it can be unsupervised (or inevitably succumb to review fatigue from the humans in the loop), it will immediately start producing worse quality on more complex tasks

@jeff @glyph @anildash @simon @amyhoy @ian @sayrer
What I (and others) have been saying is not "it doesn't work". I at least am saying:

1) It does not do what you think it does;

2) The thing that you appear to want is absolutely not a thing that it does;

3) It is extremely skilled at lying to you about point #2.

[link: Building A Virtual Machine inside ChatGPT — Engraved]

@glyph well, it's kinda "good" if you actually want bullshit.

Like "letter to X about Y and" Z" or "news-post about X dying" where you'd get a letter with all the default greetings/boilerplate and stuff.

Yes, you'd have to clear up the actual content. But probably less so than if you'd reuse the last letter you'd written, like many people do.

Don't think anything technical is a good application of that bullshit-fountain - but many people spend a lot of time manually generating bullshit.

@drazraeltod @glyph
This right here is the main point, isn't it? Think of all those people who are afraid of losing their job to ChatGPT. It's a tacit acknowledgement that their job involves primarily bullshitting. There's even a whole book about that https://en.m.wikipedia.org/wiki/Bullshit_Jobs

@glyph @jeff @anildash @simon @amyhoy @ian @sayrer @jwz Yup. This is why people in really niche domains are impressed, because the domain knowledge fed into the system is lots of stuff where the content is all in agreement. But as soon as you get out to general information it all falls apart without being able to add some kind of accuracy scoring