"Reliance on OpenAI is still a bad idea in the long run. Universities should develop their own models and APIs."
💯💯💯
@TedUnderwood @dh Very interesting report; thanks for sharing. Amazing how quickly it goes from !!😬!! to 'wake up each moment with eternal sunshine of the spotless mind.'
Not sure about your last point:
"To be confident that we’re measuring something called 'suspense' we need to show that multiple people recognize it as suspense."
We can always define a concept and then apply it; the model's performance is additional feedback on the adequacy of our definition, isn't it?
@TedUnderwood But have you studied this other than "scanned a few, seems right to me"? Is it 95% or 99% correct? These models are also known to be vulnerable to adversarial inputs. How often is that a problem?
I mean, yes, it's useful, but I'm wary that it's very easy for humans to slide implicitly from "it's maybe 95% correct" to "the wording is pretty authoritative, it's probably 100% correct, I'm not going to bother checking."
@TedUnderwood yeah, ok. I'm not talking about the correctness of that.
in your experimental process, you ask GPT to:
1. write a chain of reasoning to answer a question
2. rate how "confident" the chain of reasoning seems
and you rely on the rating to improve the prompting. you're checking that the rating makes sense for a few of them, but you're not checking all of them, so you're implicitly trusting that summary in your feedback cycle. how does that distort the process versus not asking for a confidence rating at all?
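concretely, the loop I mean looks something like this. a minimal Python sketch, assuming the current OpenAI client library; the model name, prompt wording, and the 0.8 review threshold are placeholders I made up, not anything from your setup:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask(prompt: str) -> str:
    """One chat-completion call; returns the model's text reply."""
    resp = client.chat.completions.create(
        model="gpt-4",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


def answer_with_confidence(question: str) -> tuple[str, float]:
    # step 1: ask for a chain of reasoning
    chain = ask(f"Reason step by step, then answer: {question}")
    # step 2: ask the model to rate how confident its own chain seems
    rating = ask(
        "On a scale from 0 to 1, how confident does this reasoning seem? "
        f"Reply with just a number.\n\n{chain}"
    )
    return chain, float(rating.strip())  # assumes the reply is a bare number


def feedback_cycle(questions: list[str]) -> None:
    for q in questions:
        chain, confidence = answer_with_confidence(q)
        if confidence < 0.8:  # placeholder threshold
            # low-rated chains get a human look and a prompt revision
            print(f"REVIEW: {q!r} (self-rated {confidence:.2f})")
        # high-rated chains sail through unchecked -- this branch is the
        # implicit trust in the model's self-rating that I'm asking about
```

the distortion question is about that last branch: anything the model rates highly never gets a human look, so errors the model is confidently wrong about never feed back into the prompt revisions.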