Blog post comparing #GPT4 to other forms of text analysis. tl;dr: Yes, it's more accurate, but the real advantage is that it thinks out loud and can argue with you. #LLM #AI @dh https://tedunderwood.com/2023/03/19/using-gpt-4-to-measure-the-passage-of-time-in-fiction/
Using GPT-4 to measure the passage of time in fiction

Large language models are valuable research assistants, especially when they refuse to follow instructions.

The Stone and the Shell
@TedUnderwood @dh This is an interesting application, and I expect more things like this will be useful in the future. But right now, I'm still somewhat skeptical that when you ask GPT for an "explanation of its thoughts", the explanation has any connection to reality. It seems likely to be as much a confabulation/hallucination as anything else, which can be useful, but I don't see how to have confidence that it isn't just hard-to-detect nonsense.
@gray17 @dh I don’t think it has an ability to introspect. But chain-of-thought prompting works because word n+2 is shaped by n and n+1, etc. So the trace is meaningful without any need for introspection.
@TedUnderwood @dh Right, but that works only because of statistical patterns. The "explanation" is derived from the previous text, but it's not clear to me that it won't be misled by patterns that it created itself. It's very easy to fool GPT with, e.g., a riddle that looks like something it's seen before but has a small difference that makes the answer completely different.
@gray17 yes, to be sure. But note that the debatable claim I’m making is not about whether the model is *right*—that part I simply measured in the post. The debatable part was that its errors often leave a trace of words. In the example you just provided, for instance, the riddle would be the trace.
@TedUnderwood Sure, when you know the answer, you can tell that the generated text is wrong. What bothers me about these models is that the generated text is almost always plausible, so if I don't check carefully, I might not notice that it's lying about what it said earlier. With "give an answer and explain your reasoning," sometimes it's obvious that the explanation is about an answer different from the one it gave. Sometimes the error is more subtle, and I don't know the implications of that.
@gray17 fwiw, the way to prompt these models is not "give your answer and explain your reasoning" but "a) summarize the data relevant to this question b) describe step by step how you would draw inferences from that data, and only then finally c) synthesize those data in a conclusion." In other words you ask it to show the reasoning before it answers — that's *how* it reaches the answer.
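That "reason before you answer" shape can be sketched as a simple prompt template. This is a hypothetical illustration of the structure described above, not the exact wording used in the blog post's experiments:

```python
# Hypothetical sketch of a prompt that forces the reasoning to precede
# the answer, per the a/b/c structure described above. The step wording
# is invented for illustration.

def build_prompt(passage: str) -> str:
    """Assemble a prompt in which the conclusion comes last."""
    steps = [
        "a) Summarize the textual details relevant to how much time passes.",
        "b) Describe, step by step, how you would infer elapsed time from those details.",
        "c) Only then, state your estimate of the time that passes in the passage.",
    ]
    return "Passage:\n" + passage + "\n\n" + "\n".join(steps)

prompt = build_prompt(
    "It was a dark and stormy night. By morning, the rain had stopped."
)
```

The point of the ordering is that each generated token conditions on the tokens before it, so the summary and inference steps are available to the model when it finally produces the estimate.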
@gray17 There is no mental state to describe; it's just a sequence of words and, when it's working properly, the thinking happens *in* the sequence of words.
@TedUnderwood Right, I understand that. But in your experiment, you said, "5. Given the amount of speculation required in step 2, describe your certainty about the estimate--either high, moderate, or low." This "your certainty" is entirely imaginary; I don't know what it *means*
@gray17 It means "describe the level of certainty implicit in your answer to step two." I use the term a human being would use ("your certainty"), because that's how English works. But I'm actually instructing the model to look at the text it has just written and generalize about those words.
@TedUnderwood Right, and that's the point where I don't know that I can reliably detect if it's "lying" about the summary or not.
@TedUnderwood Because it's very good at writing something that always looks plausible. If it were a human, I could build a model of the human's reliability and attention to detail, but GPT models are known to fail in weird ways. I have to check everything it says that I don't already know is true.
@gray17 Well, in this case the words are all there on the same page, so as I scan the answers, I can just ask "is the answer to step 5 consistent with what it said in step 2?" Like, does it speculate a lot in 2 and then weirdly say "high confidence" at the end? And in practice, no, it doesn't do that. It can be wrong, but its answers are coherently wrong.

@TedUnderwood But have you studied this other than "scanned a few, seems right to me"? Is it 95% or 99% correct? These models are known to be vulnerable to adversarial inputs, too. How often is that a problem?

I mean, yes, it's useful, but I'm really wary that it's very easy for humans to implicitly go from "it's maybe 95% correct" to "the wording is pretty authoritative, it's probably 100% correct, I'm not going to bother checking"

@gray17 I think we're going in circles. I'm not talking about correctness. Re: correctness I have human estimates for all these passages and can compare them all (every single one), precisely measuring their degree of agreement with human readers--which is close to humans' agreement with each other. But what we're talking about now is the model's habit of talking out loud, and that's not a question of correctness.
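The correctness check described in that post amounts to comparing model estimates against human estimates for the same passages. A minimal sketch of that comparison, using Pearson correlation as an assumed agreement measure and invented numbers (the post's actual metric and data may differ):

```python
# Sketch of comparing model time estimates against human estimates.
# Pearson correlation is an assumed choice of agreement measure here;
# all numbers are made up for illustration.
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

human = [0.5, 2.0, 24.0, 1.0, 72.0]   # hypothetical human estimates (hours elapsed)
model = [1.0, 2.0, 20.0, 0.5, 60.0]   # hypothetical model estimates (hours elapsed)
r = pearson(human, model)             # close to 1.0 means close agreement
```

The same function applied to two sets of human estimates gives a ceiling to compare against, which is the "close to humans' agreement with each other" check.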

@TedUnderwood yeah, ok. I'm not talking about correctness either.

in your experimental process, you ask GPT
1. write a chain of reasoning to answer a question
2. rate how "confident" the chain of reasoning seems

and you rely on the rating to improve the prompting. you're checking that the rating makes sense for a few, but you're not checking all of them, so you're implicitly trusting that summary in your feedback cycle. how does that distort the process vs not asking for a confidence rating?

@TedUnderwood maybe this is perfectly ok! but I don't know that a priori, I don't have any reason to think that the GPT rating of its own sentences is any more reliable than anything else it says, without checking them all. and you asked it to do that so you don't have to check them all, you're using it to filter down to things that seem useful. which might be ok! but I don't *know* a strong argument that it *is* ok
@gray17 Actually, I didn't rely on the rating to improve the prompting. I don't care much about the rating. I look at its explanation of what was hard, and then at the passage, to see what was confusing about the passage. You're right that we can't necessarily trust the rating itself as an accurate description of the whole process, for one thing because "high, medium, low" isn't very information-rich. No, the way to assess improvement is "does it get closer to human responses"? Which we know.
@TedUnderwood So why ask for the rating at all?
@gray17 Uh, why not? It's an experiment. Among other things I wanted to see if those ratings did correlate at all with the accuracy of the time estimates. So far I don't think they do.
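That correlation check can be sketched by grouping estimate errors by the model's self-reported rating. The ratings and errors below are invented for illustration; only the shape of the check reflects the thread:

```python
# Sketch of the check mentioned above: do the model's self-reported
# confidence ratings track the accuracy of its time estimates?
# The (rating, absolute error) pairs are invented for illustration.

def mean_error_by_rating(records):
    """Average absolute error of the estimates, grouped by confidence rating."""
    buckets = {}
    for rating, error in records:
        buckets.setdefault(rating, []).append(error)
    return {r: sum(errs) / len(errs) for r, errs in buckets.items()}

records = [
    ("high", 0.5), ("high", 2.0), ("moderate", 1.5),
    ("moderate", 1.0), ("low", 1.2), ("low", 0.8),
]
by_rating = mean_error_by_rating(records)
# If the ratings were informative, "high" would show the smallest mean
# error; the thread reports no such correlation in practice.
```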
@TedUnderwood ok, I guess I was misled by the statement in your post "I added step 5 (allowing the model to describe its own confidence) because in early experiments I found the model’s tendency to editorialize extremely valuable", which implies that the answer to step 5 affected your choices in some way.