"Reliance on OpenAI is still a bad idea in the long run. Universities should develop their own models and APIs."
💯💯💯
@TedUnderwood @dh Very interesting report; thanks for sharing. Amazing how quickly it goes from !!😬!! to 'wake up each moment with eternal sunshine of the spotless mind.'
Not sure about your last point:
"To be confident that we’re measuring something called 'suspense' we need to show that multiple people recognize it as suspense."
We can always define a concept and then apply it; the model's performance is additional feedback on the adequacy of our definition, isn't it?
@TedUnderwood But have you studied this other than "scanned a few, seems right to me"? Is it 95% or 99% correct? These models are also known to be vulnerable to adversarial inputs. How often is that a problem?
I mean, yes, it's useful, but I'm wary that it's very easy for humans to slide implicitly from "it's maybe 95% correct" to "the wording is pretty authoritative, it's probably 100% correct, I'm not going to bother checking."
@TedUnderwood yeah, ok. I'm not talking about the correctness of that.
in your experimental process, you ask GPT to:
1. write a chain of reasoning to answer a question
2. rate how "confident" the chain of reasoning seems
and you rely on the rating to improve the prompting. you're checking that the rating makes sense for a few of them, but you're not checking all of them, so you're implicitly trusting that summary in your feedback cycle. how does that distort the process versus not asking for a confidence rating at all?
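concretely, the loop I mean looks something like this. a minimal Python sketch, assuming the current OpenAI client library; the model name, prompt wording, and the 0.8 review threshold are placeholders I made up, not anything from your setup:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask(prompt: str) -> str:
    """One chat-completion call; returns the model's text reply."""
    resp = client.chat.completions.create(
        model="gpt-4",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


def answer_with_confidence(question: str) -> tuple[str, float]:
    # step 1: ask for a chain of reasoning
    chain = ask(f"Reason step by step, then answer: {question}")
    # step 2: ask the model to rate how confident its own chain seems
    rating = ask(
        "On a scale from 0 to 1, how confident does this reasoning seem? "
        f"Reply with just a number.\n\n{chain}"
    )
    return chain, float(rating.strip())  # assumes the reply is a bare number


def feedback_cycle(questions: list[str]) -> None:
    for q in questions:
        chain, confidence = answer_with_confidence(q)
        if confidence < 0.8:  # placeholder threshold
            # low-rated chains get a human look and a prompt revision
            print(f"REVIEW: {q!r} (self-rated {confidence:.2f})")
        # high-rated chains sail through unchecked -- this branch is the
        # implicit trust in the model's self-rating that I'm asking about
```

the distortion question is about that last branch: anything the model rates highly never gets a human look, so errors the model is confidently wrong about never feed back into the prompt revisions.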