Claude mixes up who said what and that's not OK
https://dwyer.co.za/static/claude-mixes-up-who-said-what-and-thats-not-ok.html
> This class of bug seems to be in the harness, not in the model itself. It’s somehow labelling internal reasoning messages as coming from the user, which is why the model is so confident that “No, you said that.”
Are we sure about this? Accidentally mis-routing a message is one thing, but those messages also distinctly "sound" like user messages, not like something you'd read in a reasoning trace.
I'd like to know whether those messages were emitted inside "thought" blocks, or whether the model might actually have emitted the formatting tokens that mark a user message. (In that case the harness bug would be that the model is allowed to emit tokens it should only ever receive as inputs - but I think the larger question would be why it emits them at all.)
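One way a harness can rule out the second possibility is to mask the role-marker tokens during sampling, so the model can never emit tokens it should only receive. This is a minimal sketch of that idea; the token ids and marker names are made up for illustration, not any real model's vocabulary.

```python
import math

# Hypothetical ids for role-marker tokens like "<|user|>" and "<|assistant|>".
# A real harness would look these up in the model's tokenizer.
ROLE_MARKER_IDS = {100261, 100262}

def mask_role_markers(logits):
    """Set role-marker logits to -inf so sampling can never select them."""
    for tok_id in ROLE_MARKER_IDS:
        if tok_id in logits:
            logits[tok_id] = -math.inf
    return logits

# Toy example: a tiny logits table keyed by token id.
logits = {42: 1.0, 100261: 5.0, 100262: 3.0}
masked = mask_role_markers(logits)
```

If the harness does apply this kind of mask, then a user-labelled message appearing mid-generation almost has to be a routing bug rather than the model literally typing out a `<|user|>` token.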
Yeah, those are usually called "reasoning" or "thinking" tokens AFAIK, so I think the terminology is correct. But in the traces I've seen, they're usually in a sort of diary style and start by repeating the last user request and tool results. They don't introduce new requirements out of the blue.
Also, they're usually bracketed by special tokens to distinguish them from "normal" output for both the model and the harness.
(They can get pretty weird, like in the "user said no but I think they meant yes" example from a few weeks ago. But I think that takes a few rounds of wrong conclusions and motivated reasoning before it gets to that point - not right at the beginning.)