Talk like caveman
Oh boy. Someone didn't get the memo that for LLMs, tokens are units of thinking. That is, whatever feat of computation is needed to produce the result you seek has to fit into the tokens the LLM produces. Being a finite system, the model's internals can only do so much computation per token, so the more you force it to be concise, the harder the task becomes. In the worst case, you can guarantee a bad answer, because the task requires more computation than the tokens produced allow.
I.e., by demanding that the model be concise, you're literally making it dumber.
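The argument above can be sketched as a toy model - a machine with a fixed, hypothetical compute budget per emitted token (an illustrative assumption, not a claim about any real architecture). A task needing more sequential steps than the token budget allows simply cannot be finished:

```python
OPS_PER_TOKEN = 1  # hypothetical fixed amount of work per emitted token

def solvable(required_ops: int, max_tokens: int) -> bool:
    """In this toy model, a task needing more sequential operations than
    the token budget allows cannot be completed, no matter how clever
    each per-token step is."""
    return required_ops <= OPS_PER_TOKEN * max_tokens

# A 10-step derivation fits in a 10-token budget but not in a 3-token one:
print(solvable(10, 10))  # True
print(solvable(10, 3))   # False
```

Forcing conciseness is, in this picture, shrinking `max_tokens` while `required_ops` stays fixed.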
(Separating out "chain of thought" into "thinking mode" and removing user control over it definitely helped with this problem.)
I agree with this take in general, but I think we need to be prepared for nuance when thinking about these things.
Tokens are how an LLM works things out, but I think it's just as likely as not that LLMs (like people) can overthink a problem to the point of reaching a wrong answer, when their "gut" response would have been better. I don't contend that this is the default mode, only that it's possible, and that it's more or less likely on one kind of problem than another, with the problem categories yet to be determined.
A specific example of this was the era of chat interfaces that leaned too far in the direction of web search when responding to user queries. No, Claude, I don't want a recipe blogspam link or a summary - just listen to your heart and tell me how to mix pancakes.
More abstractly: LLMs give the running context window a lot of credit, and will work hard to post-hoc rationalize whatever is in there, including any prior low-likelihood tokens. I expect many problematic 'hallucinations' are the result of an unlucky run of two or more low-probability tokens in a row, and the likelihood of that happening in a given response scales ~linearly with the length of the response.
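The last claim can be checked with a rough simulation, under the simplifying (and debatable) assumption that each token is independently "unlucky" with some small probability p. For small p, the chance that a response contains a run of two or more unlucky tokens grows roughly linearly with its length:

```python
import random

random.seed(0)

def p_bad_run(length: int, p: float = 0.02, run: int = 2,
              trials: int = 10000) -> float:
    """Monte Carlo estimate of the probability that a response of
    `length` tokens contains `run` consecutive low-probability tokens,
    each token unlucky independently with probability p. Toy model only."""
    hits = 0
    for _ in range(trials):
        streak = 0
        for _ in range(length):
            streak = streak + 1 if random.random() < p else 0
            if streak >= run:
                hits += 1
                break
    return hits / trials

# Doubling the response length roughly doubles the risk:
for n in (100, 200, 400):
    print(n, round(p_bad_run(n), 3))
```

Real token streams are not independent draws, of course - this only illustrates why longer responses give bad runs more chances to occur.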
That was my first thought too -- instead of making it talk like a caveman, you could turn off reasoning, probably with better results.
Additionally, LLMs do not actually operate in text; much of the thinking happens in a much higher dimensional space that just happens to be decoded as text.
So unless the LLM was trained otherwise, making it talk like a caveman does more than just theoretically turn it into one.
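The "higher dimensional space decoded as text" point can be pictured with a toy unembedding step: the model's state at each position is a large hidden vector, and a token only appears when that vector is projected onto the vocabulary and collapsed to a single index. The dimensions below are made up and far smaller than in real models:

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, vocab = 64, 1000  # toy sizes; real models are vastly larger

hidden = rng.standard_normal(d_model)              # the internal "thought"
W_unembed = rng.standard_normal((vocab, d_model))  # toy unembedding matrix

logits = W_unembed @ hidden        # one score per vocabulary entry
token_id = int(np.argmax(logits))  # 64 numbers collapse to one token index

print(hidden.shape, "->", token_id)
```

Most of the information in `hidden` is thrown away at each decoding step; the text is a lossy readout of the state, not the state itself.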
> much of the thinking happens in a much higher dimensional space that just happens to be decoded as text.
What do you mean by that? It’s literally text prediction, isn’t it?
This is condescending and wrong at the same time (best combo).
LLMs do stumble into long prediction chains that don’t lead the inference in any useful direction, wasting tokens and compute.
Indeed. But I have tried this skill and can confirm that the thinking phase is not affected. At least in my few attempts, it applied the "caveman talk" only to the output, after the initial response was formulated in the thinking process. I used opencode.
You are right, of course, that it does not really reduce token usage. If anything, it consumes more tokens, because it has to apply the skill on top of the initial result. I do appreciate the conciseness of the output, though :)