How well can #AI judge reasoning quality?
Rewriting agents' chain-of-thought "style" to *appear* more reflective (without changing action or inference) increased an #LLM judge's false positive rate by 3 percentage points (an 18% relative increase).
@ByrdNick Quite interesting, thanks for sharing! In the end, they're still just simulating that they reason... By the way, this monthly blog is focused on LLM evaluation; you might find it interesting.