This paper shows that chain-of-thought faithfulness isn’t a single objective number. Applied to the same data, different faithfulness classifiers shift scores by up to 30 points and can even reverse model rankings. The choice of measurement matters more than we usually admit.
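
To make the ranking-reversal point concrete, here is a minimal sketch with made-up toy labels. None of these numbers come from the paper; the names `classifier_a`, `classifier_b`, `model_x`, and `model_y` are hypothetical, and the score is assumed to be the percentage of items a classifier labels as faithful.

```python
from typing import Dict, List

# Hypothetical toy labels (1 = judged faithful, 0 = not). NOT data from the paper.
# Both classifiers label the same 10 items for each model.
labels: Dict[str, Dict[str, List[int]]] = {
    "classifier_a": {
        "model_x": [1, 1, 1, 0, 1, 1, 0, 1, 1, 1],
        "model_y": [1, 0, 1, 0, 1, 1, 0, 1, 0, 1],
    },
    "classifier_b": {
        "model_x": [1, 0, 1, 0, 0, 1, 0, 1, 0, 1],
        "model_y": [1, 1, 1, 0, 1, 1, 0, 1, 0, 1],
    },
}


def faithfulness_score(item_labels: List[int]) -> float:
    """Percentage of items labeled faithful (an assumed scoring rule)."""
    return 100.0 * sum(item_labels) / len(item_labels)


def rank_models(per_model: Dict[str, List[int]]) -> List[str]:
    """Model names ordered from highest to lowest faithfulness score."""
    return sorted(per_model, key=lambda m: faithfulness_score(per_model[m]), reverse=True)


# Score and rank every model under each classifier.
scores = {
    clf: {model: faithfulness_score(v) for model, v in per_model.items()}
    for clf, per_model in labels.items()
}
rankings = {clf: rank_models(per_model) for clf, per_model in labels.items()}

for clf in labels:
    print(clf, scores[clf], rankings[clf])
```

In this toy setup the same model moves by 30 points depending on the classifier, and the two classifiers disagree on which model ranks first, which is the kind of effect the paper describes.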

Read the full paper: http://arxiv.org/abs/2603.20172v1