How do you score a model's confidence when perfect calibration can be gamed by guessing the base rate? EURO scores confidence by the payoff of trusting or abstaining across risk levels, and ACUTE reads the model's activations to decide when to trust an answer. Across six models and three tasks, ACUTE beats the calibration baselines on EURO while staying well calibrated, with the biggest gain on tool-calling and no gain on summarization.

The ACUTE Protocol: Operationalizing Language Model Activations for Better Calibration, Utility, and Trust – synesis
An activation-based protocol trains lightweight probes on a model’s own hidden states to decide when to trust its output, paired with a utility metric that calibration scores alone can’t game.
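To make the base-rate point concrete, here is a minimal sketch on toy data. It uses the standard binned expected calibration error (ECE) and a deliberately simple payoff rule. The `trust_payoff` function, its `threshold` and `wrong_cost` parameters, and the toy labels are assumptions chosen for illustration; they are not the definition of EURO and not part of the ACUTE protocol.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard binned ECE: weighted mean of |accuracy - mean confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, y))
    total = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_conf = sum(c for c, _ in b) / len(b)
        acc = sum(y for _, y in b) / len(b)
        ece += len(b) / total * abs(acc - mean_conf)
    return ece


def trust_payoff(confidences, correct, threshold, wrong_cost):
    """Hypothetical payoff (NOT the EURO definition): +1 for trusting a correct
    answer, -wrong_cost for trusting a wrong one, 0 for abstaining."""
    total = 0.0
    for c, y in zip(confidences, correct):
        if c >= threshold:
            total += 1.0 if y else -wrong_cost
    return total / len(confidences)


# Toy data: 70 correct answers, 30 wrong ones.
correct = [1] * 70 + [0] * 30

# "Gamed" confidence: always report the base rate, whatever the answer.
constant = [0.7] * len(correct)
# Informative confidence: imperfectly calibrated, but it separates right from wrong.
informative = [0.9 if y else 0.2 for y in correct]

ece_constant = expected_calibration_error(constant, correct)
ece_informative = expected_calibration_error(informative, correct)

# Under a high cost for wrong answers, the constant scorer can only trust
# everything or abstain on everything; the informative scorer can be selective.
payoff_constant = max(
    trust_payoff(constant, correct, threshold=t, wrong_cost=4.0) for t in (0.5, 0.8)
)
payoff_informative = trust_payoff(informative, correct, threshold=0.5, wrong_cost=4.0)

print(f"ECE    constant={ece_constant:.3f} informative={ece_informative:.3f}")
print(f"payoff constant={payoff_constant:.3f} informative={payoff_informative:.3f}")
```

In this toy setup the constant scorer has essentially zero ECE yet earns less than the informative scorer, because it cannot tell which answers to trust. That gap is the kind of behavior a payoff-based score is meant to expose.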


