Counting as a minimal probe of language model reliability
This paper proposes a new evaluation called Stable Counting Capacity for assessing the reliability of large language models. The assay measures a model's procedural reliability through the task of counting repeated symbols and, unlike existing knowledge-based benchmarks, excludes semantics and ambiguity from the evaluation. The results show that current language models lack stable counting ability even within their advertised context limits, and in practice only mimic simple rules using a limited internal state. This suggests that fluent performance by a language model does not necessarily imply general, reliable rule following.

https://arxiv.org/abs/2605.02028

#languagemodels #modelreliability #counting #proceduralevaluation #nlp

Large language models perform strongly on benchmarks in mathematical reasoning, coding and document analysis, suggesting a broad ability to follow instructions. However, it remains unclear whether such success reflects general logical competence, repeated application of learned procedures, or pattern matching that mimics rule execution. We investigate this question by introducing Stable Counting Capacity, an assay in which models count repeated symbols until failure. The assay removes knowledge dependencies, semantics and ambiguity from evaluation, avoids lexical and tokenization confounds, and provides a direct measure of procedural reliability beyond standard knowledge-based benchmarks. Here we show, across more than 100 model variants, that stable counting capacity remains far below advertised context limits. Model behavior is consistent neither with open-ended logic nor with stable application of a learned rule, but instead with use of a finite set of count-like internal states, analogous to counting on fingers. Once this resource is exhausted, the appearance of rule following disappears and exact execution collapses into guessing, even with additional test-time compute. These findings show that fluent performance in current language models does not guarantee general, reliable rule following.
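In outline, the assay amounts to presenting a model with n repeated symbols, checking whether it returns the exact count, and increasing n until exact execution fails. Below is a minimal sketch of that loop, assuming a hypothetical `query_model` callable that returns a model's text reply; the prompt wording, symbol choice, trial count, and stopping rule are illustrative assumptions, not the paper's protocol.

```python
def make_prompt(n: int, symbol: str = "#") -> str:
    """Build a prompt containing exactly n copies of the symbol."""
    return (
        f"Count the occurrences of '{symbol}' in the following line "
        f"and reply with the number only.\n{symbol * n}"
    )


def is_exact(reply: str, n: int) -> bool:
    """Accept the reply only if its digits match the true count exactly."""
    digits = "".join(ch for ch in reply if ch.isdigit())
    return digits == str(n)


def stable_counting_capacity(query_model, n_max: int = 4096, trials: int = 5) -> int:
    """Sweep n upward and return the largest n the model counts exactly
    on every trial, stopping at the first failure."""
    capacity = 0
    for n in range(1, n_max + 1):
        if all(is_exact(query_model(make_prompt(n)), n) for _ in range(trials)):
            capacity = n
        else:
            break
    return capacity
```

With `query_model` bound to an actual model API call, the returned capacity can be compared against the model's advertised context window, which is the gap the paper reports.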
