I expanded the #Vera benchmark to six models across three providers (Anthropic, OpenAI, Moonshot). Kimi K2.5 got every Vera problem right while only managing 86% in Python and 91% in TypeScript. Across the flagship models, Vera and Python are level at 93%.
Yet there is no Vera in any training data, and there are no examples on GitHub or Stack Overflow. Every token was generated from a single spec document in the prompt. Language design is doing a lot of the heavy lifting.








