Even GPT-5.2 Can't Count to Five: Zero-Error Horizons in Trustworthy LLMs

https://arxiv.org/abs/2601.15714

#HackerNews #EvenGPT5.2 #ZeroErrorHorizons #TrustworthyLLMs #AIResearch #MachineLearning

Even GPT-5.2 Can't Count to Five: The Case for Zero-Error Horizons in Trustworthy LLMs

We propose Zero-Error Horizon (ZEH) for trustworthy LLMs, which represents the maximum range that a model can solve without any errors. While ZEH itself is simple, we demonstrate that evaluating the ZEH of state-of-the-art LLMs yields abundant insights. For example, by evaluating the ZEH of GPT-5.2, we found that GPT-5.2 cannot even compute the parity of a short string like 11000, and GPT-5.2 cannot determine whether the parentheses in ((((()))))) are balanced. This is surprising given the excellent capabilities of GPT-5.2. The fact that LLMs make mistakes on such simple problems serves as an important lesson when applying LLMs to safety-critical domains. By applying ZEH to Qwen2.5 and conducting detailed analysis, we found that while ZEH correlates with accuracy, the detailed behaviors differ, and ZEH provides clues about the emergence of algorithmic capabilities. Finally, while computing ZEH incurs significant computational cost, we discuss how to mitigate this cost by achieving up to one order of magnitude speedup using tree structures and online softmax.

arXiv.org
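The Zero-Error Horizon idea in the abstract above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes problem size means string length, that every binary string up to a given length is checked exhaustively, and that the task is parity. The names `zero_error_horizon` and `flaky_solver` are hypothetical, and `flaky_solver` is only a stand-in for a model that happens to fail on 11000. It does not reproduce any real model's output.

```python
from itertools import product
from typing import Callable


def parity(bits: str) -> int:
    """Ground truth: 1 if the string contains an odd number of '1's, else 0."""
    return bits.count("1") % 2


def zero_error_horizon(solver: Callable[[str], int], max_len: int) -> int:
    """Return the largest n <= max_len such that `solver` is correct on
    every binary string of length 1..n (0 if it already fails at length 1).

    Assumption: "range" in the abstract is read here as input length.
    """
    horizon = 0
    for n in range(1, max_len + 1):
        for chars in product("01", repeat=n):
            s = "".join(chars)
            if solver(s) != parity(s):
                return horizon  # first error found: the horizon stops below n
        horizon = n  # every string of length n was answered correctly
    return horizon


def flaky_solver(bits: str) -> int:
    """Hypothetical stand-in for a model: correct everywhere except '11000'."""
    if bits == "11000":
        return 1  # wrong on purpose; the true parity of 11000 is 0
    return parity(bits)


if __name__ == "__main__":
    print(zero_error_horizon(flaky_solver, max_len=8))
    print(zero_error_horizon(parity, max_len=8))
```

A single wrong answer at length 5 caps the horizon at 4 even though the solver is right on every other input, which is why this metric can diverge from average accuracy. The exhaustive loop grows as 2^n, which hints at the computational cost the abstract mentions.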

Research Spots Surface in Digital Twin Domain, Signaling Shift in Academic Focus

Aalborg University, NTNU, UCL, and QMUL offer new PhDs in digital twins, AI, and LLMs for sustainability and energy. Learn more.

#DigitalTwins, #AIResearch, #PhDLife, #Sustainability, #LLM

https://newsletter.tf/new-phds-digital-twins-ai-sustainability-ucl-qmul/

Several universities like UCL and QMUL are offering new PhDs in digital twins and AI. This is a big increase in research spots for these topics.


New PhDs in Digital Twins and AI for Sustainability at UCL and QMUL

NewsletterTF

The energy at our recent #AISentience Scholars webinar was incredible!

🎬 You can watch the recording here: https://youtu.be/m-_lmhywBsM?si=lcRyeyThOzAhRPG1 and see the links in the video description for more info.

The excitement is growing; don't miss your chance to join this unique program! ⏰ Apply by 28 April 2026.

#AIResearch #ConsciousnessStudies #EarlyCareerResearchers #AcademicCommunity #Mentorship #ResearchOpportunities #EthicsInAI #AISETS #GraduateStudents #Postdocs

AI Sentience Scholars Application Support Webinar

YouTube

fly51fly (@fly51fly)

Researchers from Google Research and the Technical University of Munich have released a paper on Target-Aligned Reinforcement Learning, a reinforcement learning technique for target alignment. It appears to be research related to improving reward alignment, safety, and training stability in AI models.

https://x.com/fly51fly/status/2039459102313808325

#reinforcementlearning #alignment #googleresearch #airesearch #machinelearning

fly51fly (@fly51fly) on X

[LG] Target-Aligned Reinforcement Learning L S. Pleiss, J Harrison, M Schiffer [Technical University of Munich & Google Research] (2026) https://t.co/S2UjFADiwi

X (formerly Twitter)
Anthropic's widely circulated claim that AI could perform 80% of job tasks rests on a 2023 study with significant limitations. The "theoretical capability" metric comes from research by OpenAI, OpenResearch, and the University of Pennsylvania, which used annotators unfamiliar with specific occupations to estimate where LLMs might reduce task time by at least 50%. Critically, the study makes no timeline predictions and relies on speculative assumptions about "anticipated LLM-powered software" that doesn't yet exist. Current observed AI exposure remains "a fraction of what's feasible," Anthropic acknowledges. https://arstechnica.com/ai/2026/03/how-did-anthropic-measure-ais-theoretical-capabilities-in-the-job-market/ #AIagent #AI #GenAI #AIResearch #Workforce
How did Anthropic measure AI's "theoretical capabilities" in the job market?

2023 study made a lot of assumptions about future "anticipated LLM-powered software."

Ars Technica

After 2 years researching AI-generated misinformation (4 papers at WWW '26), I'm expanding into agentic AI.

Same core question, harder version: how do we maintain trust when AI acts autonomously? When an agent sends emails or books meetings on your behalf, how do you verify it did what you intended?

Open questions: verification at action time, trust calibration, safe agent-tool interface design.

https://alexloth.com/from-misinformation-to-agentic-ai-research-direction-2/

#AgenticAI #AIResearch #Trust

From Misinformation to Agentic AI: Where My Research Is Heading

After two years researching how AI generates misinformation, I am expanding into agentic AI systems. The trust questions are similar but harder: when agents act autonomously, how do we verify intent, calibrate trust, and maintain oversight?

alexloth.com
Liquid AI has released LFM2.5-350M, a compact 350M parameter model trained on 28 trillion tokens that outperforms models more than twice its size. The model uses a hybrid LIV architecture supporting a 32k context window while maintaining a lean memory footprint. https://www.marktechpost.com/2026/03/31/liquid-ai-released-lfm2-5-350m-a-compact-350m-parameter-model-trained-on-28t-tokens-with-scaled-reinforcement-learning/ #AIagent #AI #GenAI #AIResearch #LiquidAI
Liquid AI Released LFM2.5-350M: A Compact 350M Parameter Model Trained on 28T Tokens with Scaled Reinforcement Learning

MarkTechPost
Meta's semi-formal reasoning boosts LLM code review accuracy to 93%. The technique requires AI agents to document premises, trace execution paths, and derive formal conclusions before answering, cutting hallucinations. Tests showed accuracy rising from 78% to 88% on complex examples. https://venturebeat.com/orchestration/metas-new-structured-prompting-technique-makes-llms-significantly-better-at #AIagent #AI #GenAI #AIResearch #Meta

Python Trending (@pythontrending)

Matrix-Game 3.0 has been released. It is a real-time, streaming interactive world model whose key feature is support for long-horizon memory. It is a notable new technique for AI simulation and world model research.

https://x.com/pythontrending/status/2038938236764946537

#worldmodel #matrixgame #airesearch #longtermmemory #interactiveai

Python Trending 🇺🇦 (@pythontrending) on X

Matrix-Game - Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory https://t.co/urKh6DKnSX

X (formerly Twitter)