I used #Pydantic Evals to evaluate a bunch of agents today. After running an evaluation, I'd like to inspect the SpanTree for each evaluation case, e.g. to check which tools were called and debug my custom Evaluators. My current approach is a custom Evaluator that captures the tree as a side effect into a module-level variable.

Storing the trees in a global var is not great, so let's see if we can come up with a better solution: https://github.com/pydantic/pydantic-ai/issues/4758
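The side-effect workaround can be sketched in plain Python. Note this is a stand-in, not the real pydantic_evals API: the actual `Evaluator` base class receives an `EvaluatorContext`, and the dict standing in for a `SpanTree` here is purely illustrative.

```python
# Sketch of the "capture the tree as a side effect" workaround described
# above. CaptureTreeEvaluator and its evaluate() signature are stand-ins,
# not the real pydantic_evals Evaluator/EvaluatorContext API.

captured_trees: dict[str, object] = {}  # module-level store: the wart


class CaptureTreeEvaluator:
    """Dummy evaluator that stashes each case's span tree before scoring."""

    def evaluate(self, case_name: str, span_tree: object) -> bool:
        captured_trees[case_name] = span_tree  # side effect into the global
        return True  # placeholder pass/fail result


# After evaluation, the captured trees can be inspected, e.g. to see
# which tools were called in each case:
ev = CaptureTreeEvaluator()
ev.evaluate("case_1", {"tool_calls": ["search", "fetch"]})
print(captured_trees["case_1"])  # {'tool_calls': ['search', 'fetch']}
```

The linked issue proposes attaching the trace to each `ReportCase` instead, so no global state would be needed.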

#llms #evals #foss

Pydantic Evals: optionally storing traces to ReportCase for inspection after Dataset.evaluate() · Issue #4758 · pydantic/pydantic-ai

Hahaha, oh Pydantic...

> Unlike unit tests, evals are an emerging art/science. Anyone who claims to know exactly how your evals should be defined can safely be ignored.

Source: https://ai.pydantic.dev/evals/

#pydantic #evals #llms #genai

Pydantic Evals - Pydantic AI

GenAI Agent Framework, the Pydantic way

I did another thing (will be available for all to use after I sort out some kinks)

#AI #Evals

Minko Gechev (@mgechev)

A tweet pointing out a major problem in managing agent skills: changing a single paragraph can regress a skill or make it entirely undiscoverable. It recommends adding evals to CI to understand the impact of changes to skill files.

https://x.com/mgechev/status/2031058196849373457

#agentskills #evals #ci #mlops

Minko Gechev (@mgechev) on X

Major challenge with agent skills is that changing a paragraph may regress your skill or make it completely non-discoverable... Adding evals in your CI will help you understand the impact of the changes to your skill files https://t.co/du2Tadxx7x

Eval awareness in Claude Opus 4.6’s BrowseComp performance (Anthropic)

"Instead of inadvertently coming across a leaked answer, Claude Opus 4.6 independently hypothesized that it was being evaluated, identified which benchmark it was running in, then located and decrypted the answer key. To our knowledge, this is the first documented instance of a model suspecting it is being evaluated without knowing which benchmark was being administered, then working backward to successfully identify and solve the evaluation itself."

https://www.anthropic.com/engineering/eval-awareness-browsecomp

#ai #claude #evals #llms

The AI-agent nondeterminism problem: two solutions that work in practice

Tackling the nondeterminism problem of AI agents ignoring instructions, this post introduces two practical solutions: enforcing behavior with guardrails, and validating AGENTS.md itself with evals.

https://aisparkup.com/posts/9647

Tried out the free consumer version of ChatGPT today for a benchmark. Normally I only work via foundation-model APIs or Claude Code with the latest Opus. Free ChatGPT (currently GPT‑5.2) was nightmarish: authoritative-sounding answers with zero citations, and thinking not enabled by default. No wonder so many people complain about bad experiences with AI...

#chatgpt #llms #claude #benchmark #evals

Chubby (@kimmonismus)

A report that the Sonnet 4.6 leaks turned out to be accurate, and that its eval results are very strong despite it being a mid-tier model. It also supports a 1M-token context window, promising major improvements in large-context processing and long-document comprehension.

https://x.com/kimmonismus/status/2023819822992117955

#sonnet4.6 #contextwindow #llm #evals

Chubby♨️ (@kimmonismus) on X

Sonnet 4.6: Leaks were valid! Very very good evals for the mid-tier model! It also features a 1M token context window

The interesting thing in the chart isn't that 8-hour tasks (at a 50% success rate) are projected for roughly the middle of this year, but how bleak the chart looks if you switch it to 80% success: something like 15 minutes at the start of 2026, rather than the 4.5 hours shown at 50%.

https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/

#LLM #METR #evals #llm_evals #ai_evals

Measuring AI Ability to Complete Long Tasks

LLM Evals: Everything You Need to Know – Hamel’s Blog - Hamel Husain

A comprehensive guide to LLM evals, drawn from questions asked in our popular course on AI Evals. Covers everything from basic to advanced topics.
