Mastodawn

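Anthropic has presented an evaluation methodology for accurately measuring the performance of AI agents. Going beyond simple conventional benchmarks, it proposes combining unit tests with integration tests, and mixing deterministic grading with model-based grading, to evaluate an agent's ability to carry out complex tasks in which it uses tools and changes its environment.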
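As a rough illustration of mixing deterministic and model-based grading, here is a minimal Python sketch. It is not Anthropic's implementation: `Trial`, `deterministic_grade`, `hybrid_grade`, `stub_model_grader`, and the 0.7 threshold are all hypothetical names and values chosen for this example, and the model-based grader is a stub standing in for a real LLM judge.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Trial:
    """One agent run: the final environment state plus the agent's transcript."""
    final_state: Dict[str, str]
    transcript: str


def deterministic_grade(trial: Trial, expected_state: Dict[str, str]) -> bool:
    """Pass only if every expected key holds the expected value in the final state."""
    return all(trial.final_state.get(k) == v for k, v in expected_state.items())


def hybrid_grade(
    trial: Trial,
    expected_state: Dict[str, str],
    model_grader: Callable[[str], float],
    threshold: float = 0.7,  # hypothetical cutoff, chosen for illustration
) -> Dict[str, object]:
    """Combine a deterministic state check with a model-based quality score."""
    state_ok = deterministic_grade(trial, expected_state)
    quality = model_grader(trial.transcript)
    return {
        "state_ok": state_ok,
        "quality": quality,
        "passed": state_ok and quality >= threshold,
    }


def stub_model_grader(transcript: str) -> float:
    """Stand-in for an LLM judge; a real grader would score against a rubric."""
    return 1.0 if "refund issued" in transcript.lower() else 0.0


if __name__ == "__main__":
    trial = Trial({"ticket": "closed"}, "Refund issued to the customer.")
    print(hybrid_grade(trial, {"ticket": "closed"}, stub_model_grader))
```

The deterministic check covers what can be verified exactly (the environment ended in the right state), while the model-based score is reserved for qualities that are hard to assert mechanically, such as how the agent communicated.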

https://news.hada.io/topic?id=25711

#aiagentevaluation #llmevaluation #modeltesting #anthropic

Anthropic Engineering: A Practical Guide and Methodology for AI Agent Evaluation (Evals)

Summary: With existing LLM benchmarks alone, it is difficult to accurately measure the performance of "AI agents" that use tools and perform multi-step reasoning. ...

GeekNews

TECHi Apr 21, 2025

OpenAI faces criticism after Epoch AI’s benchmark results show its o3 model performing far below the company's claims. The discrepancy raises concerns about transparency, testing practices, and credibility in AI reporting.

#OpenAI #EpochAI #AITransparency #FrontierMath #AIEthics #ModelTesting #TechAccountability #AIModels #AIResearch #TECHi

Read the full article: https://www.techi.com/openai-o3-model-scores-low-benchmark-concerns-raised/