Start your week off right with #enterpriseAI #changemanagement tips from IT leaders Juan Orlandini, Fabien CROS, Kulvir Gahunia and Dana Harrison. My in-depth look at how #gamification, #AIevaluation platforms, #platformengineering and other approaches helped companies such as Insight, Ducker Carlisle and TELUS adopt #AI effectively: https://www.techtarget.com/searchitoperations/news/366640354/IT-leaders-share-enterprise-AI-change-management-tips

Google Stax just turned its LLM into a judge, automatically scoring model outputs against your own criteria. That opens the door to open-source benchmarking: developers can run fast, reproducible evaluations without hand-crafting metrics. Curious how it works and what it means for AI research? Dive in for the details. #LLMasJudge #AIevaluation #GoogleStax #PromptBenchmarking

🔗 https://aidailypost.com/news/google-stax-uses-llm-as-judge-autoevaluate-model-outputs-by-your
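The judge pattern described above can be sketched in a few lines: a judge callable scores each output against a rubric, and a small harness aggregates the scores. Everything here (the rubric, `toy_judge`, the 0–1 scale) is illustrative, not Stax's actual API; in practice the judge would wrap an LLM call prompted to grade against the rubric.

```python
# Minimal LLM-as-judge harness (hypothetical sketch, not the Stax API).
from statistics import mean
from typing import Callable

RUBRIC = "Answer must be concise and cite a source."

def evaluate(outputs: list[str], judge: Callable[[str, str], float],
             rubric: str = RUBRIC) -> dict:
    """Score each candidate output against the rubric and aggregate."""
    scores = [judge(rubric, o) for o in outputs]
    return {"scores": scores, "mean": mean(scores)}

# Stand-in judge for illustration; a real judge would be an LLM call
# returning a 0-1 score for rubric compliance.
def toy_judge(rubric: str, output: str) -> float:
    return 1.0 if "source:" in output.lower() else 0.0

report = evaluate(["Short answer. Source: docs.", "No citation here."], toy_judge)
```

The point of the pattern is that the rubric, not a hand-coded metric, defines correctness, which is what makes evaluations cheap to re-run as criteria change.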

Artificial Analysis (@ArtificialAnlys)

Claude Sonnet 4.6 reportedly takes second place in the Artificial Analysis Intelligence Index, behind Opus 4.6. In max effort mode, Sonnet 4.6 used roughly 3x more output tokens than 4.5, while leading all models on GDPval-AA and TerminalBench, narrowly ahead of Opus 4.6. A performance-and-efficiency comparison.

https://x.com/ArtificialAnlys/status/2024259812176121952

#claude #sonnet4.6 #opus4.6 #benchmarks #aievaluation

Artificial Analysis (@ArtificialAnlys) on X

Claude Sonnet 4.6 takes second place in the Artificial Analysis Intelligence Index (behind Opus 4.6), but used ~3x more output tokens than Claude Sonnet 4.5 in its max effort mode. Sonnet 4.6 leads all models in GDPval-AA and TerminalBench, including a slight lead over Opus 4.6.


Artificial Analysis (@ArtificialAnlys)

MiniMax has unveiled MiniMax-M2.5, up +2 points over M2.1. Its Artificial Analysis Intelligence Index and GDPval-AA scores rose, but the AA-Omniscience evaluation reports a higher hallucination rate, so the update pairs a performance gain with a side effect: increased hallucination.

https://x.com/ArtificialAnlys/status/2022476857896218925

#minimax #llm #modelrelease #aievaluation

Artificial Analysis (@ArtificialAnlys) on X

MiniMax has released MiniMax-M2.5, an incremental upgrade over M2.1, up +2 points in the Artificial Analysis Intelligence Index and supported by a higher GDPval-AA score. But the model also has a higher hallucination rate in AA-Omniscience.


A developer has built an open-source coding evaluation tool (SanityHarness) and run it against 49 model/coding-agent pairs, including Kimi K2.5. The SanityBoard leaderboard scores performance and cost and compares BYOK-capable models. Findings: Codebuff is expensive but underperforms, while Droid and Minimax stand out. The community is invited to join the testing via Discord. #AI #LậpTrình #ĐánhGiáAI #MãNguồnMở #Coding #AIEvaluation

https://www.reddit.com/r/LocalLLaMA/comments/1qp4ftj/i_made_a_coding_eval_and_ran_it_against_4

TrustifAI – a trustworthiness evaluation framework for AI/RAG systems with multi-dimensional scoring: evidence coverage, logical consistency, semantic drift, source diversity, and generation confidence. Builds argument graphs and Mermaid visualizations for root-cause tracing. A solution for enterprise, governance, and compliance settings. #TrustifAI #RAG #AIEvaluation #AIinVietnam #ĐánhGiáAI #HệThốngThôngMinh

https://www.reddit.com/gallery/1qmhvuz
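A multi-dimensional trust score like the one the post describes typically combines per-dimension scores into one number. The sketch below uses the five dimension names from the post; the weights, the [0, 1] scale, and the inversion of semantic drift (where lower is better) are assumptions, not TrustifAI's actual formula.

```python
# Composite trust score over five dimensions (weights are illustrative).
WEIGHTS = {
    "evidence_coverage": 0.3,
    "logical_consistency": 0.25,
    "semantic_drift": 0.15,       # lower drift is better, so inverted below
    "source_diversity": 0.15,
    "generation_confidence": 0.15,
}

def trust_score(dims: dict[str, float]) -> float:
    """Combine per-dimension scores in [0, 1] into a single trust score."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        value = dims[name]
        if name == "semantic_drift":
            value = 1.0 - value   # penalize drift instead of rewarding it
        total += weight * value
    return round(total, 3)

score = trust_score({
    "evidence_coverage": 0.8,
    "logical_consistency": 0.9,
    "semantic_drift": 0.2,
    "source_diversity": 0.5,
    "generation_confidence": 0.7,
})
```

Keeping the dimensions separate until the final weighted sum is what makes root-cause tracing possible: a low composite score can be attributed back to the weakest dimension.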

If AI had to explain your startup, what would it say? Not what you want to say, but what the AI has learned: who you resemble, which category you belong to, or whether you get overlooked entirely. Check now, before the market defines you! #AI #Startup #Founder #AIevaluation #KhởiNghiệp #ĐịnhVịThươngHiệu

https://www.reddit.com/r/SaaS/comments/1qkrwbm/if_ai_had_to_explain_your_startup_tomorrow_what/

Data contamination threatens #LLM #AIEvaluation. Scaling has "limits to growth". The new #ARCAGI2 counters this problem with contamination-resistant, compositional reasoning tests and human baselines that require original reasoning, not just memory-recall evaluation. arxiv.org/abs/2505.11831

ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems

The Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI), introduced in 2019, established a challenging benchmark for evaluating the general fluid intelligence of artificial systems via a set of unique, novel tasks only requiring minimal prior knowledge. While ARC-AGI has spurred significant research activity over the past five years, recent AI progress calls for benchmarks capable of finer-grained evaluation at higher levels of cognitive complexity. We introduce ARC-AGI-2, an upgraded version of the benchmark. ARC-AGI-2 preserves the input-output pair task format of its predecessor, ensuring continuity for researchers. It incorporates a newly curated and expanded set of tasks specifically designed to provide a more granular signal to assess abstract reasoning and problem-solving abilities at higher levels of fluid intelligence. To contextualize the difficulty and characteristics of ARC-AGI-2, we present extensive results from human testing, providing a robust baseline that highlights the benchmark's accessibility to human intelligence, yet difficulty for current AI systems. ARC-AGI-2 aims to serve as a next-generation tool for rigorously measuring progress towards more general and human-like AI capabilities.
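The input-output pair task format the abstract describes is simple to represent: small integer grids, a few demonstration pairs, and a held-out test input, scored by exact match. The toy task and solver below are schematic illustrations of that format, not the official ARC harness or a real ARC task.

```python
# ARC-style task: colored grids as lists of ints, train pairs + test input.
Grid = list[list[int]]

task = {
    "train": [
        {"input": [[1, 0], [0, 0]], "output": [[0, 0], [0, 1]]},
        {"input": [[0, 2], [0, 0]], "output": [[0, 0], [2, 0]]},
    ],
    "test": {"input": [[0, 0], [3, 0]]},  # hidden rule here: rotate 180 degrees
}

def rotate180(grid: Grid) -> Grid:
    """Candidate solver for this toy task: flip rows, then flip each row."""
    return [row[::-1] for row in grid[::-1]]

def solved(task: dict, solver) -> bool:
    """ARC scoring is all-or-nothing: every cell must match exactly."""
    return all(solver(p["input"]) == p["output"] for p in task["train"])

ok = solved(task, rotate180)
```

The exact-match, few-shot structure is what makes the benchmark contamination-resistant in spirit: each task's rule must be induced from its own demonstration pairs rather than recalled.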


Artificial Analysis (@ArtificialAnlys)

Artificial Analysis has announced Intelligence Index v4.0. This version introduces three new evaluations, designed to align more closely with real-world use cases and to reduce saturation. The index is presented as a synthesis metric for comprehensively assessing generalist model performance.

https://x.com/ArtificialAnlys/status/2008570646897573931

#artificialanalysis #intelligenceindex #benchmark #aievaluation

Artificial Analysis (@ArtificialAnlys) on X

New year, new Artificial Analysis Intelligence Index! Announcing Intelligence Index v4.0: incorporating 3 new evaluations, further aligning to real-world use and reducing saturation. The Artificial Analysis Intelligence Index is our synthesis metric for assessing generalist models.

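A synthesis metric of the kind the index describes rolls several benchmark scores into one number. The sketch below assumes equal weighting and made-up evaluation names for illustration; it is not Artificial Analysis's actual v4.0 formula or evaluation set.

```python
# Toy synthesis metric: average of per-evaluation scores normalized to 0-100.
def intelligence_index(evals: dict[str, float]) -> float:
    """Equal-weight average of benchmark scores (illustrative only)."""
    return round(sum(evals.values()) / len(evals), 1)

idx = intelligence_index({
    "agentic_coding": 62.0,   # hypothetical evaluation names
    "terminal_use": 48.0,
    "long_context_qa": 70.0,
})
```

Adding new, harder evaluations to such a composite is exactly how index maintainers reduce saturation: a model that tops the old components can no longer max out the average.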