How many of us are evaling our skills?

Apastra is a lightweight evaluation framework for testing AI agent prompts and skills locally. Prompts, datasets, evaluators, and test suites are defined in YAML- and JSONL-based specs, so prompt behavior can be re-verified repeatedly, much like unit tests. It also supports automated regression testing via GitHub Actions, catching quality regressions before they ship. It is language-agnostic, bundles a Python runtime, and installs simply enough to use right away, which makes it handy for developing and operating AI agents.
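
A minimal Python sketch of the spec-driven eval loop this describes; the spec fields, dataset shape, and regex evaluator below are invented for illustration and are not Apastra's actual schema:

import re

# Hypothetical spec mirroring the YAML/JSONL idea; every field name is assumed.
spec = {
    "prompt": "Summarize in one sentence: {text}",
    "cases": [  # in practice these rows would live in a JSONL dataset file
        {"text": "The cat sat on the mat.", "must_match": r"\bcat\b"},
        {"text": "Rust prevents data races.", "must_match": r"\bRust\b"},
    ],
}

def call_model(prompt: str) -> str:
    # Stand-in for a real model call; echoes its input so the demo runs offline.
    return prompt

def run_suite(spec: dict) -> int:
    failures = 0
    for i, case in enumerate(spec["cases"]):
        output = call_model(spec["prompt"].format(text=case["text"]))
        ok = re.search(case["must_match"], output) is not None
        print(f"case {i}: {'PASS' if ok else 'FAIL'}")
        failures += 0 if ok else 1
    return failures

# A non-zero exit code makes this usable as a CI regression gate.
raise SystemExit(run_suite(spec))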

https://github.com/BintzGavin/apastra

#aievaluation #prompttesting #agentdevelopment #regressiontesting #apastra

GitHub - BintzGavin/apastra: Lightweight prompt versioning, evals, benchmarks, and delivery

Pitch-Pit – AI rates your startup idea, crowd votes, top one gets built
Pitch-Pit is a platform where AI and the community evaluate startup ideas together, and the top-scoring idea gets built as an actual MVP. Users can submit two ideas per week; an AI scores each one across six dimensions based on YC office-hours evaluation criteria, and that score is combined with community votes into a final total. The winning idea is developed for free and released under the submitter's name. Because ideas can be submitted and evaluated without an interview or a deck, it is useful for validating and executing early-stage startup ideas.
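
The post doesn't give the exact formula, so here is a toy sketch of blending six 0-10 AI dimension scores with a 0-1 community vote share; the dimension names and the 70/30 split are assumptions:

def final_score(ai_dims: dict, vote_share: float, ai_weight: float = 0.7) -> float:
    # Mean of the six AI-rated dimensions (0-10), blended with the vote share
    # (0-1) rescaled to the same 0-10 range. The 70/30 weighting is invented.
    ai_mean = sum(ai_dims.values()) / len(ai_dims)
    return ai_weight * ai_mean + (1 - ai_weight) * vote_share * 10

idea = {"problem": 8, "market": 6, "team": 7, "traction": 5, "moat": 6, "clarity": 9}
print(round(final_score(idea, vote_share=0.42), 2))  # -> 6.04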

https://pitchpit.app

#startup #aievaluation #mvp #crowdvoting #innovation

Submit a startup idea. Claude rates it YC-style. Each week's winner gets built — for free.

ISD-Agent-Bench: A Comprehensive Benchmark for Evaluating LLM-based Instructional Design Agents
https://arxiv.org/abs/2602.10620
Code & data: https://github.com/codingchild2424/isd-agent-benchmark
"benchmark comprising 25,795 scenarios that combines 51 contextual variables across 5 categories with 33 ISD sub-steps derived from the ADDIE model."

From the same author: Pedagogy-R1: Pedagogical Large Reasoning Model and Well-balanced Educational Benchmark https://dl.acm.org/doi/10.1145/3746252.3761133
#AIEd #LearningDesign #AIevaluation #EdTech

Large Language Model (LLM) agents have shown promising potential in automating Instructional Systems Design (ISD), a systematic approach to developing educational programs. However, evaluating these agents remains challenging due to the lack of standardized benchmarks and the risk of LLM-as-judge bias. We present ISD-Agent-Bench, a comprehensive benchmark comprising 25,795 scenarios generated via a Context Matrix framework that combines 51 contextual variables across 5 categories with 33 ISD sub-steps derived from the ADDIE model. To ensure evaluation reliability, we employ a multi-judge protocol using diverse LLMs from different providers, achieving high inter-judge reliability. We compare existing ISD agents with novel agents grounded in classical ISD theories such as ADDIE, Dick & Carey, and Rapid Prototyping ISD. Experiments on 1,017 test scenarios demonstrate that integrating classical ISD frameworks with modern ReAct-style reasoning achieves the highest performance, outperforming both pure theory-based agents and technique-only approaches. Further analysis reveals that theoretical quality strongly correlates with benchmark performance, with theory-based agents showing significant advantages in problem-centered design and objective-assessment alignment. Our work provides a foundation for systematic LLM-based ISD research.
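
A hedged sketch of the multi-judge protocol the abstract describes: score each output with several independent judges, average the scores, and report agreement. The judge values are stand-ins, and using mean pairwise Pearson correlation as the reliability measure is an assumption, not the paper's exact metric:

from itertools import combinations
from statistics import correlation, mean  # correlation requires Python 3.10+

# Scores from three stand-in judges over the same four outputs.
judge_scores = {
    "judge_a": [4.0, 3.5, 2.0, 4.5],
    "judge_b": [4.2, 3.0, 2.5, 4.0],
    "judge_c": [3.8, 3.6, 1.8, 4.4],
}

# Final score per output: mean across judges.
final = [round(mean(col), 2) for col in zip(*judge_scores.values())]

# Inter-judge reliability: mean pairwise Pearson correlation.
pairs = combinations(judge_scores.values(), 2)
reliability = mean(correlation(a, b) for a, b in pairs)
print(final, round(reliability, 3))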

Implicator.ai released the AI Top 40, a weekly ranking that combines 10 benchmarks into a single score per language model. The system weights contamination-resistant tests like SWE-bench 4x higher than Chatbot Arena; GPT-5.4 currently leads despite Claude topping the Arena rankings. The list updates every Saturday and offers a free embed for websites.
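
The article only says "4x higher", so as a toy illustration of that weighting scheme (SWE-bench and Chatbot Arena come from the article; the third benchmark and all scores are made up):

# Normalized 0-100 scores, invented for illustration.
scores = {"swe_bench": 62.0, "chatbot_arena": 88.0, "mmlu": 85.0}
# Contamination-resistant tests get 4x weight, per the article.
weights = {"swe_bench": 4.0, "chatbot_arena": 1.0, "mmlu": 1.0}

total = sum(scores[k] * weights[k] for k in scores) / sum(weights.values())
print(round(total, 1))  # -> 70.2: the 4x-weighted test dominates the aggregate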

#AIBenchmarks #LanguageModels #AIEvaluation

https://www.implicator.ai/implicator-ai-launches-the-ai-top-40-ranking-llms-across-10-benchmarks-in-one-score/

Estonian Language in AI's Grasp: A Struggle for Authenticity

A new benchmark from the University of Tartu tests AI-generated Estonian and finds that it still sounds unnatural and 'wooden'. The researchers want models to sound like real speakers, a gap that directly affects Estonian chatbot users.

#EstonianAI #LanguageTech #AIEvaluation #SmallLanguage #UniversityOfTartu

https://newsletter.tf/estonian-ai-language-sounds-unnatural-new-test/

Design Arena (@Designarena)

Design Arena has released Audio Arena. With existing voice benchmarks nearing saturation, they have open-sourced a suite of 6 static multi-turn benchmarks for stress-testing speech-to-speech models in realistic scenarios.
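
A rough sketch of how a static multi-turn benchmark is usually driven: replay a fixed list of turns against the model, accumulate the dialogue history, and score every reply. Everything below is a generic illustration, not Audio Arena's actual harness:

def run_multiturn(model, scenario, scorer) -> float:
    # Replay fixed user turns with growing history; return mean per-turn score.
    history, scores = [], []
    for user_turn in scenario:
        history.append(("user", user_turn))
        reply = model(history)  # stand-in for a speech-to-speech call
        history.append(("assistant", reply))
        scores.append(scorer(user_turn, reply))
    return sum(scores) / len(scores)

# Trivial stand-ins so the sketch runs end to end.
echo_model = lambda history: history[-1][1].upper()
nonempty_scorer = lambda turn, reply: float(len(reply) > 0)
print(run_multiturn(echo_model, ["hello", "book a table for two"], nonempty_scorer))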

https://x.com/Designarena/status/2037622861897368006

#audio #benchmark #speechtospeech #opensource #aievaluation

Introducing Audio Arena. Most existing voice benchmarks are approaching saturation: frontier models are scoring 90%+ on nearly every category. Today we've open-sourced a suite of 6 static multi-turn benchmarks designed to stress-test speech-to-speech models on realistic scenarios.

Start your week off right with #enterpriseAI #changemanagement tips from IT leaders Juan Orlandini, Fabien CROS, Kulvir Gahunia and Dana Harrison. My in-depth look at how #gamification, #AIevaluation platforms, #platformengineering and other approaches helped companies such as Insight, Ducker Carlisle and TELUS adopt #AI effectively: https://www.techtarget.com/searchitoperations/news/366640354/IT-leaders-share-enterprise-AI-change-management-tips

Google Stax now uses an LLM as a judge, automatically scoring model outputs against your own criteria. This opens the door to open-source benchmarking, letting developers run fast, reproducible evaluations without hand-crafting metrics. Curious how it works and what it means for AI research? Dive in for the details. #LLMasJudge #AIevaluation #GoogleStax #PromptBenchmarking

🔗 https://aidailypost.com/news/google-stax-uses-llm-as-judge-autoevaluate-model-outputs-by-your
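
The post doesn't show Stax's API, but the LLM-as-judge pattern it describes looks roughly like this, with the model call stubbed out so the sketch runs:

JUDGE_PROMPT = """You are a strict evaluator. Criterion: {criterion}
Candidate answer: {answer}
Reply with a score from 1 to 5, digits only."""

def call_llm(prompt: str) -> str:
    # Stub; a real judge model client goes here.
    return "4"

def judge(answer: str, criterion: str) -> int:
    # Format the rubric into a prompt and parse the judge's numeric score.
    raw = call_llm(JUDGE_PROMPT.format(criterion=criterion, answer=answer))
    return int(raw.strip())

print(judge("Paris is the capital of France.", "factual accuracy"))  # -> 4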