Game Arena just released a chess benchmark to probe AI strategic reasoning. It pits large language models against each other in head-to-head games, offering a transparent way to evaluate LLM capabilities beyond standard tests. Curious how your favorite model stacks up? Dive into the details and see the results. #GameArena #ChessBenchmark #StrategicReasoning #LLMEvaluation
https://aidailypost.com/news/game-arena-launches-chess-benchmark-test-ai-strategic-reasoning
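The post doesn't describe Game Arena's harness, but a head-to-head chess evaluation generally reduces to prompting two models for moves in turn and validating each move against the game state. Below is a minimal sketch of that idea using the python-chess library; the `query_model` helper, the prompt format, and the forfeit-on-illegal-move rule are all assumptions for illustration, not Game Arena's actual implementation.

```python
# Minimal head-to-head chess eval sketch (not Game Arena's actual harness).
# Assumes a query_model(model_name, prompt) -> str helper that calls an LLM API.
import chess  # pip install python-chess

def play_game(model_white: str, model_black: str, max_plies: int = 200) -> str:
    board = chess.Board()
    models = {chess.WHITE: model_white, chess.BLACK: model_black}
    for _ in range(max_plies):
        if board.is_game_over():
            break
        prompt = (
            f"You are playing chess as {'White' if board.turn else 'Black'}.\n"
            f"Position (FEN): {board.fen()}\n"
            "Reply with a single legal move in UCI notation (e.g. e2e4)."
        )
        reply = query_model(models[board.turn], prompt).strip()  # assumed helper
        try:
            move = chess.Move.from_uci(reply)
            if move not in board.legal_moves:
                raise ValueError
        except ValueError:
            # One simple scoring rule: an illegal move forfeits the game.
            return f"forfeit: illegal move '{reply}' by {models[board.turn]}"
        board.push(move)
    return board.result()  # "1-0", "0-1", "1/2-1/2", or "*" if unfinished
```

Running many such games between model pairs and aggregating results (e.g. into Elo-style ratings) is the usual way a head-to-head benchmark turns individual games into a leaderboard.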
[Cal Newport on why AI agents' 2025 promises fell short]
Cal Newport analyzes why the transformative productivity gains that prominent figures, including OpenAI's Sam Altman, predicted AI agents would deliver in 2025 never materialized. The main reasons: agents failed even on simpler tasks than expected, their abilities transferred poorly beyond programming, and LLM-based technology carries inherent constraints. The piece frames AI agents as evolving gradually rather than advancing dramatically, and stresses that 2026 will call for a more sober assessment of what AI can actually do.
https://news.hada.io/topic?id=25689
#aiagent #calnewport #ailimitations #llmevaluation #overpromising
[Anthropic Engineering: a practical guide and methodology for AI agent evals]
Anthropic has laid out an evaluation methodology for measuring AI agent performance accurately. Going beyond simple benchmarks, it proposes combining unit tests with integration tests, and mixing deterministic grading with model-based grading, to assess an agent's ability to carry out complex tasks that involve using tools and changing its environment.
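To make the "deterministic plus model-based grading" idea concrete, here is a minimal sketch of a hybrid grader. The `AgentRun` structure, the `llm_judge` helper, and the 50/50 weighting are illustrative assumptions, not Anthropic's actual eval code.

```python
# Sketch of hybrid agent grading: deterministic checks on the environment's end
# state plus a model-based rubric score. llm_judge(transcript, rubric) -> float
# is a hypothetical helper that asks a grader LLM for a 0.0-1.0 score.
from dataclasses import dataclass

@dataclass
class AgentRun:
    transcript: str       # full tool-use / reasoning trace
    files_written: dict   # environment state the agent was asked to change

def grade_run(run: AgentRun, expected_path: str, expected_substring: str) -> dict:
    # Deterministic, unit-test-style checks: did the agent actually change the
    # environment the way the task required?
    file_ok = expected_path in run.files_written
    content_ok = file_ok and expected_substring in run.files_written[expected_path]

    # Model-based, integration-test-style check: is the transcript on-task,
    # coherent, and free of claims about actions the agent never performed?
    rubric = (
        "Score 0-1: did the agent follow the task, use tools correctly, "
        "and avoid claiming actions it never performed?"
    )
    judge_score = llm_judge(run.transcript, rubric)  # assumed helper

    deterministic_pass = file_ok and content_ok
    return {
        "deterministic_pass": deterministic_pass,
        "judge_score": judge_score,
        "overall": (1.0 if deterministic_pass else 0.0) * 0.5 + judge_score * 0.5,
    }
```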
Hallucinations pose a significant obstacle to the reliability and widespread adoption of language models, yet their accurate measurement remains a persistent challenge. While many task- and domain-specific metrics have been proposed to assess faithfulness and factuality concerns, the robustness and generalization of these metrics are still untested. In this paper, we conduct a large-scale empirical evaluation of 6 diverse sets of hallucination detection metrics across 4 datasets, 37 language models from 5 families, and 5 decoding methods. Our extensive investigation reveals concerning gaps in current hallucination evaluation: metrics often fail to align with human judgments, take an overly myopic view of the problem, and show inconsistent gains with parameter scaling. Encouragingly, LLM-based evaluation, particularly with GPT-4, yields the best overall results, and mode-seeking decoding methods seem to reduce hallucinations, especially in knowledge-grounded settings. These findings underscore the need for more robust metrics to understand and quantify hallucinations, and better strategies to mitigate them.
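The abstract's most encouraging finding is that LLM-based evaluation aligns best with human judgments. A minimal sketch of that LLM-as-judge style of hallucination scoring follows; the prompt wording, the binary verdict, and the `chat` helper are assumptions for illustration, not the paper's protocol.

```python
# Illustrative LLM-as-judge hallucination check (not the paper's exact protocol).
# chat(model, messages) -> str is an assumed wrapper around a chat-completions API.
def judge_faithfulness(source: str, summary: str, model: str = "gpt-4") -> int:
    prompt = (
        "You are checking a summary against its source document.\n"
        f"Source:\n{source}\n\nSummary:\n{summary}\n\n"
        "Does the summary contain any claim not supported by the source? "
        "Answer with a single word: 'faithful' or 'hallucinated'."
    )
    verdict = chat(model, [{"role": "user", "content": prompt}]).strip().lower()
    return 0 if "hallucinated" in verdict else 1  # 1 = judged faithful

def hallucination_rate(pairs: list[tuple[str, str]]) -> float:
    # Fraction of (source, summary) pairs the judge flags as hallucinated.
    flagged = sum(1 - judge_faithfulness(src, summ) for src, summ in pairs)
    return flagged / len(pairs) if pairs else 0.0
```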
GPT-5 got jailbroken in less than 24 hours. If SOTA models aren't safe, what does that say about yours?
The pace of AI advancement is breathtaking. But security vulnerabilities are advancing just as fast. Evaluate your LLM agents with Giskard.
Request a trial of our AI red teaming platform: https://www.giskard.ai/contact
At Giskard, we've integrated LMEval into our Phare LLM benchmark (phare.giskard.ai) to independently evaluate popular models on security and safety through rigorous testing.
Read the announcement: https://opensource.googleblog.com/2025/05/announcing-lmeval-an-open-ource-framework-cross-model-evaluation.html
How do you measure the effectiveness of a Large Language Model (LLM)?
From accuracy to adaptability, our latest blog explores key evaluation metrics to ensure your GenAI system delivers real value: https://ter.li/2td617
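As a concrete starting point for the metrics the blog alludes to, here is a minimal sketch of the simplest one, exact-match accuracy over a small eval set. The `generate` helper, the dataset format, and the whitespace/case normalization are assumptions for illustration.

```python
# Minimal exact-match accuracy over an eval set of (prompt, reference) pairs.
# generate(prompt) -> str is an assumed wrapper around whichever LLM you evaluate.
def normalize(text: str) -> str:
    # Collapse whitespace and lowercase so trivial formatting differences don't count.
    return " ".join(text.lower().split())

def exact_match_accuracy(eval_set: list[tuple[str, str]]) -> float:
    hits = sum(
        normalize(generate(prompt)) == normalize(reference)
        for prompt, reference in eval_set
    )
    return hits / len(eval_set) if eval_set else 0.0

# Usage sketch:
# accuracy = exact_match_accuracy([("2+2=", "4"), ("Capital of France?", "Paris")])
```

Exact match only covers the "accuracy" end of the spectrum; dimensions like adaptability or robustness typically need task-specific checks or model-based grading on top of it.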