MIT study: Chatbot Arena's #1 model flips with just 2 votes removed
MIT researchers uncover a striking fragility in LLM ranking platforms: removing just 2 of 57,000 votes is enough to change the top-ranked model. An analysis of the phenomenon and what it means.
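To see why a single pair of votes can decide the top spot, here is a minimal sketch (not the MIT team's actual method; all model names and vote counts are invented) that fits Bradley-Terry strengths, the kind of pairwise-comparison model behind Arena-style leaderboards, and shows the leader flipping after just two votes are dropped:

```python
# A minimal sketch, not the MIT team's actual method: fit Bradley-Terry
# strengths and watch the leader flip after two votes are removed.
# All model names and vote counts below are invented.
from collections import defaultdict

def bradley_terry(wins, iters=200):
    """wins[(i, j)] = number of votes where model i beat model j."""
    models = sorted({m for pair in wins for m in pair})
    n = defaultdict(int)                       # battles per unordered pair
    for (i, j), w in wins.items():
        n[frozenset((i, j))] += w
    p = {m: 1.0 for m in models}               # strength parameters
    for _ in range(iters):                     # standard MM updates
        new_p = {}
        for i in models:
            w_i = sum(w for (a, _), w in wins.items() if a == i)
            denom = sum(n[frozenset((i, j))] / (p[i] + p[j])
                        for j in models if j != i and n[frozenset((i, j))])
            new_p[i] = w_i / denom if denom else p[i]
        total = sum(new_p.values())
        p = {m: v / total for m, v in new_p.items()}
    return p

# Two near-tied leaders plus a weaker third model.
votes = {("A", "B"): 500, ("B", "A"): 499,
         ("A", "C"): 300, ("C", "A"): 100,
         ("B", "C"): 300, ("C", "B"): 100}

before = bradley_terry(dict(votes))
votes[("A", "B")] -= 2                         # remove just two votes
after = bradley_terry(votes)

leader = lambda scores: max(scores, key=scores.get)
print("leader before:", leader(before))        # -> A
print("leader after :", leader(after))         # -> B
```

The flip happens because the top two models are nearly tied, so the ranking boundary sits inside the noise of a handful of votes.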
LMArena Gets $100M at $600M Valuation for AI Model Testing
#AI #LMArena #AIFunding #ChatbotArena #AIBenchmarks #UCBerkeley
https://winbuzzer.com/2025/05/21/lmarena-gets-100m-at-600m-valuation-for-ai-model-testing-xcxwbn/
"Measuring progress is fundamental to the advancement of any scientific field. As benchmarks play an increasingly central role, they also grow more susceptible to distortion. Chatbot Arena has emerged as the go-to leaderboard for ranking the most capable AI systems. Yet, in this work we identify systematic issues that have resulted in a distorted playing field. We find that undisclosed private testing practices benefit a handful of providers who are able to test multiple variants before public release and retract scores if desired. We establish that the ability of these providers to choose the best score leads to biased Arena scores due to selective disclosure of performance results. At an extreme, we identify 27 private LLM variants tested by Meta in the lead-up to the Llama-4 release. We also establish that proprietary closed models are sampled at higher rates (number of battles) and have fewer models removed from the arena than open-weight and open-source alternatives. Both these policies lead to large data access asymmetries over time. Providers like Google and OpenAI have received an estimated 19.2% and 20.4% of all data on the arena, respectively. In contrast, a combined 83 open-weight models have only received an estimated 29.7% of the total data. We show that access to Chatbot Arena data yields substantial benefits; even limited additional data can result in relative performance gains of up to 112% on the arena distribution, based on our conservative estimates. Together, these dynamics result in overfitting to Arena-specific dynamics rather than general model quality. The Arena builds on the substantial efforts of both the organizers and an open community that maintains this valuable evaluation platform. We offer actionable recommendations to reform the Chatbot Arena’s evaluation framework and promote fairer, more transparent benchmarking for the field."
https://arxiv.org/abs/2504.20879
#AI #GenerativeAI #LLMs #Chatbots #ChatbotArena #Llama #Meta #OpenSource
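The best-of-N selection bias described in the abstract is easy to simulate. A minimal Monte Carlo sketch (all numbers are hypothetical; this illustrates the statistical effect, not the paper's analysis):

```python
# If a provider privately tests N equally strong variants and publishes
# only the best score, the published score is inflated purely by noise.
# The true score, noise level, and variant counts here are hypothetical.
import random

TRUE_SCORE = 1200   # hypothetical true Arena score of every variant
SIGMA = 15          # hypothetical measurement noise per variant
TRIALS = 20_000

def published_score(n_variants: int) -> float:
    """Average published score when only the best of N variants is disclosed."""
    total = 0.0
    for _ in range(TRIALS):
        scores = [random.gauss(TRUE_SCORE, SIGMA) for _ in range(n_variants)]
        total += max(scores)   # selective disclosure: report only the best
    return total / TRIALS

for n in (1, 3, 10, 27):       # 27 mirrors the Llama-4 variant count above
    print(f"N={n:2d} variants -> expected published score {published_score(n):7.1f}")
```

The expected maximum of N i.i.d. noisy scores grows with N even though every variant has the same true skill, which is exactly why choosing the best score biases the leaderboard.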
Experts Challenge Validity and Ethics of Crowdsourced AI Benchmarks Like LMArena (Chatbot Arena)
#AI #AIBenchmarks #AIModels #LMArena #ChatbotArena #AIethics #LLMs #AIEvaluation #Crowdsourcing #GenAI
AI Benchmarking Platform Chatbot Arena Forms New Company, Launches LMArena
#AI #GenAI #LLMs #AIChatbots #LMArena #ChatbotArena #AIBenchmarks #AIModels #AIevaluation
Wow! I didn't really like Gemma 2, but Gemma 3, released today, is awesome. It comes in four sizes: 1B, 4B, 12B, and 27B. It's super fast, and except for the 1B version it can even handle images.
The 27B version apparently outperforms both DeepSeek-V3 and Llama 3 405B on the Chatbot Arena benchmark.
It's also the first small model I've tested that's good at German.
#gemma3 #gemma #gemma2 #google #ai #programming #model #local #gemini #multimodal #vision #wow #chatbotarena #german
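For anyone who wants to try it locally, a minimal sketch using Hugging Face transformers (the text-only 1B instruct checkpoint and the chat-style pipeline call are assumptions; you need a recent transformers release with Gemma 3 support and an accepted model license):

```python
from transformers import pipeline

# Assumptions: the text-only "google/gemma-3-1b-it" checkpoint, a recent
# transformers release with Gemma 3 support, and an accepted Gemma license.
chat = pipeline("text-generation", model="google/gemma-3-1b-it")

# Quick German test, since that's what stood out about this model family.
messages = [{"role": "user",
             "content": "Erkläre in zwei Sätzen, was ein Large Language Model ist."}]
reply = chat(messages, max_new_tokens=128)
print(reply[0]["generated_text"][-1]["content"])   # the model's German answer
```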
#ChatbotArena Italia is a platform whose goal is to compare and evaluate Large Language Models on the Italian language. 🤖🇮🇹
To take part, just submit a prompt to two #AI models picked at random by the system and vote for the better one. There's a leaderboard too!
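Under the hood, such battle-and-vote leaderboards typically combine random pairing with a rating update. A minimal sketch with invented model names and skills and a plain online Elo update (LMArena itself fits Bradley-Terry-style ratings offline, so this is only illustrative):

```python
# Illustrative battle-and-vote loop: random pairing plus an online Elo
# update. Model names, skills, and the K factor are all invented.
import random

ratings = {"model-a": 1000.0, "model-b": 1000.0, "model-c": 1000.0}
true_skill = {"model-a": 0.7, "model-b": 0.5, "model-c": 0.3}  # hypothetical
K = 32  # Elo update step

def run_battle() -> None:
    a, b = random.sample(list(ratings), 2)     # two models chosen at random
    # In the real platform, the same prompt goes to both models and a human
    # votes; here a hidden skill gap stands in for the user's vote.
    a_wins = random.random() < true_skill[a] / (true_skill[a] + true_skill[b])
    expected_a = 1 / (1 + 10 ** ((ratings[b] - ratings[a]) / 400))
    delta = K * ((1.0 if a_wins else 0.0) - expected_a)
    ratings[a] += delta
    ratings[b] -= delta

for _ in range(2000):
    run_battle()

for model, r in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {r:4.0f}")                # the leaderboard
```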
OpenAI upgrades GPT-4o: a more creative AI writer
#AI #ChatbotArena #ChatGPT #Gemini #GenAI #Google #GPT4o #IntelligenzaArtificiale #LLM #Notizie #Novità #OpenAI #TechNews #Tecnologia
https://www.ceotech.it/openai-potenzia-gpt-4o-scrittore-ai-piu-creativo/