New benchmark reveals that top multimodal models still stumble below 50% accuracy on basic visual entity tasks. The gap highlights limits in current vision‑language training and raises questions about real‑world reliability. Dive into the findings and what they mean for future AI research. #MultimodalLearning #VisionLanguage #EntityRecognition #AIBenchmarking

🔗 https://aidailypost.com/news/top-multimodal-models-fail-exceed-50-accuracy-basic-visual-entity

TechFollow (@TechFollowrazzi)

Micah Hill-Smith is the co-founder and CEO of ArtificialAnlys, which runs an independent AI benchmarking platform that helps teams choose the best models and API providers for their specific use cases. The service specializes in model evaluation and comparison.

https://x.com/TechFollowrazzi/status/2017990678022676755

#aibenchmarking #modelevaluation #aitools #mlops


🚨 @swyx followed @_micah_h Micah Hill-Smith, Co-founder & CEO of @ArtificialAnlys, runs an independent AI benchmarking platform that helps teams pick the best models and API providers for their use cases.


Samsung just dropped TRUEBench, a new benchmark designed to actually measure how useful enterprise AI models are in the real world, not just how smart they sound on paper. Multilingual, real-task focused, and even co-developed by AI.

Finally, a benchmark that speaks fluent business. Check out the details: https://www.artificialintelligence-news.com/news/samsung-benchmarks-real-productivity-enterprise-ai-models/

What's your biggest AI productivity hurdle? #AIBenchmarking #EnterpriseAI #Samsung #LLMs #TechNews

ZDNet: ‘Humanity’s Last Exam’ benchmark is stumping top AI models – can you do any better? “On Thursday, Scale AI and the Center for AI Safety (CAIS) released Humanity’s Last Exam (HLE), a new academic benchmark aiming to ‘test the limits of AI knowledge at the frontiers of human expertise,’ Scale AI said in a release. The test consists of 3,000 text and multi-modal questions on more than […]

https://rbfirehose.com/2025/01/28/zdnet-humanitys-last-exam-benchmark-is-stumping-top-ai-models-can-you-do-any-better/


#AI #GenerativeAI #LLMs #AIBenchmarking: "Technology companies are locked in a frenzied arms race to release ever-more powerful artificial intelligence tools. To demonstrate that power, firms subject the tools to question-and-answer tests known as AI benchmarks and then brag about the results.

Google’s CEO, for example, said in December that a version of the company’s new large language model Gemini had “a score of 90.0%” on a benchmark known as Massive Multitask Language Understanding, making it “the first model to outperform human experts” on it. Not to be upstaged, Meta CEO Mark Zuckerberg was soon bragging that the latest version of his company’s Llama model “is already around 82 MMLU”

The problem, experts say, is that this test and others like it don’t tell you much, if anything, about an AI product — what sorts of questions it can reliably answer, when it can safely be used as a substitute for a human expert, or how often it avoids “hallucinating” false answers. “The yardsticks are, like, pretty fundamentally broken,” said Maarten Sap, an assistant professor at Carnegie Mellon University and co-creator of a benchmark. The issues with them become especially worrisome, experts say, when companies advertise the results of evaluations for high-stakes topics like health care or law."

https://themarkup.org/artificial-intelligence/2024/07/17/everyone-is-judging-ai-by-these-tests-but-experts-say-theyre-close-to-meaningless

Everyone Is Judging AI by These Tests. But Experts Say They’re Close to Meaningless – The Markup

Benchmarks used to rank AI models are several years old, often sourced from amateur websites, and, experts worry, lending automated systems a dubious sense of authority