Mastodawn

AVeriTeC (NeurIPS 2023): 4,568 real-world fact-checked claims, web-retrieved evidence, four-way labels, temporal-leak-free split.

Two structural gaps: gold answers are frozen but the retrieval surface isn't (two systems a year apart hit different Google), and the not-enough-evidence class rewards weak retrievers — predicting NEI when retrieval fails matches gold by coincidence.

https://benjaminhan.net/posts/20260507-averitec/?utm_source=mastodon&utm_medium=social

#Paper #Benchmark #FactVerification #NeurIPS #AI

AVeriTeC: A Dataset for Real-world Claim Verification with Evidence from the Web – synesis

A 4,568-claim fact-checking benchmark sourced from 50 real fact-checking organizations, with web-retrieved evidence, a 4-way verdict label including not-enough-evidence, and a temporal-leak-free split.

synesis

sayzard 2h ago

Apple MLX vs. llama.cpp: compared and benchmarked [video]

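In a benchmark video published by Protorikis, Apple MLX and llama.cpp (including the GGUF runtime) are compared in real-world usage scenarios. The tests ran on a MacBook Pro M3 Max with the Qwen3.6 35B model. MLX showed speedups in certain situations, but the video also found missing prompt caching, memory pressure, and unstable performance. Ollama's MLX engine (including NVFP4) and the LM Studio backend were compared as well, offering practical insight for choosing between GGUF and MLX. The video is useful for AI developers who want to understand the real performance differences between MLX and llama.cpp.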

https://www.youtube.com/watch?v=ZwCbChJWXkQ

#applemlx #llama.cpp #benchmark #runtime #gguf

Apple MLX vs llama.cpp: Which is Really Faster? (4 Runtimes - Ollama Included)

YouTube

sayzard 9h ago

Chasing AI Memory SOTA: Beating the Benchmark, Missing the Point

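This article argues that the latest benchmark scores for AI memory systems fail to reflect real-world performance. It analyzes in detail the limitations of the leading memory benchmarks LoCoMo and LongMemEval: artificial datasets, ambiguous evaluation criteria, and a failure to test the range of memory capabilities that production environments require. It also stresses that benchmark results depend heavily on hyperparameter settings, the judge model, and similar factors, which makes comparison difficult. It concludes that a SOTA score is only an experimental result under limited conditions and does not guarantee that real memory problems are solved or that user experience improves.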

https://xmemory.ai/chasing-sota-in-ai-memory/

#aimemory #benchmark #evaluation #longmemeval #locomo

Chasing AI memory SOTA: Beating the Benchmark, Missing the Point

Why agentic memory benchmark numbers can be noisy, and what we should measure instead.

xmemory Website

sayzard 18h ago

fly51fly (@fly51fly)

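Meta FAIR researchers have released ProgramBench, which evaluates whether language models can rebuild programs from scratch. As a benchmark that measures code generation and reconstruction ability, it is an important resource for assessing models' practical programming ability.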

https://x.com/fly51fly/status/2052137222384853488

#programbench #languagemodels #codegeneration #benchmark #meta

fly51fly (@fly51fly) on X

[AI] ProgramBench: Can Language Models Rebuild Programs From Scratch? J Yang, K Lieret, J Ma, P Thakkar… [Meta FAIR] (2026) https://t.co/VEkc5PeIwh

X (formerly Twitter)

sayzard 19h ago

Sudo su (@sudoingX)

Presents a case in which a 27B local model writes its own benchmark report. Carnice-v2 27B finds its hardware, model file, and llama.cpp commit and carries out a self-evaluation, showing the potential of local agentic AI.

https://x.com/sudoingX/status/2052051592770469894

#localmodel #benchmark #agenticai #llamacpp #qwen

Sudo su (@sudoingX) on X

watching a 27b local model write its own benchmark report just now and i'm sitting with this for a sec. gave carnice-v2 27b (kaios SFT on qwen 3.6 dense, trained on hermes agent traces) a self-report card task, find your hardware, find your model file, find the llama.cpp commit

X (formerly Twitter)

heise online English 1d ago

Most important server CPU benchmark gets an update after 9 years

CPU designers are switching to the SPEC CPU 2026 benchmark. The new version even runs on a Raspberry Pi.

https://www.heise.de/en/news/Most-important-server-CPU-benchmark-gets-an-update-after-9-years-11284559.html?wt_mc=sm.red.ho.mastodon.mastodon.md_beitraege.md_beitraege&utm_source=mastodon

#Benchmark #Prozessoren #IT #RaspberryPi #Server #news

heise online

Nils 1d ago

Small win of the day: https://hub.docker.com/r/ahpnils/phoronix-test-suite I still need to make a web page for the project, and a readme on https://framagit.org/nils/ahp-pts but the images are usable, even on #raspberrypi v1 🙂 #benchmark #phoronix #docker #podman

ahpnils/phoronix-test-suite - Docker Image

David Croyle 1d ago

#PhotoOfTheDay is an aisle of the old Map Room at the USGS West Coast Headquarters in Menlo Park, California. This place was like a temple to topographic maps, with some very old handmade benchmark disks on display too.

Sadly, it's been gone for a while now. We did an interview with the head of mapping there (I was helping to make a benchmarking video) and he explained that he had none of his people left that did actual field work. :(

#photo #photography #maps #usgs #benchmark #explore

sayzard 1d ago

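How AI Benchmarks Work – and When Scores Mislead

This article explains how AI benchmarks work and why benchmark scores are sometimes misleading. Benchmark scores are an important measure of model performance, but data overlap (contamination), score saturation, and score manipulation (gaming) can cause them to diverge from real-world performance. It stresses that strict control and verification of the test environment are essential for obtaining trustworthy scores. It also lays out the limitations of benchmarks and concrete ways to overcome them.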

https://agent-benchmarks.com/

#ai #benchmark #evaluation #modelperformance #testing

Learn how AI benchmarks work

heiseonline 1d ago

CPU designers are switching to the SPEC CPU 2026 benchmark. The new version even runs on a Raspberry Pi. #Benchmark

Most important server CPU benchmark gets an update after 9 years

heise online