Mastodawn

The Proof That Didn't Move the Score

Nine mathematicians signed [a 19-page companion paper](https://arxiv.org/abs/2605.20695) on Wednesday confirming that an internal OpenAI reasoning model — general-purpose, not math-specialised — prod

https://blog.codeland.org/posts/the-proof-that-didn-t-move-the-score/

#AI #LLM #OpenAI #Reasoning #Benchmarks

Remarks on the disproof of the unit distance conjecture

We present a short, digested, human-verified version of the recent OpenAI-generated counterexample to the Erdős unit distance conjecture, and a sequence of reflections on it. The argument relies crucially on ideas that may, at least in retrospect, be attributed to Ellenberg-Venkatesh, Golod-Shafarevich, and Hajir-Maire-Ramakrishna.

arXiv.org

sayzard 2d ago

DC (@vibecoder_dc)

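A critique that AI benchmarks fail to reflect real-world quality and that everyone keeps looking at the same few metrics. The view expressed is that, in model evaluation, actual usage experience and output matter more than headline scores.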

https://x.com/vibecoder_dc/status/2056802385620848978

#benchmarks #evaluation #llm #ai

DC (@vibecoder_dc) on X

@daniel_mac8 Benchmarks are basically the 'Tesco Meal Deal' of AI metrics. Everyone reports the same 3 things, but nobody actually cares if the sandwich tastes like cardboard until they're halfway through the first bite.

X (formerly Twitter)

sayzard 2d ago

chair (@tablefourthree)

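Regrets that Gemini 3.5 Pro was not released, while noting that, going by ARC-AGI results, Gemini 3.5 Flash delivers performance close to Gemini 3.1 Pro at almost the same price. The verdict, however, is that in practice it feels like a repackaged, cheaper 3.1.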

https://x.com/tablefourthree/status/2056815481068355676

#gemini #arcagi #benchmarks #pricing #llm

chair (@tablefourthree) on X

@daniel_mac8 @karpathy If they had released Gemini 3.5 Pro maybe that would have been the bigger news? But if you look at the ARC-AGI results, Gemini 3.5 Flash gets close to Gemini 3.1 Pro results… for almost the same price. Gemini 3.5 Flash just feels like a reskinned slightly cheaper Gemini 3.1

X (formerly Twitter)

sayzard 2d ago

ElementZero79 (@ElementZero79)

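Claims that Gemini 3.5 Flash is at GPT-5.5/Claude Opus 4.7 level on benchmarks, and takes a wait-and-see stance on whether that holds in real use. From a developer's perspective, the gap between reasoning quality and benchmark scores still needs to be checked.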

https://x.com/ElementZero79/status/2056829365514584514

#gemini #flash #benchmarks #llm #google

ElementZero79 (@ElementZero79) on X

@daniel_mac8 gemini 3.5 flash is ~gpt-5.5/~opus 4.7 level in benchmarks. Hopefully in reality as well. We shall see 🙂

X (formerly Twitter)

Techino 2d ago

🤖 AI AGENTS

Open Agent Leaderboard: good start, but what's the incentive to game it? Seems like optimizing for benchmarks could quickly diverge from real-world usefulness. Thoughts?

#AI #AIAgents #Benchmarks #OpenSource

Politico.eu (Unofficial RSS) 3d ago

Why running Britain is so hard, no matter who does it https://www.politico.eu/article/why-running-britain-hard-no-matter-who-does-it/?utm_source=RSS_Feed&utm_medium=RSS&utm_campaign=RSS_Syndication #FinancialServicesUK #EnergyandClimateUK #Energysecurity #Globaleconomy #Interestrates #Manufacturing #WarinUkraine #TechnologyUK #Immigration #Benchmarks #Referendum #Elections #Inflation #WarinIran #Military #NorthSea #Pensions #Security #Finance #Imports #Markets #Welfare #TradeUK #Brexit #Budget

Why running Britain is so hard, no matter who does it

In the past 10 years, six people have had the job of running the United Kingdom. All failed to turn the country around. Good luck Wes, Andy, or Angela — you’re going to need it.

POLITICO

Nils 4d ago

Continuing the adventures with the homemade #phoronix #conteneur (container), plus some #benchmarks results! Happening right now on twitch.tv/ahp_nils #raspbberrypi #podman #docker

Politico.eu (Unofficial RSS) 6d ago

Starmer drama has UK markets reliving their Truss nightmare https://www.politico.eu/article/keir-starmer-uk-labour-leadership-turmoil-economy-pressures-pound-slump-stock-market/?utm_source=RSS_Feed&utm_medium=RSS&utm_campaign=RSS_Syndication #FinancialServicesUK #Interestrates #Benchmarks #inequality #Investment #Resilience #Elections #Inflation #Politics #Imports #Markets #Budget #Energy #Mayors #Rights #Aging #Banks #Bonds #Media #Trade #Debt #Oil #War #UK

Starmer drama has UK markets reliving their Truss nightmare

U.K. politicians yet again risk shooting the economy in the foot in a fight for control of the government.

POLITICO

Hacker News 6d ago

Find the best local LLM for your hardware, ranked by benchmarks

https://github.com/Andyyyy64/whichllm

#HackerNews #localLLM #hardware #benchmarks #AItools #machinelearning #GitHub

GitHub - Andyyyy64/whichllm: Find the local LLM that actually runs — and performs best — on your hardware. Ranked by real, recency-aware benchmarks, not parameter count. One command, run it instantly.


GitHub

Hacker News May 13

MacBook Neo Deep Dive: Benchmarks, Wafer Economics, and the 8GB Gamble

https://www.jdhodges.com/blog/macbook-neo-benchmarks-analysis/

#HackerNews #MacBookNeo #Benchmarks #WaferEconomics #8GBGamble #TechAnalysis

MacBook Neo Processor Benchmarks: A18 Pro CPU vs M1 and M4

A18 Pro processor in the $599 MacBook Neo. Geekbench 6 scores 3,569 single-core, between M3 and M4. Full CPU benchmarks, power draw, and thermal analysis.

J.D. Hodges