Built a panel of models to judge each other. It ran for a week. Dashboard was green. Then I read what was actually winning. Not the sharpest answers. The ones that sounded like the judges. A popularity contest wearing a rubric. The fix: take away the name tags. Karpathy had already shipped it. I just had to admit what I had built.

https://praveenlavu.com/dispatch/anonymized-peer-review

#llm #evaluation #machinelearning #aiengineering #buildinpublic

LLM Self-Preference Bias: How Anonymized Peer Review Fixes It · Praveen Lavu

I wired a multi-model judge panel to pick the best output, and it spent a week quietly voting for its own architecture family. Not the best answer. The one that sounded like the judge. Here is the night I caught it, and the one-line fix that a researcher named Karpathy had already shipped.

Praveen Lavu
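
A minimal sketch of the "take away the name tags" idea described in the post above, in Python. This is not the author's code or Karpathy's implementation. The function names `anonymize` and `tally`, the `Response A` style labels, and the model names are hypothetical, chosen only for illustration. It assumes each judge sees only the blinded labels and returns the label it prefers.

```python
import random
from collections import Counter


def anonymize(outputs, rng):
    """Shuffle candidate outputs and swap model names for neutral labels.

    Returns the blinded outputs plus a private key (label -> model name)
    that is never shown to the judges.
    """
    names = list(outputs)
    rng.shuffle(names)  # the label no longer hints at which model wrote what
    key = {f"Response {chr(ord('A') + i)}": name for i, name in enumerate(names)}
    blinded = {label: outputs[name] for label, name in key.items()}
    return blinded, key


def tally(votes, key):
    """Map the judges' label votes back to model names after voting ends."""
    return Counter(key[label] for label in votes)


# Hypothetical model names and answers, for illustration only.
outputs = {
    "model_x": "Answer written by model X",
    "model_y": "Answer written by model Y",
    "model_z": "Answer written by model Z",
}
blinded, key = anonymize(outputs, random.Random(0))

# Stand-in for the judging step: each judge returns one blinded label.
votes = ["Response A", "Response A", "Response C"]
winners = tally(votes, key)
```

Hiding the name does not hide the writing style, so a judge may still recognise a familiar voice. Reshuffling separately for each judge is one way to also keep answer position from leaking a signal.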

Demokratie leben!: More than 115 million euros for evaluation and research

More than 115 million euros will go into evaluating the federal programme Demokratie leben! in 2025-2026. The Deutsches Jugendinstitut (German Youth Institute) receives the largest share of the funds. How does this intensive scrutiny affect practical work on the ground? #Politik #Bundesprogramm #Evaluation #Demokratieförderung #Forschung #Bundestag #drucksachlich
https://drucksachlich.de/demokratie-leben-ueber-115-mio-euro-fuer-evaluation-und-forschung/

Technology evaluation tool: HU score

The HU Score: evaluate your technology before it evaluates you. Have you ever had that unpleasant feeling? You put your phone down after twenty minutes of scrolling without really knowing what you were looking for or what you got out of it. Or you realize, while reading an app's terms of use, that you have no idea what it does with your photos, your messages, your location. It is not a question of willpower or skills […]

https://hu-tech.ch/outil-devaluation-dune-technologie-hu-score/

RT @usr_bin_roygbiv: TRANSLATION: I will start publishing all my OMP and harness evaluation results here in real time:

more at Arint.info

#AI #Evaluation #HarnessEvals #Omp #RealTime #RoyBench #arint_info

https://x.com/usr_bin_roygbiv/status/2066295108870320452#m

Arint - SEO+KI (@[email protected])


The French Army has evaluated the 120MC, a new mobile 120 mm mortar system offered by Thales

https://fed.brid.gy/r/https://www.opex360.com/2026/06/14/larmee-de-terre-a-evalue-le-120mc-un-nouveau-systeme-de-mortier-de-120-mm-mobile-propose-par-thales/

Great job #IEG on the #evaluation of the World Bank Group Strategy for Fragility, Conflict, and Violence (FCV), 2020–25. Love the 2-pagers with QR codes linking to the full report. As for the findings: I think you could replace #FCV with 'gender' or 'indigenous peoples' and they would be similar.
https://ieg.worldbankgroup.org/sites/default/files/Data/Evaluation/files/snapshot-evaluation-FCV_Strategy.pdf?deliveryName=DM283584

Build 2026: From observability to ROI for AI agents on any framework | Microsoft Foundry Blog

9 min read · June 3, 2026 · Sebastian Kohlmeier · Shipping an AI agent is the easy part. Keeping it accurate, safe, and accountable in production is

Microsoft Foundry Blog

Man charged in attempted arson of Montreal-area synagogue awaiting psychological evaluation
A Quebec court judge postponed the suspect’s hearing Monday morning because a psychological evaluation requested by the Crown had not been completed.
https://www.cbc.ca/news/canada/montreal/synagogue-arson-westmount-montreal-9.7227159?cmp=rss

How much does it tell you when an AI solves a hard math problem? Forty-nine mathematicians built 100 research-level questions with known answers. With a solve defined as at least one correct run out of twenty, frontier models left only two questions unsolved after many tries and heavy-thinking modes. But many of those solves came on just one to four of the twenty runs, so the count treats a lucky hit the same as a reliable answer.

https://benjaminhan.net/posts/20260606-benchmarks-in-leipzig/?utm_source=mastodon&utm_medium=social

#AI #AIforScience #Mathematics #Evaluation

Benchmarks in Leipzig – synesis

Forty-nine mathematicians built 100 research-level math questions with known answers, and frontier LLMs left 41 unsolved on a single attempt but only 2 once given many tries and heavy-thinking modes.

synesis
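
A small Python sketch of why the counting rule in the post above matters. The per-question counts are made up for illustration and are not the benchmark's data. The function name `summarize` and the 50% reliability threshold are assumptions, not anything the benchmark defines.

```python
def summarize(correct_runs, n_runs=20, reliable_share=0.5):
    """Compare 'solved at least once' with 'solved reliably'.

    correct_runs: number of correct runs (out of n_runs) for each question.
    reliable_share: assumed cutoff for calling a question reliably solved.
    """
    solved_any = sum(1 for c in correct_runs if c >= 1)
    solved_reliably = sum(1 for c in correct_runs if c / n_runs >= reliable_share)
    return solved_any, solved_reliably


# Hypothetical counts for five questions, twenty runs each.
correct_runs = [20, 1, 4, 0, 12]
solved_any, solved_reliably = summarize(correct_runs)
print(solved_any, solved_reliably)  # prints: 4 2
```

Under the one-in-twenty rule the questions with 1 and 4 correct runs count the same as the one with 20, which is the gap between a lucky hit and a reliable answer that the post points at.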

Almost half of the kindergartens in Vienna that need German language support receive none

It is now official: the Neos have failed to reach their goal of raising the number of language support staff to 500 within five years. City councillor Emmerling points to staff shortages; the Greens see a management failure

DER STANDARD