ICLR 2026 – Institutional Affiliations Dataset and Analysis

An open-source pipeline and dataset have been released covering author institutional affiliations extracted directly from the PDFs of the 5,356 papers accepted at ICLR 2026 and then cleaned. Deriving affiliations from the PDFs avoids the errors that plague OpenReview profile-based affiliation data, and the release includes visualization charts ranking institutions by paper count. The pipeline covers the full process from PDF parsing through normalization to visualization, making it useful for researchers and AI engineers analyzing research trends by institution. The source code and dataset are available on GitHub, and the results can be reproduced.
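As a minimal sketch of working with the released data, here is how one could recompute per-institution paper counts; the file name and the paper_id/institution column names are assumptions for illustration, not the repository's documented schema.

```python
import pandas as pd

# Hypothetical file and column names -- check the repo for the real schema.
df = pd.read_csv("affiliations.csv")  # one row per (paper, institution) pair

counts = (
    df.groupby("institution")["paper_id"]
      .nunique()                       # distinct papers per institution
      .sort_values(ascending=False)
)
print(counts.head(20))
```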

https://github.com/DmytroLopushanskyy/iclr2026-affiliations

#dataset #pdfparser #iclr #bibliometrics #machinelearningresearch

GitHub - DmytroLopushanskyy/iclr2026-affiliations: PDF-derived institutional affiliations for 5,356 ICLR 2026 accepted papers — full pipeline (scrape → parse → render), clean dataset (CSV + XLSX), and treemap charts.

SCoRe is a two-stage on-policy RL recipe that teaches a language model to revise its own answers using only self-generated data. On Gemini 1.5 Flash and 1.0 Pro it gains 15.6 points on MATH and 9.1 on HumanEval over the base model. At matched inference budgets, sequential self-correction beats parallel sampling at up to 32 samples.
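To make the matched-budget comparison concrete, here is a sketch of the two inference strategies, each spending the same n model calls; generate and pick_best are hypothetical stand-ins, and the revision prompt is illustrative rather than SCoRe's training recipe.

```python
def parallel_sampling(problem, generate, pick_best, n):
    # Spend all n calls on independent samples, then select one
    # (e.g. by majority vote or an external verifier).
    candidates = [generate(problem) for _ in range(n)]
    return pick_best(candidates)

def sequential_self_correction(problem, generate, n):
    # Spend the n calls on a chain of revisions of a single answer,
    # the regime a SCoRe-trained model is optimized for.
    answer = generate(problem)
    for _ in range(n - 1):
        answer = generate(
            f"{problem}\n\nPrevious attempt:\n{answer}\n"
            "There may be an error in the attempt above. Revise it."
        )
    return answer
```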

https://benjaminhan.net/posts/20260512-score/?utm_source=mastodon&utm_medium=social

#Paper #LLMs #RL #Metacognition #Reasoning #ICLR #AI

Training Language Models to Self-Correct via Reinforcement Learning (SCoRe) – synesis

A two-stage on-policy RL recipe teaches Gemini 1.0 Pro and 1.5 Flash to revise their own answers, gaining 15.6 points on MATH and 9.1 points on HumanEval over the base model.

Let's Verify Step by Step compares process and outcome supervision on MATH. The process-reward model reaches 78.2% with best-of-1860 sampling vs. 72.4% for the outcome-reward model. But that gap narrows fast at small N, where most deployments actually live.
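For reference, best-of-N with a process reward model looks roughly like this; prm_step_score and sample_solution are hypothetical, and the product aggregation reflects the paper's choice of scoring a solution as the probability that every step is correct.

```python
import math

def solution_score(steps, prm_step_score):
    # PRM score of a solution: probability that every step is correct,
    # computed as the product of per-step correctness probabilities.
    return math.prod(prm_step_score(s) for s in steps)

def best_of_n(problem, sample_solution, prm_step_score, n=16):
    # Sample n candidate solutions (each a list of reasoning steps)
    # and return the one the PRM ranks highest.
    candidates = [sample_solution(problem) for _ in range(n)]
    return max(candidates, key=lambda s: solution_score(s, prm_step_score))
```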

https://benjaminhan.net/posts/20260512-lets-verify-step-by-step/?utm_source=mastodon&utm_medium=social

#Paper #LLMs #Reasoning #Mathematics #ICLR #OpenAI #AI

Let’s Verify Step by Step – synesis

OpenAI compares outcome vs. process supervision for math reasoning and finds that step-level human feedback trains dramatically more reliable reward models on MATH.

Conformal Language Modeling (CLM) adapts conformal prediction to generative LMs: sample candidates, stop when a calibrated rule fires, return a set guaranteed to contain an acceptable answer. The more interesting half is the component-level filter — per-phrase coverage, not just set-level. That's the primitive for hallucination flagging: highlight the vetted phrases, leave the rest for review.
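A sketch of the loop as described, plus the per-phrase variant; every lam_* threshold stands in for a value produced by CLM's calibration step, and the scoring functions are hypothetical.

```python
def conformal_generate(prompt, sample, quality, set_confidence,
                       lam_quality, lam_stop, k_max=32):
    # Sample candidates one at a time, keep those passing a calibrated
    # per-candidate quality filter, and stop once a calibrated
    # set-level confidence rule fires.
    out = []
    for _ in range(k_max):
        y = sample(prompt)
        if quality(prompt, y) >= lam_quality:
            out.append(y)
        if set_confidence(prompt, out) >= lam_stop:
            break
    return out  # calibrated to contain an acceptable answer

def vetted_phrases(phrases, phrase_score, lam_phrase):
    # Component-level filter: keep only phrases whose calibrated score
    # clears the threshold -- the hallucination-flagging primitive.
    return [p for p in phrases if phrase_score(p) >= lam_phrase]
```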

https://benjaminhan.net/posts/20260505-conformal-language-modeling/?utm_source=mastodon&utm_medium=social

#ConformalPrediction #LLMs #Hallucination #ICLR #AI

Conformal Language Modeling – synesis

A conformal-prediction procedure for generative language models that samples until a calibrated stopping rule fires, filters low-quality candidates, and returns a set guaranteed to contain an acceptable answer.

MASS optimizes multi-agent LLM systems by interleaving prompt and topology search: block-level prompts, topology rejection sampling, then workflow-level prompts.

Topology gets quietly demoted. Ablation on Gemini 1.5 Pro: ~6% gain from block prompts, 3% from topology, 2% from workflow prompts. Prompt tuning dominates — contradicts the topology-first thesis of ADAS and AFlow.
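In skeleton form, the three stages interleave like this; all objects and helpers are hypothetical stand-ins, since the post doesn't pin down MASS's actual interfaces.

```python
def mass_search(blocks, topology_space, evaluate,
                optimize_prompt, sample_topology, budget):
    # Stage 1: block-level prompt optimization, one block at a time.
    for block in blocks:
        block.prompt = optimize_prompt(block, evaluate)

    # Stage 2: rejection-sample topologies built from optimized blocks,
    # keeping the best-scoring workflow.
    best, best_score = None, float("-inf")
    for _ in range(budget):
        topo = sample_topology(topology_space, blocks)
        score = evaluate(topo)
        if score > best_score:
            best, best_score = topo, score

    # Stage 3: a final prompt pass at the workflow level,
    # conditioned on the selected topology.
    best.prompt = optimize_prompt(best, evaluate)
    return best
```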

https://benjaminhan.net/posts/20260430-multi-agent-system-search/?utm_source=mastodon&utm_medium=social

#LLMs #AI #AgenticSystems #PromptEngineering #Google #ICLR

Multi-Agent Design: Optimizing Agents with Better Prompts and Topologies – synesis

Google’s MASS framework optimizes multi-agent systems by interleaving prompt and topology search across three stages, beating ADAS and AFlow on eight benchmarks.

DSPy turns LM pipelines into typed-module graphs and compiles them end-to-end against a single metric, bootstrapping its own few-shot demonstrations.

The programming-model layer is the real contribution, not any specific teleprompter. Once pipelines are typed graphs, pipeline-level search (MASS, MIPRO) becomes possible in a way it wasn't with string-template prompts.
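A minimal usage sketch against the DSPy API (names follow recent DSPy releases and have shifted across versions, so treat the specifics as approximate; the model name, metric, and training example are placeholders):

```python
import dspy
from dspy.teleprompt import BootstrapFewShot

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # placeholder model

class GenerateAnswer(dspy.Signature):
    """Answer a question with a short factual answer."""
    question = dspy.InputField()
    answer = dspy.OutputField()

qa = dspy.ChainOfThought(GenerateAnswer)  # a typed module, not a string template

def exact_match(example, pred, trace=None):
    return example.answer.lower() == pred.answer.lower()

trainset = [dspy.Example(question="...", answer="...").with_inputs("question")]

# The teleprompter compiles the pipeline against the metric,
# bootstrapping its own few-shot demonstrations from trainset.
compiled_qa = BootstrapFewShot(metric=exact_match).compile(qa, trainset=trainset)
```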

https://benjaminhan.net/posts/20260430-dspy/?utm_source=mastodon&utm_medium=social

#LLMs #AI #PromptEngineering #NLP #Stanford #ICLR

DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines – synesis

DSPy introduces a programming model and compiler for LM pipelines: treat prompts as modules with typed signatures, then let a teleprompter bootstrap demonstrations that beat hand-engineered chains.

EvoPrompt runs an evolutionary search over a population of prompts, with an LLM implementing crossover and mutation. Differential Evolution beats Genetic Algorithm on most BIG-Bench Hard tasks.

One of the cleanest early examples of an LLM as *operator* in an optimization loop, not as the thing being optimized. That pattern then shows up across prompt-and-agent design: DSPy teleprompters, MASS, MetaSPO.
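The operator pattern in sketch form, GA-style; llm is a hypothetical text-in/text-out call, fitness scores a prompt on a dev set, and the operator prompt is illustrative rather than EvoPrompt's exact template.

```python
import random

def evoprompt_ga(seed_prompts, llm, fitness, generations=10):
    pop = [(p, fitness(p)) for p in seed_prompts]
    for _ in range(generations):
        # Fitness-weighted parent selection.
        (p1, _), (p2, _) = random.choices(pop, weights=[s for _, s in pop], k=2)
        # The LLM *is* the variation operator: crossover plus mutation.
        child = llm(
            "Cross over the two prompts below into one new prompt, "
            "then mutate it slightly.\n"
            f"Prompt 1: {p1}\nPrompt 2: {p2}"
        )
        pop.append((child, fitness(child)))
        pop.sort(key=lambda ps: ps[1], reverse=True)
        pop = pop[: len(seed_prompts)]  # fixed population size
    return pop[0][0]  # best prompt found
```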

https://benjaminhan.net/posts/20260430-evoprompt/?utm_source=mastodon&utm_medium=social

#LLMs #AI #PromptEngineering #NLP #Microsoft #ICLR

EvoPrompt: Connecting LLMs with Evolutionary Algorithms Yields Powerful Prompt Optimizers – synesis

EvoPrompt uses LLMs to implement coherent crossover and mutation operators over a population of prompts, beating human-engineered prompts by up to 25% on BIG-Bench Hard.

SelfReflect measures whether an LLM's text summary of its uncertainty matches its actual answer distribution. Across 20 modern models: it doesn't, unless the model sees samples of its own answers first.

The negative result does more work than the metric itself. Fits a growing line where LLM self-reports shouldn't be trusted as introspection. Practical workaround isn't cheap: N forward passes to sample, then a summarize pass.
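The workaround, sketched: N forward passes to expose the model's own answer distribution, then one summarize pass conditioned on those samples; llm is a hypothetical call and the prompt wording is illustrative.

```python
from collections import Counter

def summarize_own_uncertainty(question, llm, n=20):
    # N forward passes: sample the answer distribution.
    answers = [llm(question) for _ in range(n)]
    dist = Counter(answers)

    # One more pass: summarize uncertainty conditioned on the samples,
    # which is what the paper finds models need in order to be faithful.
    sample_block = "\n".join(f"{a} ({c}/{n})" for a, c in dist.most_common())
    return llm(
        f"Question: {question}\n"
        f"Your sampled answers and their frequencies:\n{sample_block}\n"
        "Summarize your uncertainty about the answer in one paragraph."
    )
```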

https://benjaminhan.net/posts/20260430-selfreflect-internal-distribution/?utm_source=mastodon&utm_medium=social

#LLMs #AI #Evaluation #Apple #ICLR

SelfReflect: Can LLMs Communicate Their Internal Answer Distribution? – synesis

An Apple paper introduces an information-theoretic metric for whether an LLM’s text summary of its own uncertainty matches the distribution it actually samples from, and finds that current models cannot do this without sampling help.

Wes Roth (@WesRoth)

Sakana AI unveiled TRINITY at ICLR 2026. Instead of making a single giant AI model even bigger, TRINITY has a lightweight coordinator dynamically distribute tasks across multiple frontier models to solve problems. It is a new AI architecture proposal that challenges the prevailing scaling-centric approach.
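The post gives only the high-level shape, so purely as a generic illustration (not TRINITY's actual mechanism), a coordinator of this kind can be as simple as a cheap model classifying each task and routing it to a pool of specialists:

```python
def coordinate(task, coordinator_llm, pool):
    # pool maps a capability tag to a frontier-model client; the tags
    # and the routing prompt are illustrative, not TRINITY's design.
    tag = coordinator_llm(
        f"Classify this task as one of {sorted(pool)}:\n{task}"
    ).strip()
    return pool.get(tag, pool["general"])(task)

# Example pool: {"code": code_model, "math": math_model, "general": general_model}
```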

https://x.com/WesRoth/status/2048809402254192698

#sakanaai #trinity #iclr #multimodel #aiarchitecture

Wes Roth (@WesRoth) on X

Sakana AI unveiled TRINITY at ICLR 2026, challenging the industry's obsession with endlessly scaling massive, monolithic AI models. Instead, TRINITY introduces a lightweight "coordinator" that dynamically routes tasks across a diverse pool of existing frontier models to solve problems.

AGENDA 🗓️ | #ICLR 2026

From April 23 to 27, the 2026 International Conference on Learning Representations (ICLR), one of the most important global gatherings in artificial intelligence and machine learning, will take place in Rio de Janeiro.

On April 26 at 9:00 (GMT-3), during the workshop "Logical Reasoning in Large Language Models," Sofía Martinelli will present a paper proposing a taxonomy of resources for evaluating bias in reasoning models, participating with both a talk and a poster presentation. The paper was written with Luciana Benotti and Guido Ivetta.

More information: https://iclr.cc/