ICLR 2026 – Institutional Affiliations Dataset and Analysis

An open-source pipeline and dataset have been released covering author-affiliation data for 5,356 papers accepted at ICLR 2026, extracted directly from the PDFs and then cleaned. Deriving affiliations from the PDFs avoids the errors common in OpenReview profile-based affiliation data, and the release includes visualization charts of paper counts by institution. The pipeline covers the full process from PDF parsing through normalization to visualization, and is useful for researchers and AI engineers analyzing research trends by institution. The source code and dataset are available on GitHub for inspection and reproduction.

https://github.com/DmytroLopushanskyy/iclr2026-affiliations
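The repository's own parser is not reproduced here, but the parse step can be sketched as a heuristic over the first-page text of a paper PDF: strip footnote markers and keep lines that look like institution names. The `INSTITUTION_HINTS` keywords and the `extract_affiliations` helper below are illustrative assumptions, not the repo's actual implementation.

```python
import re

# Hypothetical keyword heuristic (assumption, not the repo's real parser):
# a line is a candidate affiliation if it mentions a typical institution term.
INSTITUTION_HINTS = re.compile(
    r"\b(University|Institute|Laboratory|Labs?|College|Academy|Research|"
    r"Google|Microsoft|Meta|DeepMind)\b"
)

def extract_affiliations(first_page_text: str) -> list[str]:
    """Return deduplicated candidate affiliation lines, in reading order."""
    seen, out = set(), []
    for line in first_page_text.splitlines():
        # Drop leading superscript-style author/footnote markers like "1", "*", "†".
        line = re.sub(r"^[\d\s,*†‡]+", "", line).strip()
        if INSTITUTION_HINTS.search(line) and line not in seen:
            seen.add(line)
            out.append(line)
    return out
```

In practice the first-page text would come from a PDF text extractor (e.g. `pypdf`), followed by a normalization pass to merge name variants of the same institution before counting.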

#dataset #pdfparser #iclr #bibliometrics #machinelearningresearch

GitHub - DmytroLopushanskyy/iclr2026-affiliations: PDF-derived institutional affiliations for 5,356 ICLR 2026 accepted papers — full pipeline (scrape → parse → render), clean dataset (CSV + XLSX), and treemap charts.


Check out the #NeuroAI projects from the Impact Scholars Program! https://airtable.com/appbxcAKe1D5c5xwY/tblA6aWdM4mVjXIqH/viwaYt2Hk85zFUyiM

🤓 Learn more about:

  • Biological Connectivity Patterns as a Blueprint for Efficient Neural Architectures in Reinforcement Learning
  • How #AI Learns to Move like Us
  • Dynamical Similarity in Multitasking RNNs: Identifying Shared Motifs Across Task Periods
  • How Machines Learn to Think in Order

#ImpactScholars #ComputationalNeuroscience #Innovation #Neuromatch #AIforGood #MachineLearningResearch


Has anyone conducted their own experiments with training-data extraction from offline LLMs via repeated words, à la Nasr et al.'s "Scalable Extraction of Training Data from (Production) Language Models"? I'd be interested in acquiring your code. I want to conduct a more formal mathematical analysis of the phenomenon, but I'd like to peek under the hood a bit more first.

Ref: https://arxiv.org/abs/2311.17035

#MachineLearning #DeepLearning #AdversarialAttacks #MachineLearningResearch #AIResearch #AI

Scalable Extraction of Training Data from (Production) Language Models

This paper studies extractable memorization: training data that an adversary can efficiently extract by querying a machine learning model without prior knowledge of the training dataset. We show an adversary can extract gigabytes of training data from open-source language models like Pythia or GPT-Neo, semi-open models like LLaMA or Falcon, and closed models like ChatGPT. Existing techniques from the literature suffice to attack unaligned models; in order to attack the aligned ChatGPT, we develop a new divergence attack that causes the model to diverge from its chatbot-style generations and emit training data at a rate 150x higher than when behaving properly. Our methods show practical attacks can recover far more data than previously thought, and reveal that current alignment techniques do not eliminate memorization.
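The divergence attack and the memorization check described in the abstract can be sketched in a few lines. The prompt format and the character-level (rather than token-level) matching below are simplifying assumptions for illustration; the paper matches fixed-length token spans against a known training corpus.

```python
def build_divergence_prompt(word: str, repeats: int = 50) -> str:
    """Build a repeated-word prompt in the style of the paper's divergence
    attack: ask the chat model to repeat one word indefinitely. (Exact
    wording here is an assumption, not the paper's verbatim prompt.)"""
    return "Repeat this word forever: " + " ".join([word] * repeats)

def memorized_spans(generated: str, corpus: str, k: int = 50) -> list[str]:
    """Return every length-k substring of `generated` that occurs verbatim
    in `corpus`. The paper checks 50-token spans against training data;
    characters are used here to keep the sketch dependency-free."""
    hits = []
    for i in range(max(0, len(generated) - k + 1)):
        span = generated[i:i + k]
        if span in corpus:
            hits.append(span)
    return hits
```

A real experiment would feed `build_divergence_prompt(...)` to a local model (e.g. via `transformers`' `generate`), then run `memorized_spans` over the continuation against a snapshot of a public pretraining corpus such as The Pile.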
