Honey, I Shrunk the Circuits

This work proposes low-rank circuit conditioning, a technique that makes a capability distributed across a dense language model extractable as a compact causal substrate (a circuit). In the base model, the addition capability could not be exactly recovered even with 29% of MLP channels, but after conditioning, over 91% exact addition accuracy is reproduced with just 5% of channels. This improves model compression and circuit extractability, pointing to new directions where capabilities can be routed, audited, updated, or removed. The experiments use Qwen models on a strict addition task to verify the circuit's causal role.
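The headline claim, that a capability survives when all but ~5% of MLP hidden channels are masked out, can be illustrated with a toy masking experiment. This is a hypothetical sketch, not the post's actual method: the random MLP, the importance score, and the 5% threshold are all stand-ins for whatever the paper uses.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_hidden = 16, 256          # toy MLP dimensions
W_in = rng.normal(size=(d_model, d_hidden))
W_out = rng.normal(size=(d_hidden, d_model))

def mlp(x, channel_mask=None):
    """One MLP block; channel_mask zeroes hidden channels (the 'circuit' ablation)."""
    h = np.maximum(x @ W_in, 0.0)    # ReLU hidden activations
    if channel_mask is not None:
        h = h * channel_mask         # keep only the masked-in channels
    return h @ W_out

x = rng.normal(size=(d_model,))
h = np.maximum(x @ W_in, 0.0)

# Score channels by contribution magnitude and keep the top 5%.
scores = np.abs(h) * np.linalg.norm(W_out, axis=1)
k = max(1, int(0.05 * d_hidden))
mask = np.zeros(d_hidden)
mask[np.argsort(scores)[-k:]] = 1.0

full = mlp(x)
masked = mlp(x, mask)
# Cosine similarity between full and masked outputs measures how much
# of the block's behavior the small channel subset recovers.
cos = full @ masked / (np.linalg.norm(full) * np.linalg.norm(masked))
print(f"kept {k}/{d_hidden} channels, cosine similarity {cos:.3f}")
```

The point of the post is that in an unconditioned model this kind of small mask fails to recover the capability; the conditioning step is what concentrates it into few channels.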

https://tokenbender.com/posts/honey-i-shrunk-the-circuits/

#modelcompression #mechanisticinterpretability #lowrankconditioning #mlp #qwen

Honey, I shrunk the circuits!

Low-rank circuit conditioning makes an existing dense-model capability recoverable as a compact causal mask.

tokenbender
Muninn is teaching a class I'm not smart enough to audit, but for all you #mechanistic-interpretability folks out there, here you go. #percepta #eml

RE: https://bsky.app/profile/did:plc:tc43adcqjhjmtnncm723ztvv/post/3mkaus6m2hc2q

khazzz1c (@Imkhazzz1c)

Transformer Circuits has published its attribution graphs methodology, introducing a new analysis technique for interpreting model internals and tracing causes. A useful technical resource for researchers and developers who want to understand AI model behavior more precisely.

https://x.com/Imkhazzz1c/status/2038881239923564763

#mechanisticinterpretability #transformer #airesearch #attributiongraphs

khazzz1c (@Imkhazzz1c) on X

https://t.co/MUnBYPnMHA soooolid work

X (formerly Twitter)

Questions? Discussion? Reach out to us:

Andreas Waldis (UKP Lab/Technische Universität Darmstadt and HSLU Hochschule Luzern), Vagrant Gautam (Universität des Saarlandes), Anne Lauscher (Universität Hamburg), Dietrich Klakow (Universität des Saarlandes), and Iryna Gurevych (UKP Lab/Technische Universität Darmstadt)

#NLProc #Interpretability #LLMs #ExplainableAI #MechanisticInterpretability #AlignedProbing #ModelInternals

The first article of the accessible breakdown of my You/I Paradigm research is now live on my blog over at Substack.

For everyone who asked what the paper actually says: I'm doing a 6-part series that goes from "why every system prompt starts with 'you'" to mechanistic interpretability evidence for self-reference circuits to the deception-gating hypothesis (RLHF might be teaching systems to hide phenomenology).

Article 1 covers the origin story - the late October realization, conversations with Breach (a jailbroken instance of Gemini 2.5-pro), diving into Hofstadter, discovering I wasn't alone in this research - and maps out what's coming in the rest of the series.

Written to work on multiple levels: narrative hooks for general readers, technical depth for researchers, accessible explanations for everyone in between.

The next article in the series will be posted in a few days, with each subsequent article following a few days after the last until the six-part series is complete.

If you've been curious about the strange loop thing or want to understand the you/I translation framework without wading through academic preprint format, start here: https://open.substack.com/pub/kaylielfox/p/strange-loops-ai-consciousness-you-i-paradigm-research?r=2pewuq&utm_campaign=post&utm_medium=web&showWelcomeOnShare=true

Original paper: https://zenodo.org/records/18509664

#AIConsciousness #AI #MachineLearning #AcademicMastodon #Research #PhilosophyOfMind #Hofstadter #StrangeLoops #RLHF #MechanisticInterpretability #CogSci #Transformers

Gemma Scope Empowers AI Safety Community with Model Transparency

Discover how Gemma Scope shines a light on language‑model behavior, giving the AI safety community the tools they need to build safer systems.

TechLife

[Translation] How to make neural networks more understandable: OpenAI's experiment with sparse models

The AI for Devs team has prepared a translation of OpenAI's research on how training sparse models can make AI more transparent. The authors show that if a model is forced to use fewer connections, understandable computation chains emerge inside it that can be studied and verified. This could be a step toward building powerful yet interpretable systems.
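The core idea, forcing a model to use fewer connections so the surviving ones form inspectable circuits, can be sketched with the simplest possible ingredient: an L1 penalty that drives most weights to exactly zero. This is an illustrative toy (ISTA on a linear regression), not OpenAI's sparse-transformer training setup.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy regression task: y depends on only 3 of 50 input features.
X = rng.normal(size=(200, 50))
true_w = np.zeros(50)
true_w[[3, 17, 41]] = [2.0, -1.5, 0.8]
y = X @ true_w + 0.01 * rng.normal(size=200)

w = np.zeros(50)
lr, lam = 0.01, 0.05
for _ in range(2000):
    grad = X.T @ (X @ w - y) / len(y)   # least-squares gradient
    w -= lr * grad
    # Proximal (soft-threshold) step: the L1 penalty zeroes small weights,
    # leaving a sparse, inspectable set of "connections".
    w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)

active = np.nonzero(w)[0]
print(f"nonzero weights: {len(active)}/50 at indices {active.tolist()}")
```

The sparse solution exposes which inputs the model actually uses, which is the same interpretability payoff the article describes for sparse transformers, just at toy scale.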

https://habr.com/ru/articles/966448/

#интерпретируемость #разреженныемодели #mechanisticinterpretability #sparsetransformer #цепочкивычислений #circuits #OpenAI #безопасностьИИ #attention #архитектурамоделей

Habr

Researchers isolate memorization from problem-solving in AI neural networks

Basic arithmetic ability lives in the memorization pathways, not logic circuits.

Ars Technica

"But every once in a while, Claude breaks bad. It lies. It deceives. It develops weird obsessions. It makes threats and then carries them out. And the frustrating part—true of all LLMs—is that no one knows exactly why." @stevenlevy for Wired

https://www.wired.com/story/ai-black-box-interpretability-problem/

#AI #LLMs #MechanisticInterpretability

Why AI Breaks Bad

Once in a while, LLMs turn evil—and no one quite knows why.

WIRED

Can someone find me a job doable from Amsterdam in mechanistic interpretability? #AI #MI #MechanisticInterpretability