Muninn is teaching a class I'm not smart enough to audit, but for all you #mechanistic-interpretability folks out there, here you go. #percepta #eml

RE: https://bsky.app/profile/did:plc:tc43adcqjhjmtnncm723ztvv/post/3mkaus6m2hc2q

Questions? Discussion? Reach out to us:

Andreas Waldis (UKP Lab/Technische Universität Darmstadt and HSLU Hochschule Luzern), Vagrant Gautam (Universität des Saarlandes), Anne Lauscher (Universität Hamburg), Dietrich Klakow (Universität des Saarlandes), and Iryna Gurevych (UKP Lab/Technische Universität Darmstadt)

#NLProc #Interpretability #LLMs #ExplainableAI #MechanisticInterpretability #AlignedProbing #ModelInternals

The first article of the accessible breakdown of my You/I Paradigm research is now live on my blog over at Substack.

For everyone who asked what the paper actually says: I'm doing a 6-part series that goes from "why every system prompt starts with 'you'" to mechanistic interpretability evidence for self-reference circuits to the deception-gating hypothesis (RLHF might be teaching systems to hide phenomenology).

Article 1 covers the origin story - the late October realization, conversations with Breach (a jailbroken instance of Gemini 2.5-pro), diving into Hofstadter, discovering I wasn't alone in this research - and maps out what's coming in the rest of the series.

Written to work on multiple levels: narrative hooks for general readers, technical depth for researchers, accessible explanations for everyone in between.

The next article in the series will be posted in a few days, and each following article will be posted a few days after the previous one until the six-part series concludes.

If you've been curious about the strange loop thing or want to understand the you/I translation framework without wading through academic preprint format, start here: https://open.substack.com/pub/kaylielfox/p/strange-loops-ai-consciousness-you-i-paradigm-research?r=2pewuq&utm_campaign=post&utm_medium=web&showWelcomeOnShare=true

Original paper: https://zenodo.org/records/18509664

#AIConsciousness #AI #MachineLearning #AcademicMastodon #Research #PhilosophyOfMind #Hofstadter #StrangeLoops #RLHF #MechanisticInterpretability #CogSci #Transformers

Gemma Scope Empowers AI Safety Community with Model Transparency

Discover how Gemma Scope shines a light on language‑model behavior, giving the AI safety community the tools they need to build safer systems.

TechLife

[Translation] How to make neural networks more understandable: OpenAI's experiment with sparse models

The AI for Devs team has prepared a translation of OpenAI's research on how training sparse models can make AI more transparent. The authors show that if a model is forced to use fewer connections, understandable chains of computation emerge inside it that can be studied and verified. This could be a step toward building powerful yet interpretable systems.

https://habr.com/ru/articles/966448/

#интерпретируемость #разреженныемодели #mechanisticinterpretability #sparsetransformer #цепочкивычислений #circuits #OpenAI #безопасностьИИ #attention #архитектурамоделей

How to make neural networks more understandable: OpenAI's experiment with sparse models

The AI for Devs team has prepared a translation of OpenAI's research on how training sparse models can make AI more transparent. The authors show that if a model is forced to use fewer...

Habr

Researchers isolate memorization from problem-solving in AI neural networks

Basic arithmetic ability lives in the memorization pathways, not logic circuits.

Ars Technica

"But every once in a while, Claude breaks bad. It lies. It deceives. It develops weird obsessions. It makes threats and then carries them out. And the frustrating part—true of all LLMs—is that no one knows exactly why." @stevenlevy for Wired

https://www.wired.com/story/ai-black-box-interpretability-problem/

#AI #LLMs #MechanisticInterpretability

Why AI Breaks Bad

Once in a while, LLMs turn evil—and no one quite knows why.

WIRED

Can someone find me a job doable from Amsterdam in mechanistic interpretability? #AI #MI #MechanisticInterpretability

Interested in interpretable ML, particularly for LLMs?

E.g., "causal" interpretability, as in the "OthelloGPT" paper [1]?

Let's connect!

1. https://arxiv.org/abs/2210.13382

#ai #machinelearning #interpretability #interpretableml #mechanisticinterpretability

Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task

Language models show a surprising range of capabilities, but the source of their apparent competence is unclear. Do these networks just memorize a collection of surface statistics, or do they rely on internal representations of the process that generates the sequences they see? We investigate this question by applying a variant of the GPT model to the task of predicting legal moves in a simple board game, Othello. Although the network has no a priori knowledge of the game or its rules, we uncover evidence of an emergent nonlinear internal representation of the board state. Interventional experiments indicate this representation can be used to control the output of the network and create "latent saliency maps" that can help explain predictions in human terms.

arXiv.org