Jasmijn Bastings

@jasmijn@sigmoid.social

Thread: Excited to announce the v1.0 release of the Learning Interpretability Tool (🔥LIT), an interactive platform to debug, validate, and understand ML model behavior. This release brings exciting new features — including layouts, demos, and metrics — and a simplified Python API. https://pair-code.github.io/lit

(1/5)

Learning Interpretability Tool

@mega @davidbau Check out our preprint for more details and analysis: https://arxiv.org/abs/2304.14767

This was a really fun project with @mega, Katja Filippova, and Amir Globerson! #NLProc #NLP #XAI
Dissecting Recall of Factual Associations in Auto-Regressive Language Models

Transformer-based language models (LMs) are known to capture factual knowledge in their parameters. While previous work looked into where factual associations are stored, only little is known about how they are retrieved internally during inference. We investigate this question through the lens of information flow. Given a subject-relation query, we study how the model aggregates information about the subject and relation to predict the correct attribute. With interventions on attention edges, we first identify two critical points where information propagates to the prediction: one from the relation positions followed by another from the subject positions. Next, by analyzing the information at these points, we unveil a three-step internal mechanism for attribute extraction. First, the representation at the last-subject position goes through an enrichment process, driven by the early MLP sublayers, to encode many subject-related attributes. Second, information from the relation propagates to the prediction. Third, the prediction representation "queries" the enriched subject to extract the attribute. Perhaps surprisingly, this extraction is typically done via attention heads, which often encode subject-attribute mappings in their parameters. Overall, our findings introduce a comprehensive view of how factual associations are stored and extracted internally in LMs, facilitating future research on knowledge localization and editing.

@mega @davidbau Our study was inspired by works on knowledge tracing (Kevin Meng, @davidbau @peterbhase) and mechanistic interpretability (@kevrowan @AnthropicAI). It introduces an in-depth view of factual predictions and facilitates new research directions for knowledge localization & editing.
@mega @davidbau Through per-layer gradient × input analysis (similar to @gsarti_ et al.) and “patching” experiments on early-layer representations, we further show the importance of the subject enrichment process for attribute extraction.
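
A minimal sketch of what per-layer gradient × input scoring can look like with GPT-2 and Hugging Face transformers. This is not the paper's code: the prompt and the use of the argmax token as the target are illustrative, and the “patching” interventions are not shown.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Sketch only (not the paper's code): score each (layer, position) by
# ||gradient * hidden_state||, with the gradient taken w.r.t. the log-prob
# of the model's predicted attribute token.
tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

inputs = tok("Beats Music is owned by", return_tensors="pt")
outputs = model(**inputs, output_hidden_states=True)

# Keep gradients for the intermediate hidden states (non-leaf tensors).
for h in outputs.hidden_states:
    h.retain_grad()

logits = outputs.logits[0, -1]
target = logits.argmax()  # e.g. " Apple" if the model recalls the fact
logits.log_softmax(dim=-1)[target].backward()

# Gradient-x-input relevance per layer and position.
scores = torch.stack(
    [(h.grad[0] * h[0]).norm(dim=-1) for h in outputs.hidden_states]
)
print(scores.shape)  # (n_layers + 1, seq_len): embeddings plus each block
```

Under the paper's account, the subject positions should receive high scores in the early-to-middle layers, where the last-subject representation is being enriched.
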
@mega @davidbau Further analysis of these heads in the embedding space (@guy__dar) shows that they often encode subject-attribute mappings in their parameters. Some attention heads act as “knowledge hubs” with hundreds of such encoded mappings.
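
A rough sketch of this kind of embedding-space reading for a single GPT-2 attention head. The layer/head choice is hypothetical, and feeding the raw token embedding into the head is a simplification (the real residual stream also carries layer norm and context); this is not the paper's code.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Sketch only: read one head's value-output ("OV") mapping in vocabulary space.
# For a source token embedding e, the head roughly writes e @ W_V @ W_O into the
# destination position, so decoding that vector with the tied unembedding shows
# which tokens the head promotes for that source token.
tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

d_model, d_head = 768, 64
layer, head = 9, 8  # hypothetical head to inspect

with torch.no_grad():
    attn = model.transformer.h[layer].attn
    W_v = attn.c_attn.weight[:, 2 * d_model:]                      # value projection
    W_v_h = W_v[:, head * d_head:(head + 1) * d_head]              # this head's slice
    W_o_h = attn.c_proj.weight[head * d_head:(head + 1) * d_head]  # output projection rows
    W_ov = W_v_h @ W_o_h                                           # (d_model, d_model)

    E = model.transformer.wte.weight      # token embeddings, tied with the unembedding
    src = tok(" Apple", add_special_tokens=False)["input_ids"][0]
    vocab_scores = E[src] @ W_ov @ E.T    # what the head writes, read in vocab space

print(tok.convert_ids_to_tokens(vocab_scores.topk(10).indices.tolist()))
```
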
@mega @davidbau (B) information from the relation propagates to the prediction, and (C) the prediction representation “queries” the enriched subject to extract a specific attribute. Perhaps surprisingly, this extraction is typically done via attention heads.
@mega @davidbau Analyzing the information at these critical points, we unveil a three-step internal mechanism for attribute extraction: (A) the representation at the last-subject position goes through an enrichment process, driven by the early MLP layers, to encode many subject-related attributes.
@mega We prompt GPT-{2|J} with subject-relation queries (“Beats Music is owned by”) from CounterFact (@mengk20 @davidbau) and intervene on attention edges (similar to @hmohebbi75) to analyze how information is aggregated across layers and positions to predict the attribute (“Apple”).
This reveals two critical points where information propagates to the prediction: one from the relation positions ("is owned by", in the example) followed by another from the subject positions (“Beats Music”).
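
For intuition, here is a toy sketch of what knocking out attention edges means on a bare scaled-dot-product attention function. It only illustrates the intervention; the paper's experiments apply it inside GPT-2/GPT-J at chosen layers and track the change in the attribute's probability, which is not shown here.

```python
import torch
import torch.nn.functional as F

def causal_attention_with_knockout(q, k, v, blocked_edges=()):
    """Scaled dot-product attention with a causal mask, where selected
    (query_pos, key_pos) edges are knocked out: their scores are set to -inf,
    so no information flows from key_pos to query_pos."""
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    n = scores.size(-1)
    causal = torch.tril(torch.ones(n, n, dtype=torch.bool))
    scores = scores.masked_fill(~causal, float("-inf"))
    for q_pos, k_pos in blocked_edges:
        scores[..., q_pos, k_pos] = float("-inf")
    return F.softmax(scores, dim=-1) @ v

# Toy usage: block the last (prediction) position from reading positions 0-1,
# a stand-in for the subject span, and see how much its output changes.
q, k, v = (torch.randn(5, 16) for _ in range(3))
clean = causal_attention_with_knockout(q, k, v)
knocked = causal_attention_with_knockout(q, k, v, blocked_edges=[(4, 0), (4, 1)])
print((clean[-1] - knocked[-1]).norm())
```

Blocking the edges from the subject or relation positions into the last position at different layers is what localizes the two critical propagation points described above.
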
LMs capture many factual associations, but how do they recall them internally during inference? In a new preprint, we find that LMs build attribute-rich subject representations, from which attention heads extract the predicted attribute.
(with Mor Geva @mega, Katja Filippova, and Amir Globerson) 🧵 #NLP #NLProc

Airbnb

Waterbnb

Firebnb

Earthbnb

Long ago, the four bed-and-breakfasts lived in harmony

Everything changed when the Firebnb attacked