We present Context Length Probing, an embarrassingly simple, model-agnostic, #blackbox explanation technique for causal (#GPT-like) language models.

The idea is simply to check how predictions change as the left-hand context is extended token by token. This allows assigning "differential importance scores" to contexts as shown in the video.
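For concreteness, here is a minimal sketch of the idea in Python (my own illustration, not the reference implementation from the repo; the model choice and variable names are arbitrary):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

ids = tok("The quick brown fox jumps over the lazy dog", return_tensors="pt").input_ids[0]
target = len(ids) - 1  # predict the last token from varying amounts of context

# Log-probability of the target token given 1, 2, 3, ... tokens of left context.
logps = []
with torch.no_grad():
    for start in range(target - 1, -1, -1):  # extend the context one token to the left
        logits = model(ids[start:target].unsqueeze(0)).logits[0, -1]
        logps.append(torch.log_softmax(logits, dim=-1)[ids[target]].item())

# Differential importance: how much the target's log-probability moved
# when one more context token (further to the left) was added.
scores = [after - before for before, after in zip(logps, logps[1:])]
```

A positive score means the added token made the target more likely; a negative one means it made it less likely.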

Paper: https://arxiv.org/abs/2212.14815
Code: https://github.com/cifkao/context-probing
Demo: https://cifkao.github.io/context-probing/

#explainability #interpretability #Transformer #NLProc

🧵1/4

Black-box language model explanation by context length probing

The increasingly widespread adoption of large language models has highlighted the need for improving their explainability. We present context length probing, a novel explanation technique for causal language models, based on tracking the predictions of a model as a function of the length of available context, and allowing differential importance scores to be assigned to different contexts. The technique is model-agnostic and does not rely on access to model internals beyond computing token-level probabilities. We apply context length probing to large pre-trained language models and offer some initial analyses and insights, including the potential for studying long-range dependencies. The source code and an interactive demo of the method are available.

This plot shows, on an example, how two different metrics (the LM loss and a metric based on KL divergence) change as the context length increases (from right to left). Some context tokens cause abrupt changes; we suggest interpreting these tokens as bringing important information not already covered by shorter contexts. 🧵2/4
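In code, the two metrics might look like this (a sketch under my reading of the paper; the function names and the exact direction of the KL are my assumptions):

```python
import torch.nn.functional as F

# logp_c:   log next-token distribution given a length-c context (shape [vocab_size])
# logp_max: the same distribution given the longest available context
# target:   index of the ground-truth next token
def lm_loss(logp_c, target):
    return -logp_c[target]  # cross-entropy of the ground truth at context length c

def kl_from_longest(logp_max, logp_c):
    # KL(p_max || p_c): how far the short-context prediction is from the
    # longest-context one (no ground-truth token needed)
    return F.kl_div(logp_c, logp_max, log_target=True, reduction="sum")
```

The differential scores are then just the successive differences of these curves as the context grows by one token.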

The technique works with any causal LM, as long as it was trained to accept arbitrary text fragments (not necessarily starting at a sentence or document boundary), which happens to be how large #GPT-like models (#GPT2, #GPT3, #GPTJ, ...) are usually trained.

The main trick lies in realizing that the necessary probabilities can be computed efficiently by running the model along a sliding window. 🧵3/4

Specifically, to compute the output distributions for all positions in a text of length N and all context lengths up to a maximum length C, we just need to run inference along a sliding window of length C, i.e. do N forward passes on segments of length ≤ C (see the illustration in my previous post).
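Roughly, in code (again my paraphrase of the trick, reusing `model` and the token tensor `ids` from the earlier sketch; storing every distribution like this is memory-hungry and only for illustration):

```python
# logp[c - 1][i] will hold the log next-token distribution at position i
# given the c context tokens ids[i - c + 1 : i + 1].
N, C = len(ids), 16
logp = [[None] * N for _ in range(C)]
with torch.no_grad():
    for s in range(N):  # one forward pass per window start => N passes total
        window = ids[s : s + C].unsqueeze(0)  # segment of length <= C
        lp = torch.log_softmax(model(window).logits[0], dim=-1)
        for j in range(window.shape[1]):  # position s + j has seen j + 1 tokens
            logp[j][s + j] = lp[j]
```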

Notice that this is a lot like generating a new sequence from the model the naïve way (without caching)! 🧵4/4
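To see the analogy, compare with a cache-free greedy decoding loop (again just a sketch, with the same assumed `model` and `ids` as above):

```python
# Naive, cache-free decoding: also one forward pass per token,
# each over a window of length <= C (greedy argmax just to keep it short).
ids2 = ids.unsqueeze(0)  # shape [1, T]
with torch.no_grad():
    for _ in range(10):  # generate 10 new tokens
        logits = model(ids2[:, -C:]).logits[0, -1]
        ids2 = torch.cat([ids2, logits.argmax().view(1, 1)], dim=1)
```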