Interpretable AI really wants to understand what individual neurons in LLMs are doing. But this effort is very likely to fail – and it's not the right approach to understanding what AI is doing and why.
Like, today, there's weirdly a lot of press about how OpenAI just showed that "Language models can explain neurons in language models" (https://openai.com/research/language-models-can-explain-neurons-in-language-models). But look at the metrics – this was largely a failed effort. By the paper's own scoring, GPT-4 *cannot explain* what most neurons in GPT-2 are doing.
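For context, the paper scores an explanation by having a simulator model predict the neuron's activations from the explanation alone, then comparing those predictions against the neuron's real activations. Here's a minimal sketch of that kind of correlation-style score – my own illustration, not the paper's actual pipeline; the activation arrays below are made-up stand-ins:

```python
import numpy as np

def explanation_score(true_acts, simulated_acts) -> float:
    """Correlation-style score: how well activations predicted from a
    natural-language explanation track the neuron's real activations.
    ~1.0 means the explanation captures the neuron's behavior; ~0 means
    it carries essentially no information about when the neuron fires."""
    true_acts = np.asarray(true_acts, dtype=float)
    simulated_acts = np.asarray(simulated_acts, dtype=float)
    if true_acts.std() == 0 or simulated_acts.std() == 0:
        return 0.0  # degenerate case: constant activations, no signal to explain
    return float(np.corrcoef(true_acts, simulated_acts)[0, 1])

# Hypothetical example: an explanation that only loosely tracks the neuron
true_acts = [0.0, 0.1, 2.3, 0.0, 1.8, 0.0, 0.2, 2.1]   # real activations per token
sim_acts  = [0.0, 0.5, 1.0, 0.3, 0.8, 0.4, 0.6, 0.9]   # simulator's guesses from the explanation
print(f"explanation score ~ {explanation_score(true_acts, sim_acts):.2f}")
```

A high score means the plain-language explanation actually predicts when the neuron fires; the point of the post is that, for the vast majority of neurons, it doesn't.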
More importantly, single-unit interpretability in LLMs is not the same as understanding why and what LLMs as a whole are doing. Even if you did understand when a handful of units activate, you would never be able to stitch those descriptions together into a general understanding of why an LLM says the words that it does.
LLMs may someday be able to explain themselves in plain language. But describing (in plain language) when each neuron fires is not going to get us there.
#interpretableAI #LLMs #openai