Tied Crosscoders: Explaining Chat Behavior from Base Model — LessWrong

Abstract

We are interested in model-diffing: finding what is new in the chat model compared to the base model. One way of doing this is training…