📢 We’ll present our TACL paper “𝗔𝗹𝗶𝗴𝗻𝗲𝗱 𝗣𝗿𝗼𝗯𝗶𝗻𝗴: 𝗥𝗲𝗹𝗮𝘁𝗶𝗻𝗴 𝗧𝗼𝘅𝗶𝗰 𝗕𝗲𝗵𝗮𝘃𝗶𝗼𝗿 𝗮𝗻𝗱 𝗠𝗼𝗱𝗲𝗹 𝗜𝗻𝘁𝗲𝗿𝗻𝗮𝗹𝘀” at #EACL2026 🇲🇦

🔥 Key finding:
LMs generate less toxic output when they more strongly encode input toxicity internally.
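For readers new to probing: the idea is to train a simple classifier on a model’s frozen hidden states and read its accuracy as a measure of how strongly a property is encoded. A minimal sketch with synthetic data (not the paper’s actual setup; the labels, dimensions, and “toxicity direction” here are all made up for illustration):

```python
# Minimal linear-probing sketch on synthetic data (hypothetical, for illustration):
# a linear classifier trained on frozen hidden states; its accuracy indicates
# how strongly the labeled property is linearly decodable from those states.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d = 200, 32                      # examples, hidden-state dimension (assumed)
labels = rng.integers(0, 2, n)      # synthetic toxic / non-toxic labels
direction = rng.normal(size=d)      # assumed "toxicity direction" in the space
# Synthetic hidden states: noise plus a label-dependent shift along that direction
hidden = rng.normal(size=(n, d)) + np.outer(labels, direction)

probe = LogisticRegression(max_iter=1000).fit(hidden, labels)
acc = probe.score(hidden, labels)   # high accuracy = property strongly encoded
print(f"probe accuracy: {acc:.2f}")
```

In this toy setup the shift makes the classes cleanly separable, so the probe scores near-perfectly; on real hidden states, probe accuracy varies by layer and model, which is what the paper relates to output toxicity.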

🧵 https://bsky.app/profile/tresiwald.bsky.social/post/3mdfswxr5jn2y

📄 https://arxiv.org/abs/2503.13390

Full paper & code: https://alignedprobing.github.io/

Questions? Discussion? Reach out to us:

Andreas Waldis (UKP Lab/Technische Universität Darmstadt and HSLU Hochschule Luzern), Vagrant Gautam (Universität des Saarlandes), Anne Lauscher (Universität Hamburg), Dietrich Klakow (Universität des Saarlandes), and Iryna Gurevych (UKP Lab/Technische Universität Darmstadt)

#NLProc #Interpretability #LLMs #ExplainableAI #MechanisticInterpretability #AlignedProbing #ModelInternals