πŸ” In this #ResearchMonday's spotlight: a pivotal ICLR 2024 paper about fine-tuning LLM. It reveals that fine-tuning may only superficially align models, without deeply altering their pre-trained capabilities.

The authors also express enthusiasm for future research aimed at not just masking but potentially deleting or unlearning certain pre-trained capabilities, enhancing the safety and reliability of AI systems. 🛡️🤖

📜 https://arxiv.org/abs/2311.12786
🧵 https://x.com/_robertkirk/status/1729531935637004717

Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks

Fine-tuning large pre-trained models has become the de facto strategy for developing both task-specific and general-purpose machine learning systems, including developing models that are safe to deploy. Despite its clear importance, there has been minimal work that explains how fine-tuning alters the underlying capabilities learned by a model during pretraining: does fine-tuning yield entirely novel capabilities or does it just modulate existing ones? We address this question empirically in synthetic, controlled settings where we can use mechanistic interpretability tools (e.g., network pruning and probing) to understand how the model's underlying capabilities are changing. We perform an extensive analysis of the effects of fine-tuning in these settings, and show that: (i) fine-tuning rarely alters the underlying model capabilities; (ii) a minimal transformation, which we call a 'wrapper', is typically learned on top of the underlying model capabilities, creating the illusion that they have been modified; and (iii) further fine-tuning on a task where such hidden capabilities are relevant leads to sample-efficient 'revival' of the capability, i.e., the model begins reusing these capabilities after only a few gradient steps. This indicates that practitioners can unintentionally remove a model's safety wrapper merely by fine-tuning it on a downstream task that is, e.g., superficially unrelated. We additionally perform analysis on language models trained on the TinyStories dataset to support our claims in a more realistic setup.
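As a side note, not from the paper itself: the "probing" tool the abstract mentions can be made concrete with a toy sketch. A linear probe is a small classifier trained on a model's internal activations; if it can decode a feature from those activations even after fine-tuning, that is evidence the underlying capability is still present. Everything below is hypothetical illustration (synthetic 4-dimensional "activations" with one capability-relevant direction), not the authors' code.

```python
import math
import random

random.seed(0)

# Synthetic stand-in for hidden activations: 4-dim vectors where
# dimension 2 linearly encodes a capability-relevant feature.
def make_example():
    x = [random.gauss(0, 1) for _ in range(4)]
    y = 1 if x[2] > 0 else 0  # latent feature lives in dimension 2
    return x, y

data = [make_example() for _ in range(500)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Train a logistic-regression probe with plain full-batch gradient descent.
w, b, lr = [0.0] * 4, 0.0, 0.5
for _ in range(200):
    gw, gb = [0.0] * 4, 0.0
    for x, y in data:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
        err = p - y  # gradient of log-loss w.r.t. the logit
        for i in range(4):
            gw[i] += err * x[i]
        gb += err
    n = len(data)
    for i in range(4):
        w[i] -= lr * gw[i] / n
    b -= lr * gb / n

# High probe accuracy = the feature is linearly decodable from activations.
acc = sum(
    (sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) > 0.5) == (y == 1)
    for x, y in data
) / len(data)
print(f"probe accuracy: {acc:.2f}")
```

In the paper's setting the probe would be run on real network activations before and after fine-tuning; a capability that probes well after fine-tuning is "wrapped", not removed.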

🔬 This week's #ResearchMonday features a study titled "Scalable Extraction of Training Data from (Production) Language Models", revealing the brittleness of alignment in AI. The research shows that simply prompting a model like ChatGPT to repeatedly output a single word can lead to the unintended disclosure of its training data! ⛓️‍💥 This finding challenges the effectiveness of current alignment techniques, highlighting significant security vulnerabilities. 🛡️

🧵 We are delighted to announce that our first paper, written at Parameter Lab together with Naver AI Lab, was accepted at #NeurIPS2023 as a spotlight! 🎉 It represents a pioneering step towards empowering individuals with awareness and control over their personal data online 🕵️‍♂️ Below is a thread presenting "ProPILE: Probing Personal Information Leakage from Large Language Models" 🧑‍🔬 This is the first of a series of #ResearchMonday tweets presenting one research paper each Monday about #LLM.