๐Ÿ” ๐—›๐—ผ๐˜„ ๐—ฐ๐—ฎ๐—ป ๐˜„๐—ฒ ๐—ฎ๐—ป๐—ผ๐—ป๐˜†๐—บ๐—ถ๐˜‡๐—ฒ ๐˜๐—ฒ๐˜…๐˜ ๐˜€๐—ผ ๐—Ÿ๐—Ÿ๐— ๐˜€ ๐—ฐ๐—ฎ๐—ปโ€™๐˜ ๐—ฟ๐—ฒ-๐—ถ๐—ฑ๐—ฒ๐—ป๐˜๐—ถ๐—ณ๐˜† ๐˜€๐—ฒ๐—ป๐˜€๐—ถ๐˜๐—ถ๐˜ƒ๐—ฒ ๐—ถ๐—ป๐—ณ๐—ผ๐—ฟ๐—บ๐—ฎ๐˜๐—ถ๐—ผ๐—ป โ€” ๐˜„๐—ต๐—ถ๐—น๐—ฒ ๐—ฝ๐—ฟ๐—ฒ๐˜€๐—ฒ๐—ฟ๐˜ƒ๐—ถ๐—ป๐—ด ๐˜‚๐˜๐—ถ๐—น๐—ถ๐˜๐˜† ๐—ณ๐—ผ๐—ฟ ๐—ฑ๐—ผ๐˜„๐—ป๐˜€๐˜๐—ฟ๐—ฒ๐—ฎ๐—บ ๐˜๐—ฎ๐˜€๐—ธ๐˜€?

๐Ÿš€ ๐—ช๐—ฒ ๐—ถ๐—ป๐˜๐—ฟ๐—ผ๐—ฑ๐˜‚๐—ฐ๐—ฒ ๐—ฅ๐—จ๐—ฃ๐—ง๐—”: ๐—ฅ๐—ผ๐—ฏ๐˜‚๐˜€๐˜ ๐—จ๐˜๐—ถ๐—น๐—ถ๐˜๐˜†-๐—ฃ๐—ฟ๐—ฒ๐˜€๐—ฒ๐—ฟ๐˜ƒ๐—ถ๐—ป๐—ด ๐—ง๐—ฒ๐˜…๐˜ ๐—”๐—ป๐—ผ๐—ป๐˜†๐—บ๐—ถ๐˜‡๐—ฎ๐˜๐—ถ๐—ผ๐—ป.

(1/5)

✅ 𝗥𝗨𝗣𝗧𝗔 uses LLMs to:
→ 𝗘𝘃𝗮𝗹𝘂𝗮𝘁𝗲 𝗽𝗿𝗶𝘃𝗮𝗰𝘆 𝗿𝗶𝘀𝗸 via simulated re-identification attacks (privacy evaluator).
→ 𝗠𝗲𝗮𝘀𝘂𝗿𝗲 𝘂𝘁𝗶𝗹𝗶𝘁𝘆 𝗿𝗲𝘁𝗲𝗻𝘁𝗶𝗼𝗻 for downstream tasks like classification (utility evaluator).
→ 𝗜𝘁𝗲𝗿𝗮𝘁𝗶𝘃𝗲𝗹𝘆 𝗼𝗽𝘁𝗶𝗺𝗶𝘇𝗲 𝘁𝗲𝘅𝘁 via lexicographic optimization: prioritize privacy, then maximize utility.
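To make the lexicographic idea concrete, here is a minimal runnable sketch of the loop described above. The three components are toy stand-ins, not the paper's actual LLM-based modules: "privacy risk" is just a count of known identifiers left in the text, "utility" is the number of words retained, and the rewrite step masks one identifier at a time. All names (`privacy_risk`, `utility_score`, `rewrite`, `anonymize`, `IDENTIFIERS`) are my own illustration.

```python
# Toy lexicographic anonymization loop in the spirit of RUPTA.
# These are illustrative stand-ins, NOT the paper's LLM-based evaluators.

IDENTIFIERS = {"alice", "berlin", "acme"}

def privacy_risk(text):
    # Stand-in privacy evaluator: how many known identifiers survive?
    return sum(1 for w in text.lower().split() if w.strip(".,") in IDENTIFIERS)

def utility_score(text):
    # Stand-in utility evaluator: more words kept = more signal retained.
    return len(text.split())

def rewrite(text):
    # Stand-in optimizer step: mask the first remaining identifier.
    words = text.split()
    for i, w in enumerate(words):
        if w.lower().strip(".,") in IDENTIFIERS:
            words[i] = "[MASKED]"
            break
    return " ".join(words)

def anonymize(text, max_iters=10):
    """Lexicographic loop: satisfy the privacy constraint first, then
    prefer the highest-utility candidate among the safe ones."""
    best = None
    for _ in range(max_iters):
        if privacy_risk(text) == 0:
            if best is None or utility_score(text) > utility_score(best):
                best = text
            break  # this toy rewrite only removes content, so stop once safe
        text = rewrite(text)
    return best

print(anonymize("Alice works at Acme in Berlin as a nurse."))
# → [MASKED] works at [MASKED] in [MASKED] as a nurse.
```

In the paper's setting, each stub would be an LLM call: the privacy evaluator mounts a simulated re-identification attack, the utility evaluator scores the downstream task, and the optimizer proposes revisions guided by both.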

(2/5)

⚙️ Supports 𝗰𝘂𝘀𝘁𝗼𝗺𝗶𝘇𝗮𝗯𝗹𝗲 𝗽𝗿𝗶𝘃𝗮𝗰𝘆-𝘂𝘁𝗶𝗹𝗶𝘁𝘆 𝘁𝗿𝗮𝗱𝗲-𝗼𝗳𝗳𝘀 and distillation into lightweight models for real-time use.
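Distillation here means supervised fine-tuning of a small student model on input/output pairs produced by the full LLM pipeline. A sketch of preparing such training data follows; the field names and prompt wording are illustrative assumptions, not the paper's actual format:

```python
import json

# Sketch: turn (original, anonymized) pairs produced by the full LLM
# pipeline into instruction-tuning records for a lightweight student model.
# Field names and the instruction text are my own, not from the paper.

pairs = [
    ("Alice, 34, nurse in Berlin.", "[MASKED], 34, nurse in [MASKED]."),
    ("Bob runs Acme Corp.", "[MASKED] runs [MASKED]."),
]

records = [
    {
        "instruction": "Anonymize the text while keeping it useful for "
                       "downstream classification.",
        "input": original,
        "output": anonymized,
    }
    for original, anonymized in pairs
]

# One JSON object per line: the usual JSONL format fine-tuning toolkits expect.
jsonl = "\n".join(json.dumps(r) for r in records)
print(jsonl.splitlines()[0])
```

The student model trained on such pairs then performs anonymization in a single forward pass, avoiding the iterative evaluator loop at inference time.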

📊 𝗢𝘂𝘁𝗽𝗲𝗿𝗳𝗼𝗿𝗺𝘀 𝗽𝗿𝗶𝗼𝗿 𝗺𝗲𝘁𝗵𝗼𝗱𝘀, achieving lower re-identification success rates and higher downstream accuracy on the DB-bio and PersonalReddit datasets.

(3/5)

Robust Utility-Preserving Text Anonymization Based on Large Language Models

Anonymizing text that contains sensitive information is crucial for a wide range of applications. Existing techniques face the emerging challenges of the re-identification ability of large language models (LLMs), which have shown advanced capability in memorizing detailed information and reasoning over dispersed pieces of patterns to draw conclusions. When defending against LLM-based re-identification, anonymization could jeopardize the utility of the resulting anonymized data in downstream tasks. In general, the interaction between anonymization and data utility requires a deeper understanding within the context of LLMs. In this paper, we propose a framework composed of three key LLM-based components: a privacy evaluator, a utility evaluator, and an optimization component, which work collaboratively to perform anonymization. Extensive experiments demonstrate that the proposed model outperforms existing baselines, showing robustness in reducing the risk of re-identification while preserving greater data utility in downstream tasks. We provide detailed studies on these core modules. To consider large-scale and real-time applications, we investigate the distillation of the anonymization capabilities into lightweight models. All of our code and datasets will be made publicly available at https://github.com/UKPLab/acl2025-rupta.

arXiv.org

(4/5)

Also consider following the authors Tianyu Yang (Ubiquitous Knowledge Processing (UKP) Lab, hessian.AI), Xiaodan Zhu (Department of Electrical and Computer Engineering & Ingenuity Labs Research Institute, Queen's University), and Iryna Gurevych (Ubiquitous Knowledge Processing (UKP) Lab).

(5/5)

#NLProc #ACL2025 #TextAnonymization #LLMSafety #AIPrivacy