LLMs Corrupt Your Documents When You Delegate

This paper addresses the problem of document content being corrupted when LLMs are delegated document-editing work. Evaluating 19 LLMs on DELEGATE-52, a simulation of long delegated workflows spanning 52 professional domains, the authors find that even the latest models corrupt an average of 25% of document content. Agentic tool use does not help, and the damage worsens with document size, interaction length, and the presence of distractor files, suggesting that LLMs are unreliable delegates for long-running tasks. This calls for caution about reliability and error accumulation when building LLM-based document automation and agents.

https://arxiv.org/abs/2604.15597

#llm #documentcorruption #delegation #aiagents #workflow

LLMs Corrupt Your Documents When You Delegate

Large Language Models (LLMs) are poised to disrupt knowledge work, with the emergence of delegated work as a new interaction paradigm (e.g., vibe coding). Delegation requires trust: the expectation that the LLM will faithfully execute the task without introducing errors into documents. We introduce DELEGATE-52 to study the readiness of AI systems in delegated workflows. DELEGATE-52 simulates long delegated workflows that require in-depth document editing across 52 professional domains, such as coding, crystallography, and music notation. Our large-scale experiment with 19 LLMs reveals that current models degrade documents during delegation: even frontier models (Gemini 3.1 Pro, Claude 4.6 Opus, GPT 5.4) corrupt an average of 25% of document content by the end of long workflows, with other models failing more severely. Additional experiments reveal that agentic tool use does not improve performance on DELEGATE-52, and that degradation severity is exacerbated by document size, length of interaction, and presence of distractor files. Our analysis shows that current LLMs are unreliable delegates: they introduce sparse but severe errors that silently corrupt documents, compounding over long interactions.
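The headline metric, the fraction of document content corrupted by the end of a workflow, can be approximated with a simple line-level diff against the expected document. This is an illustrative proxy only (the paper's actual scoring method is not described here); `corruption_rate` is a hypothetical helper, not part of DELEGATE-52:

```python
import difflib

def corruption_rate(original: str, edited: str) -> float:
    """Fraction of the original document's lines that no longer
    survive intact after an editing session.

    Uses difflib's longest-matching-blocks alignment, so moved-but-
    preserved lines inside a matching block still count as intact.
    """
    orig_lines = original.splitlines()
    matcher = difflib.SequenceMatcher(a=orig_lines, b=edited.splitlines())
    preserved = sum(block.size for block in matcher.get_matching_blocks())
    return 1.0 - preserved / max(len(orig_lines), 1)

doc = "line 1\nline 2\nline 3\nline 4\n"
damaged = "line 1\nline TWO\nline 3\nline 4\n"
print(corruption_rate(doc, damaged))  # 0.25: one of four lines altered
```

Under a metric like this, the reported 25% average means roughly one in four lines of the original content is silently lost or altered by the end of a long delegated session.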

arXiv.org