Study: LLMs Silently Corrupt 25% of Documents in Delegated Workflows
Overview
A paper from Microsoft Research, LLMs Corrupt Your Documents When You Delegate, reveals a fundamental flaw in delegating long-form work to AI assistants. When users hand off complex document editing tasks to LLMs, the models silently introduce errors that accumulate over time.
The DELEGATE-52 Benchmark
Researchers created DELEGATE-52 to simulate extended delegated workflows across 52 professional domains including coding, crystallography, and music notation. Testing 19 LLMs revealed sobering results:
- Even frontier models (Gemini 3.1 Pro, Claude 4.6 Opus, GPT 5.4) corrupt an average of 25% of document content by the end of long workflows.
- Lower-tier models fail far more severely.
- Agentic tool use does not improve performance on DELEGATE-52.
- Degradation worsens with document size, interaction length, and the presence of distractor files.
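The paper does not spell out its scoring method here, but a "corruption rate" of the kind reported above can be sketched as the fraction of original lines that no longer survive intact after an editing session. The following is a hypothetical illustration using a line-level diff, not the benchmark's actual metric:

```python
import difflib

def corruption_rate(original: str, edited: str) -> float:
    """Fraction of the original document's lines that do not survive
    intact in the edited version (illustrative metric only)."""
    orig_lines = original.splitlines()
    edit_lines = edited.splitlines()
    if not orig_lines:
        return 0.0
    matcher = difflib.SequenceMatcher(a=orig_lines, b=edit_lines)
    # get_matching_blocks() returns runs of lines common to both versions
    preserved = sum(block.size for block in matcher.get_matching_blocks())
    return 1.0 - preserved / len(orig_lines)

doc = "alpha\nbeta\ngamma\ndelta"
after = "alpha\nbeta\ngamm\ndelta"     # one line silently altered
print(corruption_rate(doc, after))     # 0.25: 1 of 4 lines corrupted
```

A line-level diff is deliberately coarse: it flags a line as corrupted even if only one character changed, which matches the study's concern that small, silent edits still count as damage.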
Why This Matters
The danger is in the silence. LLMs confidently edit documents while introducing sparse but severe errors that compound over a long session. As the AI industry pushes toward agentic paradigms, this study suggests current LLMs are not ready to be trusted delegates. The authors propose DELEGATE-52 as a public benchmark to track AI readiness for delegated work.
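The compounding effect is easy to see with back-of-the-envelope arithmetic (this is an illustration, not a figure from the paper): if each editing turn silently alters an independent 1% of the document, roughly a quarter of the content is corrupted within about 29 turns.

```python
# Illustrative only: assumes each turn corrupts an independent 1%
# of whatever content is still intact.
per_turn_error = 0.01
intact = 1.0
turns = 0
while intact > 0.75:           # stop once ~25% of content is corrupted
    intact *= 1 - per_turn_error
    turns += 1
print(turns, round(1 - intact, 3))   # 29 0.253
```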
Related Articles
DeepSeek V4 Pro tied with GPT-5.2 on FoodTruck Bench, a 30-day agentic benchmark using 34 tools. It arrived roughly 10 weeks after GPT-5.2 was tested and ran at approximately 17x lower cost.
Fields Medalist Timothy Gowers: GPT-5.5 Pro Produced PhD-Level Math Proofs — Research Faces 'Crisis'
Fields Medal-winning mathematician Timothy Gowers tested ChatGPT 5.5 Pro on open math problems and found it produced PhD-level proofs in about an hour, warning that mathematical research faces an imminent 'crisis' at the current rate of AI progress.
Synthetic-data training poses a subtler safety problem than overtly bad examples: a Nature paper co-authored by Anthropic researchers reports that traits such as owl preference or misalignment can be transmitted through semantically unrelated number sequences.