Study: LLMs Silently Corrupt 25% of Documents in Delegated Workflows
Overview
A paper from Microsoft Research, LLMs Corrupt Your Documents When You Delegate, reveals a fundamental flaw in delegating long-form work to AI assistants. When users hand off complex document editing tasks to LLMs, the models silently introduce errors that accumulate over time.
The DELEGATE-52 Benchmark
Researchers created DELEGATE-52 to simulate extended delegated workflows across 52 professional domains including coding, crystallography, and music notation. Testing 19 LLMs revealed sobering results:
- Even frontier models (Gemini 3.1 Pro, Claude 4.6 Opus, GPT 5.4) corrupt an average of 25% of document content by the end of long workflows.
- Lower-tier models fail far more severely.
- Agentic tool use does not improve performance on DELEGATE-52.
- Degradation worsens with document size, interaction length, and the presence of distractor files.
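The paper does not spell out its scoring method here, but a "corruption rate" of the kind reported above can be sketched as the fraction of original lines that no longer survive intact after an editing session. The following is a hypothetical illustration using a line-level diff, not the benchmark's actual metric:

```python
import difflib

def corruption_rate(original: str, edited: str) -> float:
    """Fraction of the original document's lines that do not survive
    intact in the edited version (illustrative metric only)."""
    orig_lines = original.splitlines()
    edit_lines = edited.splitlines()
    if not orig_lines:
        return 0.0
    matcher = difflib.SequenceMatcher(a=orig_lines, b=edit_lines)
    # get_matching_blocks() returns runs of lines common to both versions
    preserved = sum(block.size for block in matcher.get_matching_blocks())
    return 1.0 - preserved / len(orig_lines)

doc = "alpha\nbeta\ngamma\ndelta"
after = "alpha\nbeta\ngamm\ndelta"     # one line silently altered
print(corruption_rate(doc, after))     # 0.25: 1 of 4 lines corrupted
```

A line-level diff is deliberately coarse: it flags a line as corrupted even if only one character changed, which matches the study's concern that small, silent edits still count as damage.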
Why This Matters
The danger is in the silence. LLMs confidently edit documents while introducing sparse but severe errors that compound over a long session. As the AI industry pushes toward agentic paradigms, this study suggests current LLMs are not ready to be trusted delegates. The authors propose DELEGATE-52 as a public benchmark to track AI readiness for delegated work.
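The compounding effect is easy to see with back-of-the-envelope arithmetic (this is an illustration, not a figure from the paper): if each editing turn silently alters an independent 1% of the document, roughly a quarter of the content is corrupted within about 29 turns.

```python
# Illustrative only: assumes each turn corrupts an independent 1%
# of whatever content is still intact.
per_turn_error = 0.01
intact = 1.0
turns = 0
while intact > 0.75:           # stop once ~25% of content is corrupted
    intact *= 1 - per_turn_error
    turns += 1
print(turns, round(1 - intact, 3))   # 29 0.253
```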
Related Articles
DeepSeek V4 Pro tied with GPT-5.2 on FoodTruck Bench, a 30-day agentic benchmark using 34 tools. It arrived roughly 10 weeks after GPT-5.2 was tested and ran at approximately 17x lower cost.
Fields Medalist Timothy Gowers: GPT-5.5 Pro Produced PhD-Level Math Proofs — Research Faces 'Crisis'
Fields Medal-winning mathematician Timothy Gowers tested ChatGPT 5.5 Pro on open math problems and found it produced PhD-level proofs in about an hour, warning that mathematical research faces an imminent 'crisis' at the current rate of AI progress.
Synthetic-data training poses a subtler safety problem than overtly bad examples: a Nature paper co-authored by Anthropic researchers reports that traits such as owl preference or misalignment can be transmitted through semantically unrelated number sequences.