Study: LLMs Silently Corrupt 25% of Documents in Delegated Workflows

May 10, 2026 · By Insights AI · 1 min read

Overview

A paper from Microsoft Research, LLMs Corrupt Your Documents When You Delegate, reveals a fundamental flaw in delegating long-form work to AI assistants. When users hand off complex document editing tasks to LLMs, the models silently introduce errors that accumulate over time.

The DELEGATE-52 Benchmark

Researchers created DELEGATE-52 to simulate extended delegated workflows across 52 professional domains including coding, crystallography, and music notation. Testing 19 LLMs revealed sobering results:

  • Even frontier models (Gemini 3.1 Pro, Claude 4.6 Opus, GPT 5.4) corrupt an average of 25% of document content by the end of long workflows.
  • Lower-tier models fail far more severely.
  • Agentic tool use does not improve performance on DELEGATE-52.
  • Degradation worsens with document size, interaction length, and the presence of distractor files.
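The paper's exact scoring metric is not described in this summary, but the headline "25% of document content" figure can be illustrated with a simple line-level proxy: compare the reference document against the model-edited copy and report the fraction of original lines that no longer survive intact. This is a hedged sketch, not DELEGATE-52's actual evaluation code, and `corruption_rate` is a hypothetical helper.

```python
import difflib

def corruption_rate(reference: str, edited: str) -> float:
    """Fraction of reference lines changed or lost in the edited copy.

    A line-level proxy metric; the benchmark's real scoring method is
    not specified here, so this is illustrative only.
    """
    ref_lines = reference.splitlines()
    if not ref_lines:
        return 0.0
    matcher = difflib.SequenceMatcher(a=ref_lines, b=edited.splitlines())
    # Count reference lines that survive unchanged in the edited version.
    preserved = sum(block.size for block in matcher.get_matching_blocks())
    return 1.0 - preserved / len(ref_lines)

# Example: one of four lines silently altered -> 25% corruption.
original = "alpha\nbeta\ngamma\ndelta"
edited = "alpha\nbeta\ngamna\ndelta"
print(corruption_rate(original, edited))  # 0.25
```

A metric like this makes the study's "silent" failure mode concrete: the edited document still looks plausible line by line, yet a quarter of its content no longer matches the source.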

Why This Matters

The danger is in the silence. LLMs confidently edit documents while introducing sparse but severe errors that compound over a long session. As the AI industry pushes toward agentic paradigms, this study suggests current LLMs are not ready to be trusted delegates. The authors propose DELEGATE-52 as a public benchmark to track AI readiness for delegated work.
