Lightning OPD cuts reasoning-model post-training to 30 GPU hours

Lightning OPD is worth watching because it targets the bill that sits behind many reasoning-model papers: post-training infrastructure. Standard on-policy distillation keeps a live teacher inference server running while the student trains, which makes each experiment heavier than the loss function alone suggests. In an April 14 arXiv paper, Yecheng Wu, Song Han, and Hai Cai propose an offline version that removes that live teacher dependency.

The key idea is teacher consistency. The authors argue that the same teacher model must be used for both supervised fine-tuning and OPD. If that condition is broken, they show that gradient bias appears and can push both online and offline OPD toward a suboptimal fixed point. Lightning OPD precomputes teacher log-probabilities over SFT rollouts while preserving that consistency, so the training run no longer needs an active teacher server.
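As a rough illustration of that data flow, the sketch below runs a teacher once over fixed SFT rollouts, caches its per-token log-probabilities, and then reuses the cache across training steps. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the record layout, and the per-token log-ratio signal (teacher log-probability minus student log-probability) are all hypothetical stand-ins, and the toy scorers return constant probabilities in place of real models.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical interface: a scorer maps (prompt, response tokens) to per-token log-probs.
LogProbFn = Callable[[str, Sequence[str]], List[float]]


def precompute_teacher_logprobs(
    rollouts: Sequence[Tuple[str, Sequence[str]]],
    teacher_logprob_fn: LogProbFn,
) -> List[Dict]:
    """Score each fixed SFT rollout with the teacher once and cache the result."""
    cache = []
    for prompt, tokens in rollouts:
        cache.append({
            "prompt": prompt,
            "tokens": list(tokens),
            "teacher_logprobs": teacher_logprob_fn(prompt, tokens),
        })
    return cache


def offline_log_ratios(record: Dict, student_logprob_fn: LogProbFn) -> List[float]:
    """Per-token log p_teacher - log p_student on cached tokens (assumed signal)."""
    student_lp = student_logprob_fn(record["prompt"], record["tokens"])
    return [t - s for t, s in zip(record["teacher_logprobs"], student_lp)]


# Toy stand-ins for real models; the counter tracks how often the teacher is queried.
calls = {"teacher": 0}


def toy_teacher(prompt: str, tokens: Sequence[str]) -> List[float]:
    calls["teacher"] += 1
    return [math.log(0.5)] * len(tokens)


def toy_student(prompt: str, tokens: Sequence[str]) -> List[float]:
    return [math.log(0.25)] * len(tokens)


rollouts = [("2+2=", ["4"]), ("3*3=", ["9", "."])]
cache = precompute_teacher_logprobs(rollouts, toy_teacher)  # teacher runs here only

# Three "training steps" read from the cache; the teacher is never called again.
for _ in range(3):
    signals = [offline_log_ratios(record, toy_student) for record in cache]
```

In this toy setup the teacher is queried once per rollout during precomputation and not at all inside the loop, which is the property that removes the live teacher server from the training run.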

The result that will draw attention is the efficiency claim. Starting from an SFT-initialized Qwen3-8B-Base model, Lightning OPD reaches 69.9% on AIME 2024 in 30 GPU hours. The paper reports a 4.0x speedup over standard OPD, with experiments spanning mathematical reasoning and code generation. It also argues that, under teacher consistency, Lightning OPD shares the same optimum as standard OPD while introducing a bounded gradient discrepancy.

If the finding holds up outside the paper’s setup, it changes who can afford to experiment with reasoning post-training. Smaller labs may not need to keep a high-end teacher model serving throughout every run, and open-model work could iterate faster on specialized domains. The next things to check are code availability, behavior on base models beyond Qwen3-8B-Base, and whether the 30-GPU-hour result scales cleanly to longer curricula. Source: arXiv:2604.13010.

The constraint is that offline OPD is not a generic shortcut on its own. The paper’s warning is close to being its main point: precomputing probabilities only works when the teacher choice is held fixed across SFT and OPD. That makes the method interesting for reproducible pipelines, because the saved teacher outputs become part of the training artifact.
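One way a pipeline could enforce that constraint is to store the teacher's identity alongside the cached log-probabilities and refuse to train when it differs from the teacher behind the SFT data. The sketch below is an assumption about how such a guard might look, not something the paper describes: the `TeacherLogprobArtifact` class, its fields, and the teacher identifiers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TeacherLogprobArtifact:
    """Hypothetical metadata saved next to the precomputed teacher log-probs."""
    sft_teacher_id: str      # teacher whose outputs were used for SFT
    scoring_teacher_id: str  # teacher that produced the cached log-probs
    num_records: int


def check_teacher_consistency(artifact: TeacherLogprobArtifact) -> TeacherLogprobArtifact:
    """Raise if the SFT teacher and the scoring teacher differ."""
    if artifact.sft_teacher_id != artifact.scoring_teacher_id:
        raise ValueError(
            "teacher mismatch: SFT used "
            f"{artifact.sft_teacher_id!r} but log-probs came from "
            f"{artifact.scoring_teacher_id!r}"
        )
    return artifact


# Placeholder identifiers, not real model names.
consistent = TeacherLogprobArtifact("teacher-A", "teacher-A", num_records=1000)
mismatched = TeacherLogprobArtifact("teacher-A", "teacher-B", num_records=1000)

check_teacher_consistency(consistent)  # passes silently

try:
    check_teacher_consistency(mismatched)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

Keeping the check next to the artifact means a cached file cannot be silently reused with SFT data from a different teacher.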
