Lightning OPD cuts reasoning-model post-training to 30 GPU hours
Original: Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
Lightning OPD is worth watching because it targets the cost that sits behind many reasoning-model papers: post-training infrastructure. Standard on-policy distillation (OPD) keeps a live teacher inference server running while the student trains, so each experiment costs more than the loss function alone suggests. In an April 14 arXiv paper, Yecheng Wu, Song Han, and Hai Cai propose an offline version that removes that live teacher dependency.
The key idea is teacher consistency. The authors argue that the same teacher model must be used for both supervised fine-tuning and OPD. If that condition is broken, they show that gradient bias appears and can push both online and offline OPD toward a suboptimal fixed point. Lightning OPD precomputes teacher log-probabilities over SFT rollouts while preserving that consistency, so the training run no longer needs an active teacher server.
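The offline idea can be illustrated with a toy sketch. This is not the paper's implementation: the bigram "models", the names make_toy_model, token_logprobs, and offline_gap, and the per-token log-ratio used as the distillation signal are all assumptions made for illustration. The only point is the data flow: the teacher scores a fixed set of rollouts once, the scores are saved, and the later training-side computation reads the cache instead of calling the teacher.

```python
import numpy as np

VOCAB = 8  # toy vocabulary size (hypothetical)

def log_softmax(logits):
    # Numerically stable log-softmax over the last axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def make_toy_model(seed):
    """Stand-in for a language model: next-token logits depend only on the previous token."""
    table = np.random.default_rng(seed).normal(size=(VOCAB, VOCAB))
    def logits_fn(tokens):
        prev = np.concatenate(([0], tokens[:-1]))  # token 0 doubles as begin-of-sequence
        return table[prev]                          # shape: (len(tokens), VOCAB)
    return logits_fn

def token_logprobs(logits_fn, tokens):
    """Log-probability the model assigns to each token that appears in the rollout."""
    logp = log_softmax(logits_fn(tokens))
    return logp[np.arange(len(tokens)), tokens]

def offline_gap(student_fn, rollouts, teacher_cache):
    """Mean of (student log-prob - cached teacher log-prob) over all rollout tokens.
    Assumed training signal for this sketch; the teacher is never called here."""
    gaps = [token_logprobs(student_fn, r) - t for r, t in zip(rollouts, teacher_cache)]
    return float(np.concatenate(gaps).mean())

# Fixed rollouts, standing in for the SFT rollouts the article mentions.
rng = np.random.default_rng(123)
rollouts = [rng.integers(0, VOCAB, size=n) for n in (5, 7, 4)]

# Offline pass: the teacher scores every rollout once, then can be shut down.
teacher = make_toy_model(seed=0)
teacher_cache = [token_logprobs(teacher, r) for r in rollouts]

# Training-side pass: only the student and the cached scores are needed.
student = make_toy_model(seed=1)
gap = offline_gap(student, rollouts, teacher_cache)
print(f"mean log-prob gap vs cached teacher: {gap:.4f}")
```

In a real pipeline the cache would be written to disk next to the rollouts, and it would only be valid for the teacher that produced it, which is where the teacher-consistency condition comes in.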
The result that will draw attention is the efficiency claim. Starting from an SFT-initialized Qwen3-8B-Base model, Lightning OPD reaches 69.9% on AIME 2024 in 30 GPU hours. The paper reports a 4.0x speedup over standard OPD, with experiments spanning mathematical reasoning and code generation. It also argues that, under teacher consistency, Lightning OPD shares the same optimum as standard OPD while introducing a bounded gradient discrepancy.
If the finding holds up outside the paper’s setup, it changes who can afford to experiment with reasoning post-training. Smaller labs may not need to keep a high-end teacher model serving throughout every run, and open-model work could iterate faster on specialized domains. The next things to check are code availability, behavior on base models beyond Qwen3-8B-Base, and whether the 30 GPU hours result scales cleanly to longer curricula. Source: arXiv:2604.13010.
The constraint is that offline OPD is not a generic shortcut by itself. The paper’s warning is central to the method: precomputing probabilities only works when the same teacher is used across SFT and OPD. That makes the method interesting for reproducible pipelines, because the saved teacher outputs become part of the training artifact.