Together Research releases Aurora for RL-based adaptive speculative decoding

What Together announced

On March 31, 2026, Together Research introduced Aurora, an open-source system for speculative decoding that is designed to keep adapting after deployment. The company framed the release around a practical production problem: draft models that speed up inference often go stale as traffic changes, while offline retraining is too slow to keep up.

That framing matters. Speculative decoding has become a standard optimization lever for large-model serving, but much of the ecosystem still treats the draft model as a static artifact trained offline. Together is arguing that the real challenge is not just building a better speculator once, but keeping it aligned with live workloads over time.

How Aurora works

Together’s blog describes Aurora as a serve-to-train flywheel powered by reinforcement learning. The system learns directly from live inference traces rather than waiting for a separate offline pipeline. In the company’s description, the inference server runs speculative decoding with a target model and a draft model, while an asynchronous training server uses accepted and rejected token proposals to improve the speculator and hot-swap updated weights back into service without interruption.
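The hot-swap step can be pictured with a small sketch. This is not Aurora's code: the class name `HotSwappableWeights`, its methods, and the dictionary-of-floats stand-in for model weights are all hypothetical, chosen only to illustrate how a serving process might pick up new speculator weights without pausing requests.

```python
import threading
from typing import Dict, Tuple


class HotSwappableWeights:
    """Hypothetical holder for draft-model weights shared by server and trainer."""

    def __init__(self, weights: Dict[str, float]) -> None:
        self._lock = threading.Lock()
        self._weights = dict(weights)
        self._version = 0

    def snapshot(self) -> Tuple[int, Dict[str, float]]:
        """Return the current (version, weights) pair for one request.

        Callers must treat the returned dict as read-only.
        """
        with self._lock:
            return self._version, self._weights

    def publish(self, new_weights: Dict[str, float]) -> int:
        """Swap in new weights; requests holding an old snapshot are unaffected."""
        fresh = dict(new_weights)  # copy outside the lock
        with self._lock:
            self._weights = fresh  # replace the reference, never mutate in place
            self._version += 1
            return self._version


handle = HotSwappableWeights({"w": 1.0})
old_version, old_weights = handle.snapshot()   # an in-flight request
handle.publish({"w": 2.0})                     # trainer pushes an update
new_version, new_weights = handle.snapshot()   # the next request
```

Because `publish` replaces the reference rather than editing the dictionary in place, a request that already took a snapshot finishes on the weights it started with, while later requests see the update.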

The paper says Aurora reframes online speculator learning as an asynchronous RL problem: accepted tokens become positive feedback, while rejected proposals provide implicit negative feedback. The system integrates an SGLang-based inference server with an asynchronous training server and supports hot-swapped updates during serving.
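To make the accepted/rejected signal concrete, here is a minimal sketch of greedy draft verification that records per-token feedback. It assumes exact-match (greedy) verification and a simple +1/-1 reward mapping. Both are simplifications for illustration; the names `FeedbackRecord`, `verify_draft`, and `to_rewards` are hypothetical, and none of this is Aurora's actual training objective.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class FeedbackRecord:
    """One draft proposal and whether the target model accepted it."""
    position: int
    draft_token: int
    accepted: bool


def verify_draft(
    draft_tokens: List[int], target_tokens: List[int]
) -> Tuple[List[int], List[FeedbackRecord]]:
    """Accept draft tokens until the first mismatch with the target model.

    target_tokens[i] is the token the target would emit at position i given
    the prefix plus draft_tokens[:i], so it has len(draft_tokens) + 1 entries.
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target_tokens must have one more entry than draft_tokens")
    emitted: List[int] = []
    feedback: List[FeedbackRecord] = []
    for i, tok in enumerate(draft_tokens):
        if tok == target_tokens[i]:
            emitted.append(tok)
            feedback.append(FeedbackRecord(i, tok, True))
        else:
            # First rejection: emit the target's token and stop; later
            # draft tokens were never verified, so they yield no feedback.
            feedback.append(FeedbackRecord(i, tok, False))
            emitted.append(target_tokens[i])
            return emitted, feedback
    # Every draft token was accepted, so the target's extra token is free.
    emitted.append(target_tokens[-1])
    return emitted, feedback


def to_rewards(feedback: List[FeedbackRecord]) -> List[float]:
    """Illustrative reward mapping (an assumption): +1 accepted, -1 rejected."""
    return [1.0 if record.accepted else -1.0 for record in feedback]


emitted, feedback = verify_draft([5, 7, 9], [5, 7, 8, 3])
rewards = to_rewards(feedback)
```

In this toy trace the first two proposals match and the third does not, so `emitted` is `[5, 7, 8]` and `rewards` is `[1.0, 1.0, -1.0]`. A training server could consume records like these asynchronously while serving continues.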

What the paper adds

The arXiv paper presents Aurora as a unified training-serving system rather than a standalone model-training method. The authors argue that decoupling speculator training from serving creates three production problems: high time-to-serve, delayed utility feedback, and performance degradation when traffic distributions shift. Their answer is day-0 deployment with online adaptation.
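A back-of-the-envelope sketch shows why a shift in traffic hurts a static speculator. It uses a common simplification from the speculative decoding literature, not a formula from the Aurora paper: each draft token is assumed to be accepted independently with probability `alpha`, and draft-model cost is ignored. The function name is hypothetical.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model pass with k draft tokens.

    Assumes i.i.d. per-token acceptance with probability alpha, giving the
    geometric sum 1 + alpha + ... + alpha**k.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


in_distribution = expected_tokens_per_step(0.8, 4)  # about 3.36
after_drift = expected_tokens_per_step(0.6, 4)      # about 2.31
```

Under these assumptions, a drop in acceptance rate from 0.8 to 0.6 cuts the tokens produced per target pass by roughly a third, which is the kind of degradation that online adaptation is meant to recover.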

On metrics, Together’s blog highlights a 1.25x additional speedup over a well-trained static speculator when traffic patterns change. The paper also reports a 1.5x day-0 speedup on recently released frontier models such as MiniMax M2.1 229B and Qwen3-Coder-Next 80B. The release bundle includes a blog post, an arXiv paper, and open-source code.

Why this matters

The practical significance is that inference optimization is starting to look like a continual-learning systems problem, not just a one-time model-compression problem. If speculative decoding quality depends on traffic mix, then vendors with strong serving telemetry and rapid update loops may gain an advantage over teams relying on slower offline retraining cycles.

The release suggests that Together wants to shift attention from benchmark-style speculation gains to production-adaptive serving. Whether Aurora’s gains hold across more models and infrastructure stacks still needs broader validation, but the release is notable because it combines a concrete systems claim, open-source code, and a paper that explicitly connects RL training with live deployment economics.

Sources: Together AI X post · Together Research blog · Aurora paper · Aurora code

Related Articles

LocalLLaMA Flags DFlash as an Open-Source Route to Faster Speculative Decoding
LLM Reddit Apr 7, 2026 2 min read

LocalLLaMA Tests DFlash on Apple Silicon and Reports 2x-3x Faster Qwen Inference
LLM Reddit Apr 11, 2026 2 min read

LocalLLaMA likes Luce DFlash because the 3090 speedup looks practical
LLM Reddit Apr 28, 2026 2 min read