Together Research releases Aurora for RL-based adaptive speculative decoding
Original: New from Together Research: Aurora. Speculative decoding that adapts to shifting traffic in real time, and keeps improving the longer it runs. Open-source, RL-based, 1.25x faster vs. a well-trained static speculator with no offline retraining pipeline.
What Together announced
On March 31, 2026, Together Research introduced Aurora, an open-source system for speculative decoding that is designed to keep adapting after deployment. The company framed the release around a practical production problem: draft models that speed up inference often go stale as traffic changes, while offline retraining is too slow to keep up.
That framing matters. Speculative decoding has become a standard optimization lever for large-model serving, but much of the ecosystem still treats the draft model as a static artifact trained offline. Together is arguing that the real challenge is not just building a better speculator once, but keeping it aligned with live workloads over time.
How Aurora works
Together’s blog describes Aurora as a serve-to-train flywheel powered by reinforcement learning. The system learns directly from live inference traces rather than waiting for a separate offline pipeline. In the company’s description, the inference server runs speculative decoding with a target model and a draft model, while an asynchronous training server uses accepted and rejected token proposals to improve the speculator and hot-swap updated weights back into service without interruption.
- The paper says Aurora reframes online speculator learning as an asynchronous RL problem.
- Accepted tokens become positive feedback, while rejected proposals provide implicit negative feedback.
- The system integrates an SGLang-based inference server with an asynchronous training server and supports hot-swapped updates during serving.
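The feedback loop described above can be illustrated with a small Python sketch. It is not Aurora's code or the SGLang API: the names `FeedbackBuffer`, `verify_draft`, `record_feedback`, and `HotSwappableDraft` are hypothetical, and the acceptance test is the standard speculative-sampling rule (accept a drafted token with probability min(1, p/q)), assumed here for illustration only.

```python
import random
import threading
from dataclasses import dataclass, field


@dataclass
class FeedbackBuffer:
    """Hypothetical store of training signals collected while serving."""
    accepted: list = field(default_factory=list)  # positive feedback
    rejected: list = field(default_factory=list)  # implicit negative feedback


def verify_draft(draft_tokens, draft_probs, target_probs, rng):
    """Return how many drafted tokens the target model accepts.

    draft_probs[i] and target_probs[i] are the probabilities the draft and
    target models assigned to draft_tokens[i]. Each token is accepted with
    probability min(1, p / q); verification stops at the first rejection.
    """
    for i, (q, p) in enumerate(zip(draft_probs, target_probs)):
        if rng.random() >= min(1.0, p / q):
            return i
    return len(draft_tokens)


def record_feedback(buffer, context, draft_tokens, n_accepted):
    """Turn one verification result into (prefix, token) training examples."""
    for i in range(n_accepted):
        prefix = tuple(context) + tuple(draft_tokens[:i])
        buffer.accepted.append((prefix, draft_tokens[i]))
    # Only the first rejected token was actually judged by the target model;
    # later drafted tokens were never verified and carry no signal.
    if n_accepted < len(draft_tokens):
        prefix = tuple(context) + tuple(draft_tokens[:n_accepted])
        buffer.rejected.append((prefix, draft_tokens[n_accepted]))


class HotSwappableDraft:
    """Hypothetical holder that lets a trainer replace draft weights in place."""

    def __init__(self, weights):
        self._weights = weights
        self._lock = threading.Lock()
        self.version = 0

    def current(self):
        with self._lock:
            return self._weights

    def swap(self, new_weights):
        # The serving loop keeps calling current(); it simply sees the new
        # weights on its next drafted block.
        with self._lock:
            self._weights = new_weights
            self.version += 1


if __name__ == "__main__":
    rng = random.Random(0)
    buffer = FeedbackBuffer()
    draft = [5, 6, 7]
    n = verify_draft(draft, [0.5, 0.5, 0.5], [0.9, 0.9, 0.0], rng)
    record_feedback(buffer, [1, 2], draft, n)
    print(n, len(buffer.accepted), len(buffer.rejected))
```

In this toy setup the serving side only appends to the buffer, and a separate trainer would read from it and call `swap` when new weights are ready, which mirrors the asynchronous split the release describes without claiming to reproduce it.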
What the paper adds
The arXiv paper presents Aurora as a unified training-serving system rather than a standalone model-training method. The authors argue that decoupling speculator training from serving creates three production problems: high time-to-serve, delayed utility feedback, and performance degradation when traffic distributions shift. Their answer is day-0 deployment with online adaptation.
On metrics, Together’s blog highlights a 1.25x additional speedup over a well-trained static speculator when traffic patterns change. The paper also reports a 1.5x day-0 speedup on recently released frontier models such as MiniMax M2.1 229B and Qwen3-Coder-Next 80B. The release bundle includes a blog post, an arXiv paper, and open-source code.
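To see why a shift in traffic can erode speculative-decoding gains, it helps to look at the usual simplified model from the speculative decoding literature: if each drafted token is accepted independently with probability alpha and the draft proposes gamma tokens per step, the expected number of tokens emitted per target-model pass is (1 - alpha^(gamma+1)) / (1 - alpha). The sketch below computes that quantity. The function name and the example acceptance rates are illustrative assumptions, not figures from the Aurora paper, and the model ignores the cost of running the draft model itself.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens per target-model pass under i.i.d. acceptance.

    alpha: assumed per-token acceptance probability (0 <= alpha <= 1).
    gamma: number of tokens the draft model proposes per step.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Every drafted token is accepted, plus one token from the target.
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    # Hypothetical acceptance rates before and after a traffic shift.
    fresh = expected_tokens_per_step(alpha=0.8, gamma=4)
    stale = expected_tokens_per_step(alpha=0.6, gamma=4)
    print(round(fresh, 4), round(stale, 4))
```

Under these made-up numbers, a drop in acceptance rate from 0.8 to 0.6 cuts the expected tokens per target pass by roughly a third, which is the kind of degradation an online-adapting speculator is meant to recover.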
Why this matters
The practical significance is that inference optimization is starting to look like a continual-learning systems problem, not just a one-time model-compression problem. If speculative decoding quality depends on traffic mix, then vendors with strong serving telemetry and rapid update loops may gain an advantage over teams relying on slower offline retraining cycles.
One reading of the release is that Together wants to shift attention from benchmark-style speculation gains to production-adaptive serving. Whether Aurora’s gains hold across more models and infrastructure stacks still needs broader validation. The release is notable all the same because it combines a concrete systems claim, open-source code, and a paper that explicitly connects RL training with live deployment economics.
Sources: Together AI X post · Together Research blog · Aurora paper · Aurora code