Together Research releases Aurora for RL-based adaptive speculative decoding
Original: New from Together Research: Aurora. Speculative decoding that adapts to shifting traffic in real time — and keeps improving the longer it runs. Open-source, RL-based, 1.25x faster vs. a well-trained static speculator with no offline retraining pipeline. Thread 🧵
What Together announced
On March 31, 2026, Together Research introduced Aurora, an open-source system for speculative decoding that is designed to keep adapting after deployment. The company framed the release around a practical production problem: draft models that speed up inference often go stale as traffic changes, while offline retraining is too slow to keep up.
That framing matters. Speculative decoding has become a standard optimization lever for large-model serving, but much of the ecosystem still treats the draft model as a static artifact trained offline. Together is arguing that the real challenge is not just building a better speculator once, but keeping it aligned with live workloads over time.
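For context, the speculator's job is to propose cheap draft tokens that the target model then verifies. A minimal sketch of the standard verification rule (accept each drafted token with probability min(1, p_target / p_draft)) looks like this; function and variable names are illustrative, and real serving stacks run this batched on GPU rather than token by token:

```python
import random

def verify_draft(draft_probs, target_probs, proposed):
    """Standard speculative-decoding verification: accept each drafted
    token with probability min(1, p_target / p_draft), stopping at the
    first rejection. Toy single-sequence sketch, not a serving kernel."""
    accepted = []
    for tok, p_d, p_t in zip(proposed, draft_probs, target_probs):
        if random.random() < min(1.0, p_t / p_d):
            accepted.append(tok)
        else:
            # Rejected: the target resamples this position from the
            # residual distribution, and the remaining drafts are dropped.
            break
    return accepted
```

The key property is that the acceptance rate, and therefore the end-to-end speedup, depends on how well the draft model's distribution matches the target's on the live traffic — which is exactly what drifts when the workload shifts.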
How Aurora works
Together’s blog describes Aurora as a serve-to-train flywheel powered by reinforcement learning. The system learns directly from live inference traces rather than waiting for a separate offline pipeline. In the company’s description, the inference server runs speculative decoding with a target model and a draft model, while an asynchronous training server uses accepted and rejected token proposals to improve the speculator and hot-swap updated weights back into service without interruption.
- The paper says Aurora reframes online speculator learning as an asynchronous RL problem.
- Accepted tokens become positive feedback, while rejected proposals provide implicit negative feedback.
- The system integrates an SGLang-based inference server with an asynchronous training server and supports hot-swapped updates during serving.
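The feedback loop described above can be sketched as a trace-to-reward conversion: each verification step yields accepted tokens (positive signal) and at most one rejected proposal (negative signal). The field names and reward values below are illustrative, not Aurora's actual schema:

```python
def traces_to_rewards(traces):
    """Convert speculative-decoding serving traces into RL-style training
    examples, per the paper's framing: accepted proposals become positive
    feedback, the rejected proposal becomes implicit negative feedback.
    Schema and reward magnitudes are assumptions for illustration."""
    examples = []
    for trace in traces:
        for tok in trace["accepted"]:
            examples.append((trace["prefix"], tok, +1.0))
        if trace.get("rejected") is not None:
            examples.append((trace["prefix"], trace["rejected"], -1.0))
    return examples
```

Because these examples come from live requests, the speculator is always being pushed toward the distribution it is actually serving, with no separate data-collection pipeline.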
What the paper adds
The arXiv paper presents Aurora as a unified training-serving system rather than a standalone model-training method. The authors argue that decoupling speculator training from serving creates three production problems: high time-to-serve, delayed utility feedback, and performance degradation when traffic distributions shift. Their answer is day-0 deployment with online adaptation.
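The hot-swap half of that loop can be sketched as a versioned weight holder that serving threads read from while the trainer publishes updates. This is a minimal threading sketch under the assumption that the trainer pushes whole weight snapshots; Aurora's actual SGLang integration is more involved:

```python
import threading

class HotSwappableSpeculator:
    """Illustrative sketch of hot-swapping draft-model weights during
    serving: the async training server publishes new snapshots, and
    serving threads read a consistent (weights, version) pair."""

    def __init__(self, weights):
        self._weights = weights
        self._version = 0
        self._lock = threading.Lock()

    def swap(self, new_weights):
        # Called by the training server when an improved speculator is ready;
        # serving continues uninterrupted on the previous snapshot meanwhile.
        with self._lock:
            self._weights = new_weights
            self._version += 1

    def snapshot(self):
        # Serving threads call this per request (or per batch) so an
        # in-flight request never sees a half-updated speculator.
        with self._lock:
            return self._weights, self._version
```

Versioning the snapshot also makes it cheap to log which speculator generation produced each trace, which the training side needs to attribute accept/reject feedback correctly.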
On metrics, Together’s blog highlights a 1.25x additional speedup over a well-trained static speculator when traffic patterns change. The paper also reports a 1.5x day-0 speedup on recently released frontier models such as MiniMax M2.1 229B and Qwen3-Coder-Next 80B. The release bundle includes a blog post, an arXiv paper, and open-source code.
Why this matters
The practical significance is that inference optimization is starting to look like a continual-learning systems problem, not just a one-time model-compression problem. If speculative decoding quality depends on traffic mix, then vendors with strong serving telemetry and rapid update loops may gain an advantage over teams relying on slower offline retraining cycles.
The release suggests Together wants to move the center of gravity from benchmark-style speculation gains to production-adaptive serving. Whether Aurora’s gains hold across more models and infrastructure stacks will need broader validation, but the release is high-signal because it combines a concrete systems claim, open-source code, and a paper that explicitly connects RL training with live deployment economics.
Sources: Together AI X post · Together Research blog · Aurora paper · Aurora code
Related Articles
ngrok’s March 25, 2026 explainer lays out how quantization can make LLMs roughly 4x smaller and 2x faster, and what the real 4-bit versus 8-bit tradeoff looks like. Hacker News drove the post to 247 points and 46 comments, reopening the discussion around memory bottlenecks and the economics of local inference.
A LocalLLaMA thread about Intel’s Arc Pro B70 and B65 reached 213 upvotes and 133 comments. Intel says the B70 is available from March 25, 2026 with a suggested starting price of $949, while the B65 follows in mid-April.
Google Research introduced TurboQuant on March 24, 2026 as a compression approach for KV cache and vector search bottlenecks. Hacker News pushed the post to 491 points and 129 comments, reflecting how central memory efficiency has become for long-context inference.