Together Research releases Aurora for RL-based adaptive speculative decoding

Original: New from Together Research: Aurora. Speculative decoding that adapts to shifting traffic in real time — and keeps improving the longer it runs. Open-source, RL-based, 1.25x faster vs. a well-trained static speculator with no offline retraining pipeline.

LLM · Apr 1, 2026 · By Insights AI · 2 min read

What Together announced

On March 31, 2026, Together Research introduced Aurora, an open-source system for speculative decoding that is designed to keep adapting after deployment. The company framed the release around a practical production problem: draft models that speed up inference often go stale as traffic changes, while offline retraining is too slow to keep up.

That framing matters. Speculative decoding has become a standard optimization lever for large-model serving, but much of the ecosystem still treats the draft model as a static artifact trained offline. Together is arguing that the real challenge is not just building a better speculator once, but keeping it aligned with live workloads over time.
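As background, the core speculative-decoding loop can be sketched with toy models. Everything below is illustrative, not Aurora's implementation: the "models" are simple deterministic functions, and the draft is wired to disagree with the target at every fourth position. The key property the sketch preserves is that output matches target-only decoding, while several tokens can be accepted per verification pass.

```python
def target_model(ctx):
    """Toy stand-in for the large target model: context -> next token id."""
    return (sum(ctx) * 7 + 3) % 10

def draft_model(ctx):
    """Toy draft model: agrees with the target except every 4th position."""
    t = target_model(ctx)
    return t if len(ctx) % 4 else (t + 1) % 10

def speculative_step(context, k=4):
    """Draft proposes k tokens; the target verifies them left to right.
    The first disagreement truncates the speculation and the target's
    own token is emitted instead, so output is identical to decoding
    with the target alone."""
    proposals = []
    ctx = list(context)
    for _ in range(k):
        t = draft_model(ctx)
        proposals.append(t)
        ctx.append(t)

    accepted = []
    ctx = list(context)
    for t in proposals:
        expected = target_model(ctx)
        if expected == t:
            accepted.append(t)        # draft token verified
            ctx.append(t)
        else:
            accepted.append(expected)  # fall back to the target's token
            break
    return accepted, len(accepted)

print(speculative_step([1, 2, 3], k=4))  # → ([5, 0], 2)
```

The acceptance rate of the draft against live traffic is exactly the quantity that degrades when the traffic distribution shifts, which is the staleness problem Together is targeting.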

How Aurora works

Together’s blog describes Aurora as a serve-to-train flywheel powered by reinforcement learning. The system learns directly from live inference traces rather than waiting for a separate offline pipeline. In the company’s description, the inference server runs speculative decoding with a target model and a draft model. An asynchronous training server then uses accepted and rejected token proposals to improve the speculator, hot-swapping updated weights back into service without interruption.

  • The paper says Aurora reframes online speculator learning as an asynchronous RL problem.
  • Accepted tokens become positive feedback, while rejected proposals provide implicit negative feedback.
  • The system integrates an SGLang-based inference server with an asynchronous training server and supports hot-swapped updates during serving.
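The flywheel described above can be sketched in miniature. All identifiers here are hypothetical (none come from the Aurora paper or code), and a scalar nudge stands in for a real RL update: the serving side records each proposal's accept/reject outcome as a signed reward, a background trainer consumes batches of traces, and every update bumps a weights version that serving can pick up without pausing.

```python
import queue
import threading

trace_queue = queue.Queue()          # serving -> training handoff
weights_lock = threading.Lock()
speculator_weights = {"version": 0, "bias": 0.0}

def record_trace(context, proposal, accepted):
    """Called by the serving side after each verification pass.
    Accepted tokens -> reward +1; rejected proposals -> reward -1."""
    reward = 1.0 if accepted else -1.0
    trace_queue.put((context, proposal, reward))

def training_step(batch):
    """Stand-in for an RL update: nudge a scalar by the mean reward,
    then bump the version so serving sees fresh weights (the hot-swap)."""
    mean_reward = sum(r for _, _, r in batch) / len(batch)
    with weights_lock:
        speculator_weights["bias"] += 0.1 * mean_reward
        speculator_weights["version"] += 1

def trainer_loop(batch_size=4, max_batches=2):
    for _ in range(max_batches):
        batch = [trace_queue.get() for _ in range(batch_size)]
        training_step(batch)

# The serving thread would emit traces continuously; here we enqueue a few.
for i in range(8):
    record_trace(context=[i], proposal=i + 1, accepted=(i % 2 == 0))

t = threading.Thread(target=trainer_loop)
t.start()
t.join()
print(speculator_weights["version"])  # prints 2
```

The point of the asynchrony is that training never sits on the serving critical path: proposals keep flowing while updates land whenever a batch completes.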

What the paper adds

The arXiv paper presents Aurora as a unified training-serving system rather than a standalone model-training method. The authors argue that decoupling speculator training from serving creates three production problems: high time-to-serve, delayed utility feedback, and performance degradation when traffic distributions shift. Their answer is day-0 deployment with online adaptation.

On metrics, Together’s blog highlights a 1.25x additional speedup over a well-trained static speculator when traffic patterns change, and the paper reports a 1.5x day-0 speedup on recently released frontier models such as MiniMax M2.1 229B and Qwen3-Coder-Next 80B. The release bundle includes a blog post, an arXiv paper, and open-source code.

Why this matters

The practical significance is that inference optimization is starting to look like a continual-learning systems problem, not just a one-time model-compression problem. If speculative decoding quality depends on traffic mix, then vendors with strong serving telemetry and rapid update loops may gain an advantage over teams relying on slower offline retraining cycles.

An inference from the release is that Together wants to move the center of gravity from benchmark-style speculation gains to production-adaptive serving. Whether Aurora’s gains hold across more models and infrastructure stacks will still need broader validation, but the release is high-signal because it combines a concrete systems claim, open-source code, and a paper that explicitly connects RL training with live deployment economics.

Sources: Together AI X post · Together Research blog · Aurora paper · Aurora code



