Investors just placed another billion-dollar bet on an AI path that tries to move beyond human-written data. David Silver’s new lab, Ineffable Intelligence, raised $1.1 billion to pursue reinforcement-learning systems it calls “superlearners.”
#reinforcement-learning
This matters because one of reinforcement learning’s best-known researchers has struck out on his own with one of Europe’s biggest seed rounds instead of another incremental model demo. Reuters says Ineffable opened with $1.1 billion at a $5.1 billion valuation, while the company frames the mission as building “superlearners” from experience rather than human data.
Why it matters: post-training agents increasingly depend on reinforcement learning throughput, not only inference speed. NVIDIA says NeMo RL’s FP8 path speeds RL workloads by 1.48x on Qwen3-8B-Base while tracking BF16 accuracy.
RAD-2 reframes diffusion-based driving planners as a generator-discriminator system, then adds reinforcement learning feedback where imitation-only training is weakest. The headline number is a 56% collision-rate drop versus strong diffusion planners, plus reported real-world deployment in complex urban traffic.
Cursor said on March 26, 2026 that real-time reinforcement learning lets it ship improved Composer 2 checkpoints every five hours. Cursor’s March 27 technical report says the model combines continued pretraining on Kimi K2.5 with large-scale RL in realistic Cursor sessions, scores 61.3 on CursorBench, and runs on an asynchronous multi-region RL stack with large sandbox fleets.
Together Research said on March 31, 2026 that Aurora is an open-source framework for adaptive speculative decoding that learns from live inference traces and updates the speculator asynchronously without interrupting serving. Together’s blog and paper say Aurora reframes the problem as asynchronous RL and can deliver 1.25x additional speedup over a strong static speculator as traffic shifts.
A March 28 essay on the Hamilton-Jacobi-Bellman equation drew Hacker News attention by showing how continuous-time control theory connects reinforcement learning, optimal control, and diffusion models.
A March 29 r/singularity thread amplified Cursor's claim that Composer checkpoints can now be trained from live user interactions and shipped every five hours, with reward-hacking fixes treated as part of the story rather than an afterthought.
A March 15, 2026 r/singularity post with 3,150 points and 376 comments pushed attention toward LATENT, a humanoid tennis system trained from five hours of imperfect human motion fragments instead of full match-grade capture.