David Silver, creator of AlphaGo and AlphaZero, has raised a record $1.1 billion seed round for Ineffable Intelligence — the largest ever in Europe. The startup aims to build superintelligence using reinforcement learning alone, with no human-generated data.
#reinforcement-learning
r/singularity upvoted the round less because of venture spectacle and more because David Silver’s name still means AlphaZero-era reinforcement learning. The discussion centered on whether a “superlearner” trained without human data could become a genuinely different path from today’s web-trained LLM stack.
Investors just placed another billion-dollar bet on an AI path that tries to move beyond human-written data. David Silver’s new lab, Ineffable Intelligence, raised $1.1 billion to pursue reinforcement-learning systems it calls “superlearners.”
This is material because one of reinforcement learning’s best-known researchers has launched with one of Europe’s biggest seed rounds instead of another incremental model demo. Reuters reports that Ineffable opened with $1.1 billion at a $5.1 billion valuation, while the company frames the mission as building “superlearners” from experience rather than human data.
Why it matters: post-training agents increasingly depend on reinforcement-learning throughput, not only inference speed. NVIDIA says NeMo RL’s FP8 path delivers a 1.48x speedup on RL workloads with Qwen3-8B-Base while tracking BF16 accuracy.
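NeMo RL’s actual FP8 kernels are not shown here, but the core idea behind the speedup claim, squeezing tensors into the FP8 (E4M3) range with per-tensor scaling and a 4-bit significand, can be sketched in plain NumPy. This is an illustrative round-trip only, not NVIDIA’s implementation:

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value representable in E4M3

def fp8_e4m3_roundtrip(x):
    """Simulate a per-tensor FP8 (E4M3) quantize/dequantize:
    scale into the FP8 range, round to a 4-bit significand
    (1 implicit + 3 explicit mantissa bits), then scale back."""
    scale = FP8_E4M3_MAX / max(float(np.abs(x).max()), 1e-12)
    y = np.clip(x * scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    m, e = np.frexp(y)          # y = m * 2**e with 0.5 <= |m| < 1
    m = np.round(m * 16) / 16   # keep 4 significand bits
    return np.ldexp(m, e) / scale

x = np.random.default_rng(0).standard_normal(1024).astype(np.float32)
xq = fp8_e4m3_roundtrip(x)
max_rel_err = float(np.abs(x - xq).max() / np.abs(x).max())
```

The point of the sketch is that the precision loss is bounded (a few percent relative error here), which is why an FP8 path can track BF16 accuracy while moving half the bytes.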
RAD-2 reframes diffusion-based driving planners as a generator-discriminator system, then adds reinforcement learning feedback where imitation-only training is weakest. The headline number is a 56% collision-rate drop versus strong diffusion planners, plus reported real-world deployment in complex urban traffic.
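RAD-2’s exact recipe is not spelled out in this blurb, but the pattern it gestures at, a generator that also receives RL reward from a discriminator where imitation alone is weak, is the classic adversarial-imitation setup. A toy one-dimensional sketch, in which every piece (the fixed analytic discriminator, the Gaussian policy, the REINFORCE update) is an assumption for illustration and not RAD-2’s method:

```python
import numpy as np

def adversarial_reward(d_prob, eps=1e-3):
    """GAIL-style reward: high when the discriminator believes the
    planner's action looks expert-like (d_prob near 1)."""
    return -np.log(1.0 - np.clip(d_prob, eps, 1.0 - eps))

def discriminator(a):
    """Toy fixed discriminator: expert actions center on 0, so
    actions near 0 score as more expert-like."""
    return 1.0 / (1.0 + a**2)

rng = np.random.default_rng(0)
mu = 2.0                      # planner's action mean, starts off-expert
for _ in range(500):
    a = mu + rng.standard_normal(256)          # sample actions
    r = adversarial_reward(discriminator(a))   # RL feedback signal
    mu += 0.1 * np.mean(r * (a - mu))          # REINFORCE update
```

The policy mean drifts toward the expert’s behavior because the discriminator-derived reward keeps supplying gradient signal even in states where no imitation target exists, which is the failure mode the RAD-2 blurb says RL feedback is patching.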
Cursor said on March 26, 2026 that real-time reinforcement learning lets it ship improved Composer 2 checkpoints every five hours. Cursor’s March 27 technical report says the model combines continued pretraining on Kimi K2.5 with large-scale RL in realistic Cursor sessions, scores 61.3 on CursorBench, and runs on an asynchronous multi-region RL stack with large sandbox fleets.
Together Research said on March 31, 2026 that Aurora is an open-source framework for adaptive speculative decoding that learns from live inference traces and updates the speculator asynchronously without interrupting serving. Together’s blog and paper say Aurora reframes the problem as asynchronous RL and can deliver 1.25x additional speedup over a strong static speculator as traffic shifts.
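Aurora’s contribution is adapting the speculator online with asynchronous RL; the draft-and-verify loop it accelerates is standard. A toy greedy-acceptance sketch with made-up deterministic token models (illustrative only, and omitting the real method’s batched verification and bonus token):

```python
def generate_greedy(model, prompt, n):
    """Plain autoregressive greedy decoding: n tokens, one at a time."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(model(seq))
    return seq

def speculative_generate(target, draft, prompt, n, k=4):
    """Draft-and-verify decoding: a cheap draft model proposes up to k
    tokens; the target accepts while it agrees and substitutes its own
    token at the first mismatch. Output matches pure greedy decoding."""
    seq = list(prompt)
    while len(seq) < len(prompt) + n:
        proposal, ctx = [], list(seq)
        for _ in range(k):
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)
        for t in proposal:
            want = target(seq)   # target's greedy choice at this position
            seq.append(want)
            if want != t:        # reject the rest of the draft
                break
    return seq[:len(prompt) + n]
```

Because rejected drafts are replaced by the target’s own token, the loop is lossless; the speedup depends entirely on how often the draft agrees with the target, which is exactly the acceptance rate Aurora’s asynchronous updates chase as traffic shifts.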
A March 28 essay on the Hamilton-Jacobi-Bellman equation drew Hacker News attention by showing how continuous-time control theory connects reinforcement learning, optimal control, and diffusion models.
A March 2026 Hacker News thread with 120 points and 33 comments surfaced a deep technical explainer on the Hamilton-Jacobi-Bellman equation. The post argues that continuous-time reinforcement learning and diffusion models can be understood through the same control-theory structure rather than as separate ML tricks.
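For reference, one standard form of the equation at the center of that essay, the discounted infinite-horizon HJB for deterministic dynamics (notation assumed here, not taken from the post):

```latex
% Hamilton-Jacobi-Bellman equation, discounted infinite horizon,
% for dynamics \dot{x} = f(x, u), reward r(x, u), discount rate \rho:
\rho \, V(x) = \max_{u} \Big[ \, r(x, u) + \nabla V(x) \cdot f(x, u) \, \Big]
```

Discretizing time recovers the familiar Bellman equation of RL, and one standard bridge to diffusion models is the stochastic version: with noise $dx = f\,dt + \sigma\,dW$, a second-order term $\tfrac{\sigma^2}{2}\,\Delta V(x)$ appears on the right-hand side, which is the kind of shared control-theory structure the post describes.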
Cursor has published the Composer 2 technical report, outlining its code-focused continued pretraining, large-scale reinforcement learning pipeline, and CursorBench-led evaluation strategy. The report offers an unusually detailed first-party look at how a production coding agent is trained and measured.
A March 29 r/singularity thread amplified Cursor's claim that Composer checkpoints can now be trained from live user interactions and shipped every five hours, with reward-hacking fixes treated as part of the story rather than an afterthought.