Sakana Fugu Opens Beta With 54.2 SWE-Pro and OpenAI-Style API
Sakana AI is moving multi-agent orchestration out of the lab demo phase and into a commercial API, which matters because most teams still wire model routing together by hand. In a new X post, the Tokyo lab says Sakana Fugu is entering beta as a system that can choose and coordinate frontier models automatically instead of forcing developers to manage separate providers, API keys, and brittle prompt logic.
“We’re launching the beta for our new commercial AI product: Sakana Fugu, a multi-agent orchestration system,” the team wrote on X, adding that Fugu hit SOTA on SWE-Pro, GPQA-D, and ALE-Bench.
The linked official blog post provides the harder numbers. Sakana says fugu-ultra reaches 95.1 on GPQA-D, 93.2 on LCBv6, and 54.2 on SWE-Pro. In the same table, Gemini 3.1 high scores 94.4 on GPQA-D and GPT 5.4 high scores 51.2 on SWE-Pro, while Anthropic’s cited Opus 4.6 max score on SWE-Pro is 53.4. Sakana is also pitching the product as easy to slot into existing stacks: the beta uses OpenAI-format endpoints and comes in two modes, fugu-mini for lower latency and fugu-ultra for heavier reasoning work.
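Because the endpoints follow the OpenAI format, existing SDKs should work with little more than a config change. Here is a minimal sketch using the OpenAI Python client; the base URL is a placeholder (the post does not publish the actual beta endpoint), while the fugu-mini and fugu-ultra model names come from the announcement:

```python
# Minimal sketch: calling Fugu through the OpenAI Python SDK.
# The base_url is hypothetical -- Sakana has not published the beta
# endpoint. Model names (fugu-mini, fugu-ultra) are from the blog post.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.sakana.ai/v1",  # placeholder endpoint
    api_key="YOUR_FUGU_BETA_KEY",
)

# fugu-mini targets lower latency; fugu-ultra targets heavier reasoning.
response = client.chat.completions.create(
    model="fugu-ultra",
    messages=[
        {"role": "user", "content": "Diagnose this failing test and propose a patch."}
    ],
)
print(response.choices[0].message.content)
```

If this drop-in compatibility holds, teams could A/B Fugu against their current provider by swapping the base URL and model name, which is presumably the adoption path Sakana is counting on.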
The Sakana AI account usually uses X to turn its research agenda into concrete product or benchmark claims, and this post fits that pattern. The company has spent the last year arguing that the most capable systems will be coordinated collections of models rather than one giant endpoint. The Fugu release ties the product directly to two ICLR 2026 papers, Trinity and Conductor, which frame orchestration itself as something a small controller model can learn. One notable detail from the blog: Sakana says Fugu can recursively call itself, turning orchestration depth into a test-time compute dial instead of a fixed workflow.
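The blog does not describe the recursion mechanism, but the shape of the idea is easy to sketch: a controller that either answers a task directly or decomposes it and recurses with a smaller budget, so depth becomes the compute knob. The sketch below is a conceptual illustration only, not Sakana's implementation; the call_model stub and the decomposition protocol are invented for the example:

```python
# Conceptual sketch of recursion-as-a-compute-dial. Not Sakana's code:
# call_model() is a stand-in that echoes its input so the sketch runs.
def call_model(model: str, prompt: str) -> str:
    # In a real system this would be a provider API call.
    return f"[{model}] {prompt[:60]}"

def orchestrate(task: str, depth: int) -> str:
    """Answer directly, or split the task and recurse with a smaller budget."""
    if depth == 0:
        # Budget exhausted: answer in one shot with the fast model.
        return call_model("fugu-mini", task)
    # A controller decides whether the task is worth decomposing.
    plan = call_model("controller", f"Split into subtasks or reply ATOMIC: {task}")
    if plan.strip().endswith("ATOMIC"):
        return call_model("fugu-ultra", task)
    # Each subtask gets a shallower budget; results are merged at the end.
    partials = [orchestrate(line, depth - 1) for line in plan.splitlines() if line.strip()]
    return call_model("fugu-ultra", f"Combine answers for '{task}':\n" + "\n".join(partials))

# depth=0 is a single model call; each extra level of depth buys more
# decomposition at the cost of more calls -- test-time compute as a dial.
print(orchestrate("Refactor the payment module and add tests", depth=2))
```

On this reading, "orchestration depth" is just a recursion budget: spending more of it trades latency and cost for broader task decomposition, analogous to longer chains of thought in a single model.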
What to watch next is whether outside beta users can reproduce the benchmark edge and whether Sakana discloses pricing, model-pool composition, and failure cases as the beta expands. If those scores hold up in real coding and scientific workflows, Fugu becomes more than another wrapper on frontier APIs. It becomes a live test of whether orchestration can be sold as a model category of its own.
Related Articles
IBM Research’s VAKRA moves agent evaluation from static Q&A into executable tool environments. With 8,000+ locally hosted APIs across 62 domains and 3-7 step reasoning chains, the benchmark finds a gap between surface tool use and reliable enterprise agents.
LocalLLaMA did not just vent about weaker models; the thread turned the feeling into questions about provider routing, quantization, peak-time behavior, and how to prove a silent downgrade. The evidence is not settled, but the anxiety is real.
The r/singularity thread did not just react to Opus 4.7 scoring 41.0% where Opus 4.6 scored 94.7%. The interesting part was the community trying to separate real capability loss from refusal behavior, routing, and benchmark interpretation.