Japan just made one of the most contrarian bets in AI. Instead of spending billions to train a model that beats GPT-5.5 and Claude Opus 4.8, Tokyo’s Sakana AI built a model whose entire job is to orchestrate them. Meet Sakana Fugu — launched June 22, 2026 — an LLM trained to call other LLMs.
Key takeaways
- Sakana Fugu is an “orchestration model” — it routes each task to a coordinated team of frontier models (GPT-5.5, Claude Opus 4.8, Gemini 3.1 Pro…) instead of answering everything itself.
- Two versions: Fugu (fast, everyday) and Fugu Ultra (hardest, multi-step problems).
- Fugu Ultra posts the top score on 10 of 11 benchmarks — beating Opus 4.8 and GPT-5.5 on SWE-Bench Pro (73.7), TerminalBench, LiveCodeBench and Humanity’s Last Exam (Sakana’s own numbers).
- OpenAI-compatible API; subscriptions at $20 / $100 / $200 per month. Not available in the EU/EEA yet.
- The big question: a genuine breakthrough in coordination, or “just a router”? We break down both sides.
Contents
- What is Sakana Fugu?
- How the orchestration actually works
- A worked example: one hard query, start to finish
- Fugu vs Fugu Ultra
- The benchmarks — and the honest caveat
- Which models does it orchestrate?
- Pricing
- Using Fugu: a drop-in OpenAI-compatible API
- Who is behind Sakana AI?
- Fugu in context: Japan’s 2026 AI surge
- Breakthrough — or “just a wrapper”?
- Fugu vs rolling your own (or a router like OpenRouter)
- Why it matters
- Limitations to keep in mind
- Frequently asked questions
- The bottom line
What is Sakana Fugu?
Sakana Fugu is not a traditional foundation model. It’s a conductor — a learned system whose specialty is deciding which other AI models should handle your request, and how. The name is a wink: fugu is the pufferfish delicacy that only an expert can prepare safely. The implication is that orchestrating powerful models is itself a craft.
When you send a query to the single, OpenAI-compatible Fugu endpoint, the model decides internally: answer directly when it can (simple questions, low latency), or assemble and coordinate a team of expert models when the task is hard. Model selection, delegation, verification and final synthesis all happen inside the system and stay invisible to you. As Sakana puts it, the per-query routing is proprietary — you see one answer, not the committee behind it.
How the orchestration actually works
Under the hood, Fugu runs a loop that looks roughly like: route → delegate → verify → synthesize. It’s built on two papers Sakana published at ICLR 2026:
- TRINITY — a lightweight, evolutionarily optimized coordinator that works across several turns, assigning Thinker, Worker, or Verifier roles to delegate work adaptively.
- Conductor — a system trained with reinforcement learning to discover natural-language coordination strategies and focused prompts for a diverse pool of LLMs.
That distinction matters: Fugu is not a dumb if-then router. It’s a coordinator that has been optimized — through evolution and RL — to decide who does what, to double-check answers with a verifier role, and to stitch the pieces into one response. Whether that optimization holds up outside Sakana’s own evaluations is the open question we return to below.
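The route → delegate → verify → synthesize loop can be pictured with a short sketch. This is an illustration under stated assumptions, not Sakana's implementation: the `is_hard` heuristic, the `Team` dataclass, the `orchestrate` function and the prompt strings are all hypothetical, and Fugu's actual routing is learned rather than rule-based and is not public.

```python
from dataclasses import dataclass
from typing import Callable

# A "model" here is any function from prompt text to response text.
Model = Callable[[str], str]


@dataclass
class Team:
    thinker: Model
    worker: Model
    verifier: Model


def is_hard(query: str) -> bool:
    """Toy routing rule (assumption): long or compound queries count as hard."""
    return len(query.split()) > 12 or " and " in query


def orchestrate(query: str, direct: Model, team: Team, max_rounds: int = 2) -> str:
    # Route: answer directly when the query looks simple.
    if not is_hard(query):
        return direct(query)

    # Delegate: the thinker plans, the worker executes the plan.
    plan = team.thinker(f"Plan how to solve: {query}")
    draft = team.worker(f"Follow this plan:\n{plan}\n\nTask: {query}")

    # Verify: ask for a verdict and retry with the feedback if it is rejected.
    for _ in range(max_rounds):
        verdict = team.verifier(
            f"Task: {query}\nAnswer: {draft}\nReply OK or list problems."
        )
        if verdict.strip().upper().startswith("OK"):
            break
        draft = team.worker(
            f"Fix these problems:\n{verdict}\n\nPrevious answer:\n{draft}"
        )

    # Synthesize: the caller only ever sees this single answer.
    return draft
```

Any callable that maps a prompt to text can be plugged in as `direct`, `thinker`, `worker` or `verifier`, so the control flow can be exercised with stub functions before real API calls are wired in.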
A worked example: one hard query, start to finish
Imagine you ask Fugu Ultra to “refactor this 800-line Python service to async and fix the race condition in the connection pool.” Behind the single response you receive, the choreography looks roughly like this:
- Route: Fugu recognizes this is a hard, multi-step coding task rather than a one-liner, so it convenes a team instead of answering directly.
- Thinker: a strong reasoning model is assigned to plan the refactor and locate the race condition conceptually.
- Worker: a coding-specialized model writes the actual async implementation from that plan.
- Verifier: a third model checks the diff against the original intent — does it preserve behavior? did it actually fix the race? — and flags anything wrong.
- Synthesize: Fugu reconciles the verifier’s notes, requests a correction if needed, and returns one clean answer.
You never see the hand-offs. That’s the entire pitch: the rigor of a careful three-model review, delivered as if it came from a single assistant. The cost, naturally, is that several models ran where one might have done — which is exactly why Fugu’s router tries to answer simple questions itself and reserve the full committee for problems that warrant it.
Fugu vs Fugu Ultra
| Aspect | Fugu | Fugu Ultra |
|---|---|---|
| Built for | Everyday coding, code review, chatbots | Hard, multi-step problems where accuracy is critical |
| Priority | Strong performance + low latency | Maximum answer quality |
| Agent pool | Lean; can opt out of specific agents (compliance) | Deeper pool of expert agents; no opt-out |
| Model ID | fugu | fugu-ultra-20260615 |
The opt-out matters for businesses: with Fugu you can exclude particular models from the pool (say, to keep data away from a given provider), but Fugu Ultra trades that control for maximum quality.
The benchmarks — and the honest caveat
Sakana’s published comparison puts Fugu Ultra ahead of the frontier on coding and reasoning:
| Benchmark | Fugu Ultra | Opus 4.8 | Gemini 3.1 Pro | GPT-5.5 |
|---|---|---|---|---|
| SWE-Bench Pro | 73.7 | 69.2 | 54.2 | 58.6 |
| TerminalBench 2.1 | 82.1 | 74.6 | 70.3 | 78.2 |
| LiveCodeBench | 93.2 | 87.8 | 88.5 | 85.3 |
| Humanity’s Last Exam | 50.0 | 49.8 | 44.4 | 41.4 |
Sakana says Fugu Ultra “posts the top score on 10 of 11 rows”; the table above shows four of those rows. Two caveats keep this honest: (1) these are the vendor’s own numbers, and independent testing hasn’t caught up to the launch yet; and (2) an orchestrator beating the models it orchestrates is less surprising than it sounds, because it can pick the best model for each individual task. The real-world tests that matter are cost, latency, and reliability under load, not just a leaderboard.
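To put the size of those leads in perspective, the snippet below computes Fugu Ultra's margin over the best single model in each row. The scores are copied from the table above (Sakana's own numbers); the helper name `lead_over_best_single` is just for illustration.

```python
# Scores copied from the comparison table above (vendor-reported).
scores = {
    "SWE-Bench Pro": {"Fugu Ultra": 73.7, "Opus 4.8": 69.2, "Gemini 3.1 Pro": 54.2, "GPT-5.5": 58.6},
    "TerminalBench 2.1": {"Fugu Ultra": 82.1, "Opus 4.8": 74.6, "Gemini 3.1 Pro": 70.3, "GPT-5.5": 78.2},
    "LiveCodeBench": {"Fugu Ultra": 93.2, "Opus 4.8": 87.8, "Gemini 3.1 Pro": 88.5, "GPT-5.5": 85.3},
    "Humanity's Last Exam": {"Fugu Ultra": 50.0, "Opus 4.8": 49.8, "Gemini 3.1 Pro": 44.4, "GPT-5.5": 41.4},
}


def lead_over_best_single(row):
    """Return (best single model, Fugu Ultra's margin over it in points)."""
    baselines = {name: s for name, s in row.items() if name != "Fugu Ultra"}
    best = max(baselines, key=baselines.get)
    return best, round(row["Fugu Ultra"] - baselines[best], 1)


leads = {bench: lead_over_best_single(row) for bench, row in scores.items()}
for bench, (best, margin) in leads.items():
    print(f"{bench}: +{margin} over {best}")
```

On these four rows the margin runs from 0.2 points (Humanity's Last Exam, over Opus 4.8) to 4.7 points (LiveCodeBench, over Gemini 3.1 Pro), and the strongest single model differs from row to row.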
Which models does it orchestrate?
Sakana does not publicly list the pool — routing is proprietary. Press coverage points to GPT-5.5, Claude Opus 4.8 and Gemini 3.1 Pro among the orchestrated models. Interestingly, Sakana notes that Claude Fable 5 and Mythos Preview are not in Fugu’s pool, since they aren’t publicly accessible via API. If you want to understand the components Fugu is conducting, our AI models database has full specs and pricing for each, and our Claude Opus 4.8 vs GPT-5.5 comparison shows how they differ.
Pricing
Fugu is sold as a subscription, not pure pay-as-you-go: $20/month (Standard), $100/month (Pro), and $200/month (Max), each covering both Fugu and Fugu Ultra with different usage limits. Token usage and cost are reported per request through the OpenAI-compatible API (endpoints at console.sakana.ai). One thing to weigh: with an orchestrator you’re paying for the coordination layer on top of whatever the underlying models would cost — so the value depends on Fugu extracting enough extra quality to justify the overhead.
Using Fugu: a drop-in OpenAI-compatible API
Part of why Fugu is easy to try is that it speaks the OpenAI API dialect. If your code already calls OpenAI, you swap the base URL and the model name and you’re essentially done:
```python
from openai import OpenAI

# Point the standard OpenAI client at Sakana's endpoint instead of OpenAI's.
client = OpenAI(base_url="https://console.sakana.ai/v1", api_key="YOUR_KEY")

resp = client.chat.completions.create(
    model="fugu-ultra-20260615",  # or "fugu" for the faster everyday tier
    messages=[{"role": "user", "content": "Explain and fix this bug..."}],
)
print(resp.choices[0].message.content)
```

Token usage and cost are reported back per request, so you can see what a given query consumed, even though you can’t see which underlying models ran it. For teams in regulated environments, the standard Fugu tier’s ability to opt specific agents out of the pool is the feature that makes orchestration palatable: you can keep a given provider out of the loop entirely. Fugu Ultra trades that control away for maximum quality.
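If Fugu fills in the standard OpenAI-style `usage` object on each response (an assumption based on the API being described as OpenAI-compatible), per-request token counts could be read with a small helper like the one below. `summarize_usage` is a hypothetical name, and the stand-in response object exists only so the sketch runs without a network call; how Fugu exposes cost specifically is not covered here.

```python
from types import SimpleNamespace


def summarize_usage(resp) -> dict:
    """Pull the standard token counters off a chat-completion style response."""
    usage = getattr(resp, "usage", None)
    if usage is None:
        # Some responses may omit usage; report zeros rather than failing.
        return {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0}
    return {
        "prompt_tokens": usage.prompt_tokens,
        "completion_tokens": usage.completion_tokens,
        "total_tokens": usage.total_tokens,
    }


# Stand-in object with the same attribute names as an OpenAI SDK response.
fake_resp = SimpleNamespace(
    usage=SimpleNamespace(prompt_tokens=120, completion_tokens=480, total_tokens=600)
)
print(summarize_usage(fake_resp))
```

With a real client, the same helper would be called on the `resp` object returned by `client.chat.completions.create(...)`.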
Who is behind Sakana AI?
Sakana AI is a Tokyo-based lab founded in 2023 by Llion Jones — one of the co-authors of the original “Attention Is All You Need” Transformer paper — and David Ha, formerly of Google Brain. The company is known for nature-inspired and evolutionary approaches to AI (sakana means “fish,” evoking schools and swarms). Fugu fits that worldview neatly: intelligence emerging from the coordination of many models rather than from one ever-larger network.
Fugu in context: Japan’s 2026 AI surge
Fugu didn’t appear in a vacuum. Japan has spent 2026 building sovereign AI capability, much of it through METI and NEDO’s GENIAC program. The headline releases this year:
- Rakuten AI 3.0 (March 2026): billed as Japan’s largest high-performance model, a roughly 700-billion-parameter mixture-of-experts system optimized for Japanese and released openly under Apache 2.0.
- SoftBank / SB Intuitions “Sarashina”: a homegrown 460-billion-parameter Japanese LLM trained on a 4,000-GPU NVIDIA B200 cluster, now exposed through a commercial Sarashina API (plus a lightweight “Sarashina mini” for businesses).
- NTT “tsuzumi 2”: tuned for a strong efficiency-to-performance balance, aimed at enterprise deployment on modest hardware.
Against that backdrop of large, Japanese-optimized foundation models, Sakana’s bet stands out precisely because it’s the opposite: not another big model, but a layer that makes the world’s best models work together. It’s a distinctly Sakana move — and a reminder that Japan’s AI strategy is far broader than any single lab.
Breakthrough — or “just a wrapper”?
Early community sentiment skews skeptical, and the dominant question is blunt: “Is this just a router around other people’s models?” It’s a fair challenge. Here are both sides:
- The skeptic case: Fugu owns no frontier model of its own. Strip away the branding and it’s a paid layer that calls APIs you could call yourself. If a provider changes pricing or access, Fugu’s economics shift overnight.
- The bull case: coordination may genuinely be the frontier. If a learned conductor reliably squeezes more out of existing models than any single one of them — verifying, retrying, and combining — that’s real value, and it sidesteps the trillion-dollar training arms race entirely.
The truth is probably in between, and it hinges on independent validation that hasn’t arrived yet.
Fugu vs rolling your own (or a router like OpenRouter)
The obvious objection is: can’t I just route between models myself, or use an aggregator like OpenRouter? You can — and that’s the bar Fugu has to clear. A manual setup or a price/latency router picks one model per call based on simple rules. Fugu’s claim is qualitatively different: on a single hard task it can use several models, assign them roles, have one verify another, and combine the results — coordination that is genuinely tedious to build and tune by hand. Whether that learned coordination beats a well-designed manual pipeline for your workload is, once again, the thing to test before you commit. For straightforward needs, a single strong model — or a simple router — remains the cheaper and more transparent choice.
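For comparison, the kind of fixed-rule router described above might look like the minimal sketch below. The model labels, keyword markers and length threshold are hypothetical placeholders, not real model IDs and not how OpenRouter or any other aggregator actually decides.

```python
def pick_model(prompt: str, latency_sensitive: bool = False) -> str:
    """Fixed-rule router: returns exactly one model label per call."""
    code_markers = ("def ", "class ", "traceback")
    if latency_sensitive:
        return "fast-model"
    if any(marker in prompt.lower() for marker in code_markers):
        return "coding-model"
    if len(prompt) > 2000:
        return "long-context-model"
    return "general-model"


print(pick_model("Traceback (most recent call last): ..."))
print(pick_model("What is the capital of Japan?"))
```

Each call lands on a single model and nothing checks the answer afterwards. The roles, verification step and synthesis that Fugu claims are exactly what a router of this shape leaves out.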
Why it matters
Fugu crystallizes a trend we’ve been documenting: the marginal value of a bigger frontier model is shrinking, and the real leverage is matching the right model to each task. Our 2026 AI Price-Performance Index found that the frontier premium buys the last points of capability, not proportional value — and our open-vs-closed cost study showed how wide the price gap has become. Fugu automates exactly the decision those studies point to: which model should answer this question? If it works, it commoditizes “which AI should I use?” into a single endpoint.
Limitations to keep in mind
- Dependency: Fugu is only as good as the models in its pool — and your access to them.
- Cost stacking: you pay for Sakana’s coordination layer on top of underlying model usage.
- Opacity: proprietary routing means you can’t see which model produced your answer (Fugu allows agent opt-out; Fugu Ultra does not).
- Availability: not offered in the EU/EEA pending GDPR compliance.
- Unproven at launch: independent benchmarks and real-world reliability are still catching up to the claims.
Frequently asked questions
Is Sakana Fugu a large language model? Sort of — it’s an orchestration model that uses other LLMs rather than generating every answer from a single network.
Does Fugu replace GPT-5.5 or Claude? No — it calls them. It’s a layer above the frontier models, not a competitor to them in the usual sense.
Can I run Fugu locally? No. It’s a cloud API that depends on access to frontier model providers.
Is it open source? The product is proprietary, but the underlying research (TRINITY and Conductor) was published at ICLR 2026.
How is it different from a normal router? A typical router uses fixed rules. Fugu is a learned coordinator — optimized with evolution and reinforcement learning — that assigns roles, verifies outputs, and synthesizes a final answer.
The bottom line
Sakana Fugu is the most interesting AI launch of June 2026 — not because it’s the smartest model, but because it reframes the question. Instead of “which model is best?”, Fugu asks “what if you didn’t have to choose?” Whether it proves to be a genuine paradigm shift or a clever wrapper, it captures a real shift in where AI value lives: less in any single model, more in how you coordinate them. The benchmarks look striking; now we wait for the independent tests to confirm — or puncture — the hype.
Sources: Sakana AI launch materials and benchmark table; ICLR 2026 TRINITY and Conductor papers; reporting by MarkTechPost, Nikkei Asia and GIGAZINE. Figures as published June 2026.
