LiveCodeBench vs SWE-Bench Pro — checking Sakana Fugu's "beats Fable" claim against what the benchmarks actually measure

About this post
Co-edited with Claude (Anthropic).

Summary

Fugu, released by Sakana AI on 2026-06-22, is not a single giant model. It is an orchestration system that bundles several frontier models behind one API

The “beats Fable” / “it won” headlines are overblown. The scores are all vendor (Sakana) numbers, with no third-party reproduction yet

Read benchmark by benchmark, the win is on LiveCodeBench (competitive-programming-style code generation). On the more practical SWE-Bench Pro, Fugu actually falls short

I like the idea of raising accuracy by orchestrating models, but if the things being bundled are general-purpose commercial APIs like GPT-5.5 / Opus / Gemini, that points away from the “local LLM” potential I was hoping for

The +19.43% return over 50 weeks of stock trading is interesting, but you have to talk about the net after the multi-agent token cost. Still, a Tokyo-based company entering the global AI debate is genuinely good

When Sakana AI released “Sakana Fugu” on 2026-06-22, my timeline filled up with tweets saying it “beat Fable” and “won”¹. My gut reaction, before looking into it, was: yes, the benchmark table is competitive, but claiming it “beat Fable” outright is probably inflated.

Cross-checking the official release against hands-on write-ups, what it actually surpasses is mainly LiveCodeBench, while on the more practical SWE-Bench Pro there are areas where it does not reach Fable²⁴. In other words, if you don’t look at which benchmark it won on, you’ll misread this story.

In this post I’ll walk through what each benchmark measures, and use that ruler to sort Fugu’s wins, losses, and ties one by one. Along the way I’ll add my own take on the orchestration idea, the cost question behind the trading return, and why a Japanese company making LLM news is a good thing.

Fugu is not “one model” to begin with#

The first thing to get straight: Fugu is not a single large language model. It is a product that packages a multi-agent orchestration system — one that bundles several frontier models — behind a single OpenAI-compatible API¹³. The tagline is “One Model to Command Them All”¹.

Roughly, here is how it works¹³:

At the core is a 7B “coordinator” LLM trained with reinforcement learning. It is itself a language model, yet it is trained to call other LLMs in an agent pool (and can even call itself recursively)
Internally it dynamically assigns Thinker / Worker / Verifier roles to models depending on the task
The models in the pool are swappable. This avoids single-vendor lock-in and lets it route around a provider that becomes unavailable
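
To make the three points above concrete, here is a minimal toy sketch of a swappable pool with role assignment and fallback. Everything in it is my own illustration: the names `Pool`, `coordinate`, `make_stub`, and the `model_a` / `model_b` / `model_c` stubs are hypothetical, and the role table is hand-written, whereas Fugu's coordinator reportedly learns its assignments. It is not Sakana's implementation or API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-in for a model endpoint: takes a prompt, returns text.
ModelFn = Callable[[str], str]


class ProviderUnavailable(Exception):
    """Raised by a stub model to simulate a provider outage."""


@dataclass
class Pool:
    """A swappable pool: models are looked up by name and can be replaced."""
    models: Dict[str, ModelFn]

    def call_first_available(self, preferred: List[str], prompt: str) -> str:
        # Try models in preference order, skipping any that is missing or down.
        for name in preferred:
            fn = self.models.get(name)
            if fn is None:
                continue
            try:
                return fn(prompt)
            except ProviderUnavailable:
                continue
        raise RuntimeError("no model in the pool could serve the request")


def coordinate(pool: Pool, task: str, roles: Dict[str, List[str]]) -> str:
    """Toy Thinker -> Worker -> Verifier pipeline with a fixed role table."""
    plan = pool.call_first_available(roles["thinker"], f"Plan: {task}")
    draft = pool.call_first_available(roles["worker"], f"Do: {plan}")
    return pool.call_first_available(roles["verifier"], f"Check: {draft}")


def make_stub(name: str) -> ModelFn:
    # A stub "model" that just tags the prompt with its own name.
    return lambda prompt: f"[{name}] {prompt}"


def down(prompt: str) -> str:
    # A stub "model" whose provider is unavailable.
    raise ProviderUnavailable()


pool = Pool({"model_a": make_stub("a"), "model_b": down, "model_c": make_stub("c")})
roles = {
    "thinker": ["model_a"],
    "worker": ["model_b", "model_c"],  # model_b is down, so model_c serves
    "verifier": ["model_a"],
}
result = coordinate(pool, "fix the bug", roles)
print(result)
```

The worker step falls through from the unavailable `model_b` to `model_c` without the caller noticing, which is the "route around a provider" behaviour in miniature.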

The foundation is said to be two papers submitted to ICLR 2026: Trinity (a lightweight evolutionary coordinator) and Conductor (an RL method for learning coordination strategies between agents)¹³. The selling point, versus an external router, is that the orchestration is built into the model itself.

One caveat. Some articles name the pool’s component models as “GPT-5.5 / Claude Opus 4.8 / Gemini 3.1 Pro”³, but that comes from a hands-on write-up; the official release does not explicitly name specific models. Treat it as secondary information and discount it accordingly.

Once you hold this in mind, the benchmarks read differently. Fugu’s scores are not “the ability of one model” — they are “the combined result of bundling several frontier models behind the scenes.”

Reading each benchmark by what it actually measures#

This is the main point. Sakana claims “top score on 10 of 11 items”², but what “top” means is completely different from one benchmark to the next. Let’s go through five representative ones starting from what they measure. Note up front that every Fugu Ultra score here is a Sakana-published number²⁴.

SWE-Bench Pro — the closest thing to real-world code-fixing#

You hand the model an issue from a real GitHub repository, it actually edits the code, and it’s scored on whether the tests pass. In other words, it’s the closest proxy for “how much real bug-fixing and feature work can it complete on its own.” It’s one of the most-watched metrics for the practical usefulness of AI coding agents.
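
As a rough sketch of why this kind of score is unforgiving, here is the all-or-nothing scoring idea in a few lines. The task IDs and the results layout are hypothetical and are not the benchmark harness's real schema; the only thing taken from the description above is that a task counts when its tests pass.

```python
from typing import Dict, List


def resolve_rate(results: Dict[str, List[bool]]) -> float:
    """Fraction of tasks where every test passed (no partial credit)."""
    if not results:
        raise ValueError("no tasks to score")
    resolved = sum(1 for outcomes in results.values() if outcomes and all(outcomes))
    return resolved / len(results)


# Hypothetical outcomes: task id -> pass/fail for each of its tests.
example = {
    "task-1": [True, True, True],
    "task-2": [True, False, True],  # one failing test: not resolved
    "task-3": [True],
    "task-4": [False, False],
}
rate = resolve_rate(example)
print(f"{rate:.1%}")
```

A patch that gets two of three tests passing scores the same as one that does nothing, which is part of why this number tracks "finished the job" rather than "wrote plausible code."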

Fugu Ultra scores 73.7%². But one of the write-ups explicitly notes that in some areas it does not reach Fable 5⁴. So on the benchmark closest to real work, Fugu is not winning. To me this is the biggest crack in the “beats Fable” narrative.

LiveCodeBench — competitive-programming-style code generation#

You give it recent competitive-programming problems (LeetCode, AtCoder, Codeforces, etc.) and measure code correctness. The key design choice is that, to avoid training-data contamination, problems are partitioned by date so you can evaluate only on problems released after the model’s training cutoff. It leans toward “solving fresh problems” rather than “recalling memorized solutions.”
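
The date-partition idea can be sketched in a few lines. This is an illustration under my own assumptions: the `Problem` class, the problem IDs, and all the dates are made up, and the real benchmark tooling may structure this differently.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class Problem:
    problem_id: str
    released: date


def after_cutoff(problems: List[Problem], training_cutoff: date) -> List[Problem]:
    """Keep only problems released strictly after the model's training cutoff."""
    return [p for p in problems if p.released > training_cutoff]


# Hypothetical problems and cutoff, purely for illustration.
problems = [
    Problem("old-1", date(2025, 3, 1)),
    Problem("edge-2", date(2025, 9, 30)),
    Problem("new-3", date(2026, 1, 15)),
]
cutoff = date(2025, 9, 30)
fresh = after_cutoff(problems, cutoff)
print([p.problem_id for p in fresh])
```

The strict comparison means a problem released on the cutoff date itself is excluded, which is the conservative choice when the exact cutoff is uncertain.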

Fugu Ultra scores 93.2%, versus 89.8% for Fable 5²⁴. Here it clearly comes out ahead, by 3.4 points. The substance behind the “beats Fable” headline is, mostly, this benchmark.

Terminal-Bench 2.1 — agentic execution in a terminal#

This measures whether a model can carry a task to completion in a shell/terminal, issuing commands as it goes — environment setup, file operations, multi-step work. Close to autonomous agent ability.

Fugu Ultra scores 82.1 against Fable 5’s 80.4⁴. That is a 1.7-point lead: slightly ahead, not something I’d dismiss as noise, but a long way from a “blowout.”

Humanity’s Last Exam — very hard expert knowledge and reasoning#

A set of very hard questions written by domain experts, designed to be genuinely tough for current models — a “last exam” past the ceiling of ordinary knowledge tests. Scores normally come out low.

Fugu Ultra scores 50.0%². The sources don’t give a clear comparison figure, so I won’t call a win or loss here. Getting half of these questions right is strong, but without the other side’s score it can’t be used for a head-to-head claim.

CharXiv Reasoning — reading figures in scientific papers#

You get the model to read and reason over graphs and figures from arXiv papers. It’s multimodal and tests the quietly difficult skill of “reading the fine details of a figure correctly.”

Fugu Ultra scores 86.6%, versus 86.1% for Mythos Preview²⁴. A 0.5-point gap is effectively a tie, and hard to call a win.

Sorting table#

Rearranged by what each ruler measures:

Benchmark	What it measures	Fugu Ultra	Comparison	Verdict
SWE-Bench Pro	Resolving real-repo issues (closest to real work)	73.7%	Falls short of Fable 5 in places	Lean loss
LiveCodeBench	Code generation on fresh competitive problems	93.2%	Fable 5: 89.8%	Win
Terminal-Bench 2.1	Agentic execution in a terminal	82.1	Fable 5: 80.4	Slight edge
Humanity’s Last Exam	Very hard expert knowledge/reasoning	50.0%	Comparison unknown	Withheld
CharXiv Reasoning	Reading/reasoning over scientific figures	86.6%	Mythos Preview: 86.1%	Tie

Laid out this way, the only place you can plainly say “beats Fable” is LiveCodeBench — and on SWE-Bench Pro, the one that matters most for real work, it doesn’t reach Fable.
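
For the three rows that come with a comparison number, the size of the gap is simple arithmetic, sketched here. The scores are the vendor-published figures quoted in this post; the `Row` type and `gap_points` helper are just my own scaffolding. SWE-Bench Pro and Humanity’s Last Exam are left out because the sources give no comparison score to subtract.

```python
from typing import List, NamedTuple


class Row(NamedTuple):
    benchmark: str
    fugu_ultra: float
    comparator: str
    comparator_score: float


# Vendor-published scores as quoted in this post.
ROWS: List[Row] = [
    Row("LiveCodeBench", 93.2, "Fable 5", 89.8),
    Row("Terminal-Bench 2.1", 82.1, "Fable 5", 80.4),
    Row("CharXiv Reasoning", 86.6, "Mythos Preview", 86.1),
]


def gap_points(row: Row) -> float:
    """Fugu Ultra's lead in points, rounded to one decimal place."""
    return round(row.fugu_ultra - row.comparator_score, 1)


gaps = {row.benchmark: gap_points(row) for row in ROWS}
for row in ROWS:
    print(f"{row.benchmark}: {gap_points(row):+.1f} points vs {row.comparator}")
```

Note that the comparator is not the same model on every row, so even these three gaps are not all "versus Fable."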

“Beat Fable” / “it won” really is inflated#

These numbers need one more layer of fine print.

First: these are all vendor-published (Sakana’s own) numbers, with no independent third-party reproduction yet⁵. Self-measured figures inherently leave room to pick favorable conditions.

Second: spinning “it won in practice too” out of results where Fugu doesn’t even reach Fable 5 on SWE-Bench Pro is simply not accurate. The “top on 10 of 11 items” headline also lumps together the items it clearly wins and the ones where it’s merely a hair ahead. Read benchmark by benchmark, the picture changes a lot.

So the fair conclusion is this. Fugu is not “strongest across the board” — it “bundles several frontier models to pull up alongside the Fable / Mythos class”⁴. That’s an impressive achievement, but it’s a different claim from “surpassed.” The “it won” tweets are what you get from grabbing the headline without reading what the benchmarks contain.

I like the orchestration idea, but it nags at me#

From here, personal opinion. I actually quite like the orchestration idea itself: raising accuracy by bundling multiple models. It connects directly to a possibility I care about: if you combine open models so they cover each other’s individual weaknesses, maybe local LLMs can get close to the frontier too. I wrote earlier about the electricity cost and break-even of local LLMs, and if there’s a future where the local models you run for freedom and learning get bundled into something smarter, that genuinely excites me.

But if what’s inside Fugu is “bundling GPT-5.5 / Opus / Gemini” (secondary info, but still), then it’s also just three general-purpose commercial APIs bundled together — the opposite direction from local freedom. Indeed, abroad it’s been mocked as “basically three LLMs trying to pass as a frontier model”⁶. This is where it nags at me a bit. If the things being bundled are ultimately other people’s APIs, that’s a different picture from the “lifting local up” I was hoping for.

The saving grace is that the pool is designed to be swappable³. If the mechanism can bundle external APIs, then in principle it could slot in open or local models too. The possibility is still there. More than Fugu itself, it’s the “learned orchestration” framework that I’m still hopeful about.

On the “single model vs orchestration” line itself, supporters counter that “frontier models are internally MoE (Mixture of Experts), a collection of sub-models, so the distinction is already meaningless”⁶. There’s something to that, and the debate probably lands around “both sides are technically right.”

+19.43% over 50 weeks of trading — but is the net positive?#

Another thing that caught my eye: the real-world test. Fugu reportedly produced a +19.43% return on a 50-week stock-trading pipeline⁴. Together with solving a Rubik’s cube in 19 steps⁵, it’s a genuinely fun demo.

But a trading return is “gross,” not “take-home.” Fugu Ultra, by its multi-agent structure, consumes a lot of tokens. A hands-on write-up reports responses taking 11–269 seconds depending on task complexity, with one code-generation run consuming 26,404 tokens³. Run 50 weeks of trading decisions and the API bill (or, if self-hosted, the electricity) adds up.

What I want to know is what comes after that:

What was the principal, and how much is +19.43% in actual money?
After subtracting the token fees, commissions, and slippage spent over the period, is it still positive?
Is there excess return versus the index (market average) over the same window?

Without those, you can’t say “it earned a return exceeding its cost.” A demo that only shows a big return percentage needs to be retold in take-home terms net of running cost — that’s my usual stance. With multi-agent setups, what bites is “how many tokens to complete the task,” not “the unit price”⁴, so this needs to be measured, not assumed.
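
Here is a back-of-the-envelope sketch of the "take-home" calculation I have in mind. Only the +19.43% gross return and the 26,404-token figure come from the sources; the principal, the number of decisions per week, the token price, the commissions, and the slippage are all numbers I made up to show the shape of the arithmetic.

```python
def net_return(
    principal: float,
    gross_return: float,
    tokens_used: int,
    price_per_million_tokens: float,
    commissions: float = 0.0,
    slippage: float = 0.0,
) -> float:
    """Return after running costs, as a fraction of the principal."""
    gross_profit = principal * gross_return
    token_cost = tokens_used / 1_000_000 * price_per_million_tokens
    return (gross_profit - token_cost - commissions - slippage) / principal


def excess_return(net: float, index_return: float) -> float:
    """Net return minus the index return over the same window."""
    return net - index_return


GROSS = 0.1943                # reported gross return over 50 weeks
TOKENS_PER_DECISION = 26_404  # one reported run; assumed typical here
DECISIONS = 5 * 50            # ASSUMPTION: 5 decisions a week for 50 weeks
PRICE = 30.0                  # ASSUMPTION: cost per million tokens
FEES = dict(commissions=50.0, slippage=50.0)  # ASSUMPTION: flat amounts

tokens = TOKENS_PER_DECISION * DECISIONS
net_large = net_return(10_000.0, GROSS, tokens, PRICE, **FEES)
net_small = net_return(1_000.0, GROSS, tokens, PRICE, **FEES)
print(f"principal 10,000: net {net_large:.2%}")
print(f"principal  1,000: net {net_small:.2%}")
```

With these made-up costs, the same gross percentage stays comfortably positive on the larger principal and turns negative on the smaller one, which is exactly why the principal and the token bill need to be published alongside the headline figure.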

Still, a Japanese company making LLM news is a good thing#

I’ve been critical, but I want to land on a positive note.

The benchmarks need careful reading, and the “beats Fable” headlines are inflated. Even so, the plain fact that a system from Tokyo-based Sakana AI has a seat at the global AI-architecture debate, foreign media included⁶, is genuinely good. A Japanese name being compared on the same stage as Fable and Mythos would have been hard to imagine not long ago.

The AI-sovereignty framing — “use learned orchestration to avoid vendor lock-in and geopolitical export-control risk”¹ — is also a coherent direction for a Japanese company to stake out. Putting out a pragmatic “bundle and substitute” answer now that Fable 5 is unavailable⁴ is grounded, too.

Discount the numbers calmly. But cheer the attempt. You’re allowed to do both.

Closing#

In one sentence:

Fugu’s “beats Fable” mostly comes down to LiveCodeBench; on the practical SWE-Bench Pro it doesn’t reach Fable, and the figures are all vendor-published, so the “it won” framing is inflated. That said, I’m positive on the orchestration idea and on a Japanese company stepping onto that stage.

Benchmarks all look like the same “score” if you only read the names and numbers, but SWE-Bench Pro and LiveCodeBench measure completely different things. Before saying who won, check which ruler did the measuring. Beyond Fugu, that’s the one thing I want to keep in mind when reading news about the models still to come.

References#

1. Sakana Fugu: One Model to Command Them All (official release) https://sakana.ai/fugu-release/
2. Sakana AI Launches Sakana Fugu: An Orchestration Model That Routes Tasks Across a Swappable Pool of Frontier LLMs (MarkTechPost) https://www.marktechpost.com/2026/06/22/sakana-ai-launches-sakana-fugu-an-orchestration-model-that-routes-tasks-across-a-swappable-pool-of-frontier-llms/
3. Trying Sakana Fugu (GA) on a subscription plan (Classmethod) https://dev.classmethod.jp/en/articles/sakana-fugu-ga-first-touch/
4. What is Sakana Fugu? Fugu Ultra’s performance, pricing, and position as a Fable 5 alternative (AI Souken) https://www.ai-souken.com/article/what-is-sakana-fugu
5. Sakana AI announces ‘Sakana Fugu,’ a multi-agent system that boasts of surpassing Claude Fable (GIGAZINE) https://gigazine.net/gsc_news/en/20260622-sakana-fugu-multi-agent-system-ai/
6. Japanese AI Startup’s Fugu Matches Anthropic’s Fable & Mythos But Sparks Debate On AI Architecture (ETV Bharat) https://www.etvbharat.com/en/technology/japanese-ai-startup-sakana-fugu-matches-anthropic-fable-and-mythos-but-sparks-debate-on-ai-architecture-enn26062301825