Frontier Intelligence at 10,000 Tokens a Second? The 2026 Case
By Richard Valente
Most AI arguments are about which model is smartest. I want to make one about speed, because the line between “fast” and “frontier” looks like it is closing, and that changes what you can build.
The bet: a frontier-capable model running at 10,000 tokens per second is plausible, not fantasy. On the published numbers, it could come from three trends that are each real, each measured, and each pointed the same way. Intelligence is getting denser. Software is getting faster on the same hardware. And wafer-scale silicon sidesteps the memory wall that constrains conventional GPUs. Here is the math, and here is where it breaks.
To feel the number first: 10,000 tokens a second is about 7,500 words a second.[1] Set that against one person typing at 50 words a minute, eight hours a day.[2]
Figure: one person typing 50 WPM, eight hours a day. A 40-year career at that pace is about 250M words; one machine-day is about 650M. Typing speed overstates finished thinking, but it makes the raw scale legible.
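The arithmetic behind that comparison can be checked in a few lines. This is a minimal sketch: the 250 working days per year is an assumption (the article does not state it; it is a round value that lands near the ~250M figure), and the variable names are illustrative.

```python
# Back-of-envelope check of the typing comparison.
WORDS_PER_TOKEN = 0.75        # rule of thumb cited in the footnotes
TOKENS_PER_SECOND = 10_000    # the speed this piece bets on

TYPING_WPM = 50
HOURS_PER_DAY = 8
WORKDAYS_PER_YEAR = 250       # assumption: not stated in the article
CAREER_YEARS = 40

machine_words_per_second = TOKENS_PER_SECOND * WORDS_PER_TOKEN
machine_words_per_day = machine_words_per_second * 24 * 60 * 60

human_words_per_day = TYPING_WPM * 60 * HOURS_PER_DAY
human_words_per_career = human_words_per_day * WORKDAYS_PER_YEAR * CAREER_YEARS

print(f"machine: {machine_words_per_second:,.0f} words/s, "
      f"{machine_words_per_day / 1e6:,.0f}M words/day")
print(f"human:   {human_words_per_day:,} words/day, "
      f"{human_words_per_career / 1e6:,.0f}M words/career")
print(f"one machine-day = {machine_words_per_day / human_words_per_career:.1f} careers")
```

Under these assumptions the career total comes out slightly below the rounded 250M, and the machine-day slightly below the rounded 650M; the ratio stays above two either way.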
Disclosure: the author holds a financial position (shares and call options) in Cerebras. This is a technical analysis, not investment advice.
Today’s frontier, for scale
Start with the smartest model you can call right now. As of June 2026 that is Claude Opus 4.8, the current #1 on the Artificial Analysis Intelligence Index. Its measured output runs at about 62 tokens per second, and Anthropic’s new Fast mode lifts that to roughly 150.[3] The bet in this piece is that same tier of intelligence at 10,000 tokens per second. Call it 65 to 160 times faster than the best model you can use today.
That gap is not abstract. Hold quality roughly fixed and ask how long a job takes.
Figure: time to generate ~10k and ~100k tokens at each speed. Output speed only, quality held roughly equal. Opus 4.8 ≈ 62 tok/s (Artificial Analysis, Anthropic API); Fast mode ≈ 150. Adaptive reasoning adds thinking time on top.
The same answer, delivered while you are still reading the question. That is the line between a tool you wait on and a tool you think with.
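To put numbers on the wait, the sketch below converts the quoted output speeds into wall-clock time for a ~10k-token and a ~100k-token job. It counts streaming time only; thinking time and time-to-first-token are left out, and the helper names are illustrative.

```python
# Wall-clock time to stream a response at a fixed output speed.
# Speeds are the figures quoted in the text; thinking time and
# time-to-first-token are ignored.
SPEEDS_TOK_PER_S = {
    "Opus 4.8 (standard)": 62,
    "Opus 4.8 (Fast mode)": 150,
    "10,000 tok/s target": 10_000,
}
JOB_SIZES_TOKENS = [10_000, 100_000]


def generation_seconds(tokens: int, tokens_per_second: float) -> float:
    """Seconds needed to emit `tokens` at a steady decode rate."""
    return tokens / tokens_per_second


def human_readable(seconds: float) -> str:
    """Format seconds as minutes and seconds, or seconds alone under a minute."""
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes}m {secs:02d}s" if minutes else f"{secs}s"


for label, speed in SPEEDS_TOK_PER_S.items():
    times = [human_readable(generation_seconds(n, speed)) for n in JOB_SIZES_TOKENS]
    print(f"{label:<22} 10k: {times[0]:>8}   100k: {times[1]:>8}")
```

At 62 tokens per second the 100k-token job takes the better part of half an hour; at 10,000 it takes ten seconds.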
The wall everyone hits
Today’s best models are autoregressive. They write one token, look at what they wrote, write the next one. One at a time, in a line. Every token forces the chip to read all of the model’s active weights out of memory. The chip is rarely waiting on math. It is waiting on memory bandwidth. That is the wall.
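A rough way to see the wall: for a single stream, decode speed cannot exceed memory bandwidth divided by the bytes of weights read per token. The sketch below uses illustrative inputs that are not from this article (a hypothetical 70B dense model, a hypothetical sparse model that reads 5B parameters per token, 16-bit weights, and an assumed 3.35 TB/s of memory bandwidth as a ballpark for a high-end data-center GPU). It ignores KV-cache reads, batching, and multi-chip sharding, so treat the result as a ceiling, not a prediction.

```python
def decode_ceiling_tok_per_s(params_read_per_token: float,
                             bytes_per_param: float,
                             bandwidth_bytes_per_s: float) -> float:
    """Upper bound on single-stream decode speed when weight reads dominate."""
    bytes_per_token = params_read_per_token * bytes_per_param
    return bandwidth_bytes_per_s / bytes_per_token


# Illustrative inputs, not figures from this article.
HBM_BANDWIDTH = 3.35e12   # bytes/s, assumed ballpark for a high-end data-center GPU
BYTES_PER_PARAM = 2.0     # assumed 16-bit weights

dense_ceiling = decode_ceiling_tok_per_s(70e9, BYTES_PER_PARAM, HBM_BANDWIDTH)
sparse_ceiling = decode_ceiling_tok_per_s(5e9, BYTES_PER_PARAM, HBM_BANDWIDTH)

print(f"dense, 70B weights read per token: ~{dense_ceiling:.0f} tok/s ceiling")
print(f"sparse, 5B weights read per token: ~{sparse_ceiling:.0f} tok/s ceiling")
```

The formula has only two levers: read fewer bytes per token, or raise the bandwidth. The rest of this piece is about both.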
Figure: Same model, same weights, different silicon. Tokens per second on gpt-oss-120B, plus the projected diffusion + wafer-scale stack. The striped bar is theoretical: a projection (Cerebras peak 3,000 × diffusion best-case 4), not a measured result. Nobody has shipped this pairing. The four solid bars are real benchmarks; bars scaled to 12,000 tok/s. GPU figures: Baseten / Cerebras. Cerebras figures: launch post + Artificial Analysis.
Where the 3,000 came from, and what it cost
Start with the number that is already real. On launch day for OpenAI’s open gpt-oss-120B model, Cerebras reported 3,000 tokens per second.[4] Independent benchmarking from Artificial Analysis shows a sustained median closer to 1,800.[5] The 3,000 is a peak, launch-day figure; the 1,800 is closer to what holds under real load. Either way, compare that to GPUs running the identical model: 100 to 300 tokens per second on an H100, and about 650 on the best-optimized Blackwell setup anyone has shipped.[6] On the sustained 1,800 figure, that is roughly 3x to 18x faster on the same model and weights, at comparable output quality; on the 3,000 peak, roughly 5x to 30x.
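The speedup range follows directly from the figures quoted above. A minimal sketch, with illustrative labels, that prints every pairing so it is clear which Cerebras number and which GPU number produce each end of the range:

```python
# Tokens per second on gpt-oss-120B, as quoted in the text.
CEREBRAS_TOK_PER_S = {"sustained median": 1_800, "launch-day peak": 3_000}
GPU_TOK_PER_S = {"H100, low end": 100, "H100, high end": 300, "Blackwell, best shipped": 650}


def speedup(fast_tok_per_s: float, slow_tok_per_s: float) -> float:
    """How many times faster the first rate is than the second."""
    return fast_tok_per_s / slow_tok_per_s


for c_label, c_speed in CEREBRAS_TOK_PER_S.items():
    for g_label, g_speed in GPU_TOK_PER_S.items():
        print(f"Cerebras {c_label} vs {g_label}: {speedup(c_speed, g_speed):.1f}x")
```

The low end of the range pairs the sustained median with the best Blackwell result; the high end pairs a Cerebras figure with the slowest H100 configuration.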
How? A normal GPU keeps model weights in off-chip high-bandwidth memory (HBM) and streams them across a comparatively narrow pipe for every token. Cerebras builds a chip the size of a dinner plate and keeps the weights in on-chip SRAM, right next to the compute. The pipe stops being the problem.
Here is the catch nobody put on the headline. gpt-oss-120B is a strong open model, but it is not the smartest model in the room. It is a sparse Mixture-of-Experts: 120 billion parameters total, only about 5.1 billion active per token.[7] That sparsity is exactly why it runs fast, and it is also why it is not frontier-tier on the hardest reasoning. The 3,000 came with a trade: blistering speed on a capable, efficient model, not on the smartest thing available. Speed or intelligence. For a while you had to pick one.
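What “only some parameters active per token” means mechanically: a router scores every expert, and only the top few are evaluated. The sketch below is a toy version of top-k routing, not the gpt-oss implementation. The expert count and top-k come from the model card cited in the footnotes; the scalar “experts”, the random router scores, and the softmax-over-selected weighting are assumptions for illustration.

```python
import math
import random

NUM_EXPERTS = 128   # per the gpt-oss model card cited in the footnotes
TOP_K = 4           # experts activated per token, same source


def route(router_logits, k=TOP_K):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)          # for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_layer(x, experts, router_logits):
    """Weighted sum over the selected experts only; the rest are never evaluated."""
    return sum(weight * experts[i](x) for i, weight in route(router_logits))


# Toy stand-ins: each "expert" is a scalar function, router scores are random.
rng = random.Random(0)
experts = [lambda x, scale=rng.uniform(0.5, 1.5): scale * x for _ in range(NUM_EXPERTS)]
router_logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

selected = route(router_logits)
output = moe_layer(1.0, experts, router_logits)
print(f"experts evaluated: {len(selected)} of {NUM_EXPERTS} "
      f"({len(selected) / NUM_EXPERTS:.1%})")
print(f"layer output for x=1.0: {output:.3f}")
```

The expert fraction (4 of 128) is not the same as the parameter fraction (5.1B of 120B), because weights outside the expert layers, such as attention, are used for every token.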
That trade is the whole story. And it is closing.
The trade is closing: intelligence is getting denser
Intelligence per parameter is rising fast enough that “the small efficient model you can run fast” and “the frontier model” are converging into the same thing.
Look at one family over two years. On our reading of the Artificial Analysis Intelligence Index, Qwen2.5 72B (September 2024) scores around 16, and Qwen3.5 4B (March 2026) scores about 27.[8] A model eighteen times smaller, shipped seventeen months later, comes out ahead. Hold the size fixed and the trend is just as steep: the 4B slot moved from roughly 14 in April 2025 to about 27 by March 2026. Same footprint, intelligence nearly doubled in under a year.
Figure: A 4B model now outscores a 72B from 17 months ago. Artificial Analysis Intelligence Index, select Qwen models. The same 4B footprint nearly doubled in under a year, and passed a model 18x its size.
The thing that makes a model fast on Cerebras, fewer active parameters and a smaller memory footprint, is the same thing getting smarter every quarter. The speed-optimized model and the smart model are becoming one model. That is what “faster frontier intelligence” means: not a frontier model dragged down to run fast, but an efficient model climbing to frontier quality while keeping the speed.
And software alone is already worth ~6x
You do not even need new silicon to move the speed number. Same model, same chip, better software.
I benchmarked Xiaomi’s MiMo v2.5 Pro against its “UltraSpeed” variant. Identical 1-trillion-parameter model. The only difference is the serving stack: FP4 weights plus DFlash speculative decoding, where a small draft model proposes a block of tokens and the big model verifies them in one pass. In my runs it accepted about 21 tokens per verification pass. Standard Pro ran around 64 tokens per second client-side; UltraSpeed ran around 389, with the server decoding near 457.[9] Roughly 6x, same weights, no new hardware.
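The propose-then-verify loop can be sketched with toy models. This is generic greedy speculative decoding, not DFlash, whose internals are not described here. Both “models” are hypothetical deterministic functions, and the drafter is rigged to disagree with the target on every seventh position so that some proposals get rejected.

```python
from typing import Callable, List, Tuple

Model = Callable[[List[int]], int]   # context -> next token (greedy)


def target_model(context: List[int]) -> int:
    """Toy stand-in for the large model."""
    return (31 * sum(context) + len(context)) % 50


def draft_model(context: List[int]) -> int:
    """Toy drafter: agrees with the target except on every 7th position."""
    guess = target_model(context)
    return (guess + 1) % 50 if len(context) % 7 == 0 else guess


def autoregressive_decode(prompt: List[int], num_tokens: int,
                          target: Model = target_model) -> List[int]:
    """Baseline: one target pass per generated token."""
    out = list(prompt)
    for _ in range(num_tokens):
        out.append(target(out))
    return out[len(prompt):]


def speculative_decode(prompt: List[int], num_tokens: int, block_size: int,
                       draft: Model = draft_model,
                       target: Model = target_model) -> Tuple[List[int], int]:
    """Greedy speculative decoding. Returns (tokens, target verification passes)."""
    out = list(prompt)
    goal = len(prompt) + num_tokens
    passes = 0
    while len(out) < goal:
        # 1. The drafter proposes a block, one cheap step at a time.
        proposal: List[int] = []
        for _ in range(block_size):
            proposal.append(draft(out + proposal))
        # 2. The target checks every proposed position. A real server does
        #    this in one batched forward pass, so it is counted as one pass.
        passes += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok != expected:
                accepted.append(expected)   # first mismatch: keep target's token, stop
                break
            accepted.append(tok)
        else:
            accepted.append(target(out + accepted))   # whole block accepted: bonus token
        out.extend(accepted)
    return out[len(prompt):goal], passes


prompt = [3, 1, 4]
spec_tokens, passes = speculative_decode(prompt, num_tokens=200, block_size=8)
assert spec_tokens == autoregressive_decode(prompt, 200)   # identical output
print(f"{len(spec_tokens)} tokens in {passes} target passes "
      f"({len(spec_tokens) / passes:.1f} tokens per pass)")
```

The output matches plain greedy decoding token for token; only the number of expensive target passes changes. The speedup depends entirely on how often the drafter agrees with the target, which is why an acceptance figure like the ~21 tokens per pass reported above matters more than the block size.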
Figure: Same 1T model, software only: standard Pro vs. UltraSpeed. FP4 + DFlash speculative decoding (~21 tokens accepted per pass). Valente Labs benchmark, June 2026.
Speculative decoding ships in a serving update, not a chip generation. So the speed line moves on two clocks at once: the slow hardware clock, new chips every few years, and the fast software clock, a new serving trick every few months.
The other lever: diffusion drops the one-at-a-time wall
Diffusion models, the idea behind AI image generation, do not write left to right. They start with a rough draft of the whole passage and sharpen all of it at once, over a handful of passes. Many tokens per step instead of one.
Google shipped this for text. DiffusionGemma, released June 2026, is a 26-billion-parameter MoE with about 3.8 billion active, and it denoises 256 tokens per forward pass.[10] Google’s headline, up to 4x faster generation and 1,000-plus tokens per second on an H100, is a best case the company itself qualifies.[10] Different wall, different sledgehammer.
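One way to see why block-parallel decoding helps is to count forward passes. The sketch below is a simplified model, not a description of DiffusionGemma’s scheduler: the 256-token block comes from the text, while the number of denoising passes per block (8 here) is a placeholder assumption standing in for “a handful”.

```python
import math


def autoregressive_passes(num_tokens: int) -> int:
    """One forward pass per generated token."""
    return num_tokens


def diffusion_passes(num_tokens: int, block_tokens: int = 256,
                     denoise_steps: int = 8) -> int:
    """Each block of tokens is refined together over a fixed number of passes.

    denoise_steps is a placeholder assumption, not a published figure.
    """
    blocks = math.ceil(num_tokens / block_tokens)
    return blocks * denoise_steps


for n in (256, 1_024, 10_000):
    ar = autoregressive_passes(n)
    diff = diffusion_passes(n)
    print(f"{n:>6} tokens: autoregressive {ar:>6} passes, diffusion {diff:>4} passes")
```

Fewer passes does not translate one-to-one into wall-clock speed, because each diffusion pass processes a whole block and costs more compute than a single-token step. That gap between pass count and measured throughput is consistent with a headline of up to 4x rather than something far larger.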
The math, stacked
These levers attack different bottlenecks, so they multiply instead of fighting. Wafer-scale silicon removes the memory wall. Speculative decoding and diffusion remove the one-token-at-a-time wall. And rising intelligence density means the model you point all of this at is frontier-capable, not a toy.
Projected, not measured. Stack Cerebras’s peak 3,000 with diffusion’s best-case 4x parallel decode and you get roughly 12,000 tokens per second on paper:
3,000 tok/s (wafer-scale, peak) × 4 (diffusion, best case) = 12,000 tok/s (on paper)
Nobody has shipped this pairing. But on each vendor’s best-case published numbers, the projection clears 10,000. These are early figures, and the expectation is up, not a guarantee.
The hardware floor is rising too. Cerebras has shipped a new wafer generation roughly every two to three years, each on a smaller process node, and the next one is expected in the 2026 to 2027 window.[11]
Figure: Cerebras wafer generations. Every two to three years, on a smaller node. Directions analysts expect next, 3D-stacked SRAM and a finer process, would widen the on-chip memory this whole approach depends on. Not an official roadmap.
Where it breaks (the part a skeptic checks first)
I am not going to pretend the cable plugs in. Three honest problems:
Nobody has shipped this pairing. Cerebras’s software is tuned for autoregressive transformers. Running a diffusion decoder well on wafer-scale SRAM is real kernel engineering, not a config flag.
The 4x is a headline, not a law. Independent analysis of DiffusionGemma puts the matched-size, matched-quality speedup closer to 2x.[12] A 2025 paper that standardized the test conditions found diffusion’s speed advantage “largely disappears” once you compare fairly against an optimized autoregressive baseline.[13]
Google itself says DiffusionGemma’s quality is below standard Gemma, and it pays roughly a 10x penalty on time-to-first-token.[10]
So the real band is wide. Stack the conservative numbers and you land near 3,600. Stack the headline numbers and you clear 12,000. 10,000 lives near the optimistic end of that band, not the middle.
Figure: The honest range. Conservative: 1,800 × 2 = 3,600 tok/s. Headline: 3,000 × 4 = 12,000 tok/s.
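The band above, and where 10,000 sits inside it, can be worked out directly. A minimal sketch using only the figures already quoted; it assumes the hardware and diffusion gains multiply cleanly, which is exactly the part nobody has demonstrated.

```python
TARGET_TOK_PER_S = 10_000

# Figures quoted in the text.
CONSERVATIVE = {"base": 1_800, "diffusion_multiplier": 2}   # sustained median, matched-quality
HEADLINE = {"base": 3_000, "diffusion_multiplier": 4}       # launch-day peak, best case


def stacked_tok_per_s(base: float, diffusion_multiplier: float) -> float:
    """Projected speed if the two gains multiply without interference."""
    return base * diffusion_multiplier


def required_multiplier(base: float, target: float = TARGET_TOK_PER_S) -> float:
    """Diffusion speedup needed to reach the target from a given base speed."""
    return target / base


low = stacked_tok_per_s(**CONSERVATIVE)
high = stacked_tok_per_s(**HEADLINE)
position_in_band = (TARGET_TOK_PER_S - low) / (high - low)

print(f"band: {low:,.0f} to {high:,.0f} tok/s")
print(f"10,000 sits {position_in_band:.0%} of the way from the low end to the high end")
for label, base in (("sustained 1,800", 1_800), ("peak 3,000", 3_000)):
    print(f"from {label}: needs a {required_multiplier(base):.2f}x decode multiplier")
```

Put differently, hitting 10,000 from the sustained figure would take a decode multiplier above anything published so far, while from the peak figure it would take most of the headline 4x.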
The honest version of the bet: all three trend lines are real, all three are moving, and they attack different bottlenecks, so the product is the thing to watch. Someone proves out the high end of that band within a year, or finds the wall that stops them. Either answer is worth knowing.
Why 10,000 would matter
The typing comparison up top makes the speed legible. Volume makes it absurd.
Put a whole career against it. A person typing 50 WPM, eight hours a day across a 40-year career, produces about 250 million words. One day of this machine is around 650 million.[2] So in a single day it out-types an entire human working career, more than twice over. By the fairer measure, finished prose at maybe 1,000 words a day, the gap is wider still.
And if the intelligence-density trend holds, it could be a frontier-grade model doing the work, not a fast intern. An organization's worth of thinking-on-paper, at the speed of a search query.
Smartest model is a leaderboard fight. Fastest useful model is a different product. For a while you had to choose. The trade is closing, and the fast lane runs through denser models, smarter software, and wafer-scale silicon.
Each of those levers deserves its own teardown: the density curve, the software clock, and diffusion on wafer-scale silicon. Those are coming.
Footnotes
1. OpenAI Help Center, “What are tokens and how to count them” (~0.75 words per token; 10,000 tokens ≈ 7,500 words). https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them
2. Average typing speed ~50-52 WPM (Cambridge/Aalto 2018, 168k participants): https://www.cam.ac.uk/research/news/what-makes-a-faster-typist. Finished composition is far slower than raw typing; ~500-1,000 words/day of polished prose is a commonly cited knowledge-work range.
3. Artificial Analysis, “Claude Opus 4.8” model page: ~61.9 output tokens per second on the first-party Anthropic API, ranked #1 on the Artificial Analysis Intelligence Index as of June 2026, time-to-first-token ~25s, Fast mode up to ~2.5x output speed. https://artificialanalysis.ai/models/claude-opus-4-8
4. Cerebras, “Cerebras Launches OpenAI’s gpt-oss-120B at a Blistering 3,000 tokens/sec,” Aug 5 2025. https://www.cerebras.ai/blog/cerebras-launches-openai-s-gpt-oss-120b-at-a-blistering-3-000-tokens-sec
5. Artificial Analysis, gpt-oss-120B providers page (rolling 72-hour median, ~1,800 tok/s for Cerebras). https://artificialanalysis.ai/models/gpt-oss-120b/providers
6. H100 ~100-300 tok/s: Cerebras, “Blackwell vs Cerebras,” Nov 2025 (https://www.cerebras.ai/blog/blackwell-vs-cerebras). Blackwell B200 ~650 tok/s: Baseten, “How we made the fastest gpt-oss on NVIDIA GPUs,” Oct 24 2025 (https://www.baseten.co/blog/how-we-made-the-fastest-gpt-oss-on-nvidia-gpus-60-percent-faster/).
7. OpenAI gpt-oss model card, arXiv:2508.10925, Aug 8 2025 (120B total / 5.1B active, 128 experts, top-4 routing). https://arxiv.org/abs/2508.10925
8. Valente Labs analysis of the Qwen line, built on the Artificial Analysis Intelligence Index (https://artificialanalysis.ai), Apr 2024 to Apr 2026: Qwen2.5 72B ≈ 16 (Sep 2024), Qwen3 4B ≈ 14 (Apr 2025), Qwen3.5 4B ≈ 27 (Mar 2026).
9. Valente Labs inference benchmark (model-exploration harness), June 2026, comparing Xiaomi MiMo v2.5 Pro vs the UltraSpeed variant (same ~1T model, FP4 + DFlash speculative decoding): ~64 tok/s client-side standard vs ~389 client / ~457 server decode on UltraSpeed.
10. Google, “DiffusionGemma: the developer guide,” June 10 2026 (26B total / 3.8B active, 256 tokens per forward pass, up to 4x, 1,000+ tok/s H100, quality below standard Gemma, ~10x TTFT penalty). https://developers.googleblog.com/diffusiongemma-the-developer-guide/
11. Cerebras wafer-scale generations: WSE-1 (Aug 2019, TSMC 16nm), WSE-2 (Apr 2021, 7nm), WSE-3 (Mar 2024, 5nm), per Cerebras press releases and IEEE Spectrum (https://spectrum.ieee.org/cerebras-chip-cs3). WSE-4 timing and specs (a finer node, possible 3D-stacked SRAM) are analyst projection, e.g. The Next Platform (Oct 2025), not an official Cerebras roadmap.
12. MarkTechPost, analysis of DiffusionGemma (matched-size throughput ~1.9x vs an autoregressive baseline), June 10 2026. https://www.marktechpost.com/2026/06/10/google-ai-releases-diffusiongemma-a-26b-moe-open-model-using-text-diffusion-for-up-to-4x-faster-generation/ (secondary source; treat the ~2x as directional.)
13. “How Efficient Are Diffusion Language Models?” arXiv:2510.18480, Oct 2025 (speed advantage largely disappears under standardized conditions). https://arxiv.org/pdf/2510.18480