
Frontier Intelligence at 10,000 Tokens a Second? The 2026 Case

By Richard Valente

Most AI arguments are about which model is smartest. I want to make one about speed, because the line between “fast” and “frontier” looks like it is closing, and that changes what you can build.

The bet: a frontier-capable model running at 10,000 tokens per second is plausible, not fantasy. On the published numbers, it could come from three trends that are each real, each measured, and each pointed the same way. Intelligence is getting denser. Software is getting faster on the same hardware. And wafer-scale silicon sidesteps the memory wall that constrains conventional GPUs. Here is the math, and here is where it breaks.

To feel the number first: 10,000 tokens a second is about 7,500 words a second.[1] Set that against one person typing at 50 words a minute, eight hours a day.[2]

10,000 tokens / second ≈ 7,500 words every second
1 second of the machine equals 2.5 hours of typing
1 minute of the machine equals ~19 work-days
1 hour of the machine equals ~4.3 work-years
1 day of the machine equals a 40-year career, 2.5x over

One person typing 50 WPM, eight hours a day; a 40-year career at that pace is about 250M words, one machine-day about 650M. Typing speed overstates finished thinking, but it makes the raw scale legible.
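The conversions above can be checked with a few lines of arithmetic. This is a minimal sketch: the 0.75 words per token, 50 WPM, eight-hour day, and 40-year career come from the text, while the 260 working days per year is an assumption the article does not state.

```python
TOKENS_PER_SEC = 10_000
WORDS_PER_TOKEN = 0.75       # from footnote 1
TYPING_WPM = 50              # from footnote 2
HOURS_PER_DAY = 8
WORK_DAYS_PER_YEAR = 260     # assumption: 52 weeks x 5 days
CAREER_YEARS = 40

machine_words_per_sec = TOKENS_PER_SEC * WORDS_PER_TOKEN      # 7,500 words
typist_words_per_hour = TYPING_WPM * 60
typist_words_per_day = typist_words_per_hour * HOURS_PER_DAY
typist_words_per_year = typist_words_per_day * WORK_DAYS_PER_YEAR
career_words = typist_words_per_year * CAREER_YEARS


def machine_words(seconds):
    """Words the machine emits in the given number of seconds."""
    return machine_words_per_sec * seconds


one_second_in_typing_hours = machine_words(1) / typist_words_per_hour
one_minute_in_work_days = machine_words(60) / typist_words_per_day
one_hour_in_work_years = machine_words(3_600) / typist_words_per_year
one_day_in_careers = machine_words(86_400) / career_words

print(f"1 second  = {one_second_in_typing_hours:.1f} hours of typing")
print(f"1 minute  = {one_minute_in_work_days:.1f} work-days")
print(f"1 hour    = {one_hour_in_work_years:.1f} work-years")
print(f"1 day     = {one_day_in_careers:.1f} careers")
```

Changing `WORK_DAYS_PER_YEAR` moves the last two figures slightly but not the order of magnitude.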

Disclosure: the author holds a financial position (shares and call options) in Cerebras. This is a technical analysis, not investment advice.

Today’s frontier, for scale

Start with the smartest model you can call right now. As of June 2026 that is Claude Opus 4.8, the current #1 on the Artificial Analysis Intelligence Index. Its measured output runs at about 62 tokens per second, and Anthropic’s new Fast mode lifts that to roughly 150.[3] The bet in this piece is that same tier of intelligence at 10,000 tokens per second. Call it 65 to 160 times faster than the best model you can use today.

That gap is not abstract. Hold quality roughly fixed and ask how long a job takes.

~160x faster than today's #1 model, Claude Opus 4.8 (~62 tok/s); target: 10,000 tok/s

The job (size): Opus 4.8 today → at 10,000 tok/s
A long memo (~10k tokens): ~2.7 min → ~1 sec
A full report or big refactor (~100k tokens): ~27 min → ~10 sec
An hour of Opus output: 60 min → ~22 sec

Output speed only, quality held roughly equal. Opus 4.8 ≈ 62 tok/s (Artificial Analysis, Anthropic API); Fast mode ≈ 150. Adaptive reasoning adds thinking time on top.
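The rows above are just token counts divided by output speed. A minimal sketch of that calculation, using the article's figures (62 tok/s and 10,000 tok/s) and ignoring time-to-first-token and any reasoning time; the function and variable names are illustrative only:

```python
OPUS_TOK_PER_SEC = 62        # measured output speed cited in the article
TARGET_TOK_PER_SEC = 10_000  # the target in this piece


def seconds_to_generate(tokens, tokens_per_sec):
    """Pure output time: tokens divided by sustained output speed."""
    return tokens / tokens_per_sec


jobs = {
    "long memo": 10_000,
    "full report or big refactor": 100_000,
    "an hour of Opus output": OPUS_TOK_PER_SEC * 3_600,
}

for name, tokens in jobs.items():
    today_min = seconds_to_generate(tokens, OPUS_TOK_PER_SEC) / 60
    target_sec = seconds_to_generate(tokens, TARGET_TOK_PER_SEC)
    print(f"{name}: {today_min:.1f} min today, {target_sec:.1f} s at target")

speedup = TARGET_TOK_PER_SEC / OPUS_TOK_PER_SEC
print(f"speedup: about {speedup:.0f}x")
```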

The same answer, delivered while you are still reading the question. That is the line between a tool you wait on and a tool you think with.

The wall everyone hits

Today’s best models are autoregressive. They write one token, look at what they wrote, write the next one. One at a time, in a line. Every token forces the chip to read the model's weights out of memory (for a sparse model, the weights active for that token). The chip is rarely waiting on math. It is waiting on memory bandwidth. That is the wall.
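A back-of-the-envelope sketch shows why bandwidth, not arithmetic, sets the ceiling. It assumes single-stream decoding (batch size 1) and counts only weight reads, ignoring the KV cache, activations, and any overlap. The hardware numbers are illustrative assumptions, not figures from this article: about 3.35 TB/s of memory bandwidth (roughly the published figure for an H100 SXM) and 4-bit weights. The 5.1 billion active parameters is the gpt-oss-120B figure cited later in this piece.

```python
def max_tokens_per_sec(bandwidth_bytes_per_sec, active_params, bytes_per_param):
    """Upper bound on single-stream decode speed when every token must
    stream the active weights from memory exactly once."""
    bytes_per_token = active_params * bytes_per_param
    return bandwidth_bytes_per_sec / bytes_per_token


HBM_BANDWIDTH = 3.35e12   # assumption: ~3.35 TB/s, illustrative GPU figure
ACTIVE_PARAMS = 5.1e9     # gpt-oss-120B active parameters per token
BYTES_PER_PARAM = 0.5     # assumption: 4-bit weights

ceiling = max_tokens_per_sec(HBM_BANDWIDTH, ACTIVE_PARAMS, BYTES_PER_PARAM)
print(f"bandwidth-bound ceiling: ~{ceiling:,.0f} tok/s")

# Doubling bandwidth doubles the ceiling; extra arithmetic throughput
# does not appear in the formula at all.
doubled = max_tokens_per_sec(2 * HBM_BANDWIDTH, ACTIVE_PARAMS, BYTES_PER_PARAM)
print(f"with 2x bandwidth:       ~{doubled:,.0f} tok/s")
```

Real deployments land well below a ceiling like this, but the shape of the formula is the point: the only ways to raise it are more bandwidth, fewer active parameters, or fewer bytes per parameter.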

SAME MODEL, SAME WEIGHTS, DIFFERENT SILICON

Tokens per second on gpt-oss-120B, plus the projected diffusion + wafer-scale stack.

H100 (GPU): ~300
Blackwell B200 (GPU): ~650
Cerebras (sustained): ~1,800
Cerebras (launch peak): 3,000
Diffusion × wafer-scale (theoretical): ~12,000*

*Theoretical. The last figure is a projection (Cerebras peak 3,000 × diffusion best-case 4), not a measured result. Nobody has shipped this pairing. The other four figures are real benchmarks. GPU figures: Baseten / Cerebras. Cerebras: launch + Artificial Analysis.

Where the 3,000 came from, and what it cost

Start with the number that is already real. On launch day for OpenAI’s open gpt-oss-120B model, Cerebras reported 3,000 tokens per second.[4] Independent benchmarking from Artificial Analysis shows a sustained median closer to 1,800.[5] The 3,000 is a peak, launch-day figure; the 1,800 is closer to what holds under real load. Either way, compare that to GPUs running the identical model: 100 to 300 tokens per second on an H100, and about 650 on the best-optimized Blackwell setup anyone has shipped.[6] Roughly 3x to 20x faster on the same model and weights, at comparable output quality.

How? A normal GPU keeps model weights in off-chip high-bandwidth memory (HBM) and streams them across a comparatively narrow pipe for every token. Cerebras builds a chip the size of a dinner plate and keeps the weights in on-chip SRAM, right next to the compute. The pipe stops being the problem.

Here is the catch nobody put on the headline. gpt-oss-120B is a strong open model, but it is not the smartest model in the room. It is a sparse Mixture-of-Experts: 120 billion parameters total, only about 5.1 billion active per token.[7] That sparsity is exactly why it runs fast, and it is also why it is not frontier-tier on the hardest reasoning. The 3,000 came with a trade: blistering speed on a capable, efficient model, not on the smartest thing available. Speed or intelligence. For a while you had to pick one.

That trade is the whole story. And it is closing.

The trade is closing: intelligence is getting denser

Intelligence per parameter is rising fast enough that “the small efficient model you can run fast” and “the frontier model” are converging into the same thing.

Look at one family over two years. On our reading of the Artificial Analysis Intelligence Index, Qwen2.5 72B (September 2024) scores around 16, and Qwen3.5 4B (March 2026) scores about 27.[8] A model eighteen times smaller, shipped seventeen months later, comes out ahead. Hold the size fixed and the trend is just as steep: the 4B slot moved from roughly 14 in April 2025 to about 27 by March 2026. Same footprint, intelligence nearly doubled in under a year.

A 4B MODEL NOW OUTSCORES A 72B FROM 17 MONTHS AGO

Artificial Analysis Intelligence Index, select Qwen models.

Qwen2.5 72B · Sep 2024: 16
Qwen3 4B · Apr 2025: 14
Qwen3.5 4B · Mar 2026: 27

Same 4B footprint nearly doubled in under a year, and passed a model 18x its size.

The thing that makes a model fast on Cerebras, fewer active parameters and a smaller memory footprint, is the same thing getting smarter every quarter. The speed-optimized model and the smart model are becoming one model. That is what “faster frontier intelligence” means: not a frontier model dragged down to run fast, but an efficient model climbing to frontier quality while keeping the speed.

And software alone is already worth ~6x

You do not even need new silicon to move the speed number. Same model, same chip, better software.

I benchmarked Xiaomi’s MiMo v2.5 Pro against its “UltraSpeed” variant. Identical 1-trillion-parameter model. The only difference is the serving stack: FP4 weights plus DFlash speculative decoding, where a small draft model proposes a block of tokens and the big model verifies them in one pass. In my runs it accepted about 21 tokens per verification pass. Standard Pro ran around 64 tokens per second client-side; UltraSpeed ran around 389, with the server decoding near 457.[9] Roughly 6x, same weights, no new hardware.

SAME 1T MODEL, SOFTWARE ONLY

Standard Pro: ~64 tok/s
UltraSpeed: ~389 tok/s
Speedup: ~6x, no new hardware

FP4 + DFlash speculative decoding (~21 tokens accepted per pass). Valente Labs benchmark, June 2026.
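The draft-and-verify loop described above can be sketched in its simplest greedy form. This is a generic illustration, not DFlash or Xiaomi's serving code: `target_next` and `draft_next` are hypothetical toy functions standing in for the big and small models, tokens are plain integers, and the "one pass" verification is written as a loop where a real server would batch it into a single forward pass.

```python
def target_next(seq):
    """Toy stand-in for the big model's greedy next token."""
    return (seq[-1] + len(seq)) % 10


def draft_next(seq):
    """Toy draft model: agrees with the target except at every fifth position."""
    guess = target_next(seq)
    return (guess + 1) % 10 if len(seq) % 5 == 0 else guess


def greedy_decode(next_token, prompt, n_tokens):
    """Baseline: one model call per generated token."""
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(next_token(out))
    return out


def speculative_decode(target, draft, prompt, n_tokens, block=4):
    """Greedy draft-and-verify. Returns (tokens, verification_passes)."""
    out = list(prompt)
    goal = len(prompt) + n_tokens
    passes = 0
    while len(out) < goal:
        # 1. The cheap draft model proposes a block of tokens.
        proposal = []
        for _ in range(block):
            proposal.append(draft(out + proposal))
        # 2. The target checks every proposed position (one pass on real hardware).
        passes += 1
        for i, tok in enumerate(proposal):
            expected = target(out + proposal[:i])
            if tok != expected:
                # First mismatch: keep the accepted prefix plus the target's token.
                out.extend(proposal[:i])
                out.append(expected)
                break
        else:
            # Whole block accepted; the same pass also yields one bonus token.
            out.extend(proposal)
            out.append(target(out))
    return out[:goal], passes


tokens, passes = speculative_decode(target_next, draft_next, [1], 40)
baseline = greedy_decode(target_next, [1], 40)
print("matches plain greedy decoding:", tokens == baseline)
print("target passes:", passes, "for 40 tokens")
```

The output is identical to decoding with the target alone; only the number of expensive target passes changes. The better the draft agrees with the target, the more tokens each pass accepts, which is what the roughly 21 tokens per pass in the benchmark measures.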

Speculative decoding ships in a serving update, not a chip generation. So the speed line moves on two clocks at once: the slow hardware clock, new chips every few years, and the fast software clock, a new serving trick every few months.

The other lever: diffusion drops the one-at-a-time wall

Diffusion models, the idea behind AI image generation, do not write left to right. They start with a rough draft of the whole passage and sharpen all of it at once, over a handful of passes. Many tokens per step instead of one.

Google shipped this for text. DiffusionGemma, released June 2026, is a 26-billion-parameter MoE with about 3.8 billion active, and it denoises 256 tokens per forward pass.[10] Google’s headline, up to 4x faster generation and 1,000-plus tokens per second on an H100, is a best case the company itself qualifies.[10] Different wall, different sledgehammer.
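One way to see "many tokens per step" is to count sequential forward passes. The sketch below assumes a block-style diffusion decoder that refines a fixed block of tokens for a fixed number of denoising steps. The block size of 256 is the figure quoted above; the step count of 16 is a hypothetical placeholder, and the function names are illustrative, not any real API.

```python
import math


def autoregressive_passes(n_tokens):
    """One sequential forward pass per generated token."""
    return n_tokens


def diffusion_passes(n_tokens, block_size=256, denoise_steps=16):
    """Sequential passes if each block of `block_size` tokens is refined
    for `denoise_steps` steps (denoise_steps is a hypothetical value)."""
    blocks = math.ceil(n_tokens / block_size)
    return blocks * denoise_steps


n = 1_024
ar = autoregressive_passes(n)
df = diffusion_passes(n)
print(f"autoregressive: {ar} passes, diffusion: {df} passes, ratio {ar / df:.0f}x")
```

A pass over 256 positions costs more than a pass over one, so the ratio of pass counts is a bound on sequential steps, not a wall-clock speedup. That gap is one reason the measured advantage is far smaller than the raw count suggests.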

The math, stacked

These levers attack different bottlenecks, so they multiply instead of fighting. Wafer-scale silicon removes the memory wall. Speculative decoding and diffusion remove the one-token-at-a-time wall. And rising intelligence density means the model you point all of this at is frontier-capable, not a toy.

Projected, not measured. Stack Cerebras’s peak 3,000 with diffusion’s best-case 4x parallel decode and you get roughly 12,000 tokens per second on paper:

3,000 tok/s (Cerebras wafer-scale, peak) × 4 (diffusion, best case) = 12,000 tok/s on paper

Nobody has shipped this pairing. But on each vendor’s best-case published numbers, the projection clears 10,000. These are early figures; the expectation is that they improve, but that is an expectation, not a guarantee.

The hardware floor is rising too. Cerebras has shipped a new wafer generation roughly every two to three years, each on a smaller process node, and the next one is expected in the 2026 to 2027 window.[11]

CEREBRAS WAFER GENERATIONS

WSE-1: 2019 · 16nm
WSE-2: 2021 · 7nm
WSE-3: 2024 · 5nm (now)
WSE-4: ~2026-27 · expected

Every two to three years, on a smaller node. Directions analysts expect next, 3D-stacked SRAM and a finer process, would widen the on-chip memory this whole approach depends on. Not an official roadmap.

Where it breaks (the part a skeptic checks first)

I am not going to pretend the cable plugs in. Three honest problems:

Nobody has shipped this pairing. Cerebras’s software is tuned for autoregressive transformers. Running a diffusion decoder well on wafer-scale SRAM is real kernel engineering, not a config flag.

The 4x is a headline, not a law. Independent analysis of DiffusionGemma puts the matched-size, matched-quality speedup closer to 2x.[12] A 2025 paper that standardized the test conditions found diffusion’s speed advantage “largely disappears” once you compare fairly against an optimized autoregressive baseline.[13] Google itself says DiffusionGemma’s quality is below standard Gemma, and it pays roughly a 10x penalty on time-to-first-token.[10]

So the real band is wide. Stack the conservative numbers and you land near 3,600. Stack the headline numbers and you clear 12,000. 10,000 lives near the optimistic end of that band, not the middle.

THE HONEST RANGE

Conservative (1,800 × 2): ~3,600
The target: 10,000
Headline (3,000 × 4): ~12,000
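The band above can be reproduced, and the target located inside it, with a short calculation. It leans on the same assumption as the rest of the projection: that the hardware figure and the diffusion multiplier combine as a clean product, which nobody has demonstrated. All four inputs are the article's own numbers; the variable names are illustrative.

```python
CEREBRAS_SUSTAINED = 1_800   # tok/s, Artificial Analysis median
CEREBRAS_PEAK = 3_000        # tok/s, launch-day figure
DIFFUSION_MATCHED = 2        # independent matched-size estimate
DIFFUSION_HEADLINE = 4       # vendor best case
TARGET = 10_000


def stacked(base_tok_per_sec, multiplier):
    """Projected speed if the two levers multiply cleanly (an assumption)."""
    return base_tok_per_sec * multiplier


conservative = stacked(CEREBRAS_SUSTAINED, DIFFUSION_MATCHED)
headline = stacked(CEREBRAS_PEAK, DIFFUSION_HEADLINE)

# Where the target sits in the band: 0.0 is the conservative end, 1.0 the headline end.
position = (TARGET - conservative) / (headline - conservative)

# Multiplier on top of the hardware figure needed to reach the target.
needed_on_peak = TARGET / CEREBRAS_PEAK
needed_on_sustained = TARGET / CEREBRAS_SUSTAINED

print(f"band: {conservative:,} to {headline:,} tok/s")
print(f"target sits {position:.0%} of the way up the band")
print(f"multiplier needed: {needed_on_peak:.2f}x on peak, "
      f"{needed_on_sustained:.2f}x on sustained")
```

On the sustained hardware figure, the target needs a multiplier above even the 4x headline, which is why 10,000 sits toward the optimistic end.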

The honest version of the bet: all three trend lines are real, all three are moving, and they attack different bottlenecks, so the product is the thing to watch. Someone proves out the high end of that band within a year, or finds the wall that stops them. Either answer is worth knowing.

Why 10,000 would matter

The typing comparison up top makes the speed legible. Volume makes it absurd.

~2 min: the complete works of Shakespeare (~884k words)
~300 / hour: full-length novels (~90k words each)
~1 week: all of English Wikipedia (~4.7B words)
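These volume figures follow from the same 7,500 words per second. A minimal sketch using the word counts quoted above (the counts themselves are the article's approximations; the variable names are illustrative):

```python
WORDS_PER_SEC = 10_000 * 0.75    # 10,000 tok/s at ~0.75 words per token

SHAKESPEARE_WORDS = 884_000
NOVEL_WORDS = 90_000
WIKIPEDIA_WORDS = 4.7e9

shakespeare_minutes = SHAKESPEARE_WORDS / WORDS_PER_SEC / 60
novels_per_hour = WORDS_PER_SEC * 3_600 / NOVEL_WORDS
wikipedia_days = WIKIPEDIA_WORDS / (WORDS_PER_SEC * 86_400)

print(f"Shakespeare: {shakespeare_minutes:.1f} minutes")
print(f"Novels: {novels_per_hour:.0f} per hour")
print(f"English Wikipedia: {wikipedia_days:.1f} days")
```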

Put a whole career against it. A person typing 50 WPM, eight hours a day across a 40-year career, produces about 250 million words. One day of this machine is around 650 million.[2] So in a single day it out-types an entire human working career, more than twice over. By the fairer measure, finished prose at maybe 1,000 words a day, the gap is wider still.

~650 million words a day

And if the intelligence-density trend holds, it could be a frontier-grade model doing the work, not a fast intern. An organization's worth of thinking-on-paper, at the speed of a search query.

Smartest model is a leaderboard fight. Fastest useful model is a different product. For a while you had to choose. The trade is closing, and the fast lane runs through denser models, smarter software, and wafer-scale silicon.

Each of those levers deserves its own teardown: the density curve, the software clock, and diffusion on wafer-scale silicon. Those are coming.

Footnotes

  1. OpenAI Help Center, “What are tokens and how to count them” (~0.75 words per token; 10,000 tokens ≈ 7,500 words). https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them

  2. Average typing speed ~50-52 WPM (Cambridge/Aalto 2018, 168k participants): https://www.cam.ac.uk/research/news/what-makes-a-faster-typist. Finished composition is far slower than raw typing; ~500-1,000 words/day of polished prose is a commonly cited knowledge-work range.

  3. Artificial Analysis, “Claude Opus 4.8” model page: ~61.9 output tokens per second on the first-party Anthropic API, ranked #1 on the Artificial Analysis Intelligence Index as of June 2026, time-to-first-token ~25s, Fast mode up to ~2.5x output speed. https://artificialanalysis.ai/models/claude-opus-4-8

  4. Cerebras, “Cerebras Launches OpenAI’s gpt-oss-120B at a Blistering 3,000 tokens/sec,” Aug 5 2025. https://www.cerebras.ai/blog/cerebras-launches-openai-s-gpt-oss-120b-at-a-blistering-3-000-tokens-sec

  5. Artificial Analysis, gpt-oss-120B providers page (rolling 72-hour median, ~1,800 tok/s for Cerebras). https://artificialanalysis.ai/models/gpt-oss-120b/providers

  6. H100 ~100-300 tok/s: Cerebras, “Blackwell vs Cerebras,” Nov 2025 (https://www.cerebras.ai/blog/blackwell-vs-cerebras). Blackwell B200 ~650 tok/s: Baseten, “How we made the fastest gpt-oss on NVIDIA GPUs,” Oct 24 2025 (https://www.baseten.co/blog/how-we-made-the-fastest-gpt-oss-on-nvidia-gpus-60-percent-faster/).

  7. OpenAI gpt-oss model card, arXiv:2508.10925, Aug 8 2025 (120B total / 5.1B active, 128 experts, top-4 routing). https://arxiv.org/abs/2508.10925

  8. Valente Labs analysis of the Qwen line, built on the Artificial Analysis Intelligence Index (https://artificialanalysis.ai), Apr 2024 to Apr 2026: Qwen2.5 72B ≈ 16 (Sep 2024), Qwen3 4B ≈ 14 (Apr 2025), Qwen3.5 4B ≈ 27 (Mar 2026).

  9. Valente Labs inference benchmark (model-exploration harness), June 2026, comparing Xiaomi MiMo v2.5 Pro vs the UltraSpeed variant (same ~1T model, FP4 + DFlash speculative decoding): ~64 tok/s client-side standard vs ~389 client / ~457 server decode on UltraSpeed.

  10. Google, “DiffusionGemma: the developer guide,” June 10 2026 (26B total / 3.8B active, 256 tokens per forward pass, up to 4x, 1,000+ tok/s H100, quality below standard Gemma, ~10x TTFT penalty). https://developers.googleblog.com/diffusiongemma-the-developer-guide/

  11. Cerebras wafer-scale generations: WSE-1 (Aug 2019, TSMC 16nm), WSE-2 (Apr 2021, 7nm), WSE-3 (Mar 2024, 5nm), per Cerebras press releases and IEEE Spectrum (https://spectrum.ieee.org/cerebras-chip-cs3). WSE-4 timing and specs (a finer node, possible 3D-stacked SRAM) are analyst projection, e.g. The Next Platform (Oct 2025), not an official Cerebras roadmap.

  12. MarkTechPost, analysis of DiffusionGemma (matched-size throughput ~1.9x vs an autoregressive baseline), June 10 2026. https://www.marktechpost.com/2026/06/10/google-ai-releases-diffusiongemma-a-26b-moe-open-model-using-text-diffusion-for-up-to-4x-faster-generation/ (secondary source; treat the ~2x as directional.)

  13. “How Efficient Are Diffusion Language Models?” arXiv:2510.18480, Oct 2025 (speed advantage largely disappears under standardized conditions). https://arxiv.org/pdf/2510.18480