The Slowest Thing on Your Phone
Everything on your phone is instant - Instagram, YouTube, Google Search. Then you ask an AI a question, and you wait, watching words appear one at a time like a telegraph.
There's a reason. Every time you use an AI model, a chip somewhere has to generate your answer token by token - a process called inference. Training is how models learn; inference is the unglamorous work of actually running them, billions of times a day, and it is fast becoming the dominant computing workload on earth.
This week, the company that bet its existence on inference goes public. Cerebras Systems - maker of a single chip the size of a dinner plate - is expected to price its Nasdaq IPO Wednesday and begin trading Thursday under the ticker CBRS. The order book, per Reuters reporting, is more than twenty times oversubscribed. OpenAI has signed a contract worth more than $20 billion. Amazon reportedly bought $270 million of stock. And NVIDIA - the company whose market Cerebras is explicitly attacking - just spent $20 billion of its own money arming itself against exactly this thesis.
We used Barebone AI to rebuild the Cerebras story from the amended prospectus, a decade of funding rounds, and eighteen months of benchmark wars. Four numbers frame everything: $510 million of 2025 revenue, up 76% in a year. About 86% of it from two Abu Dhabi-linked entities. A GAAP "profit" that disappears the moment you read the footnote. And an ask of up to $48.8 billion - roughly 90 to 96 times last year's sales.
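The arithmetic behind those four numbers is easy to check. A minimal sketch, assuming "up 76%" means year-over-year growth on 2024 revenue and that the multiple is taken against the top of the ask; the variable names are ours, not the prospectus's:

```python
# Figures quoted in the article; variable names are illustrative.
revenue_2025 = 510e6      # dollars
growth = 0.76             # assumed to be year-over-year growth
concentration = 0.86      # share from two Abu Dhabi-linked entities
top_valuation = 48.8e9    # top of the IPO ask, dollars

implied_2024_revenue = revenue_2025 / (1 + growth)
concentrated_revenue = revenue_2025 * concentration
top_multiple = top_valuation / revenue_2025

print(f"Implied 2024 revenue:   ${implied_2024_revenue / 1e6:.0f}M")
print(f"Revenue from two buyers: ${concentrated_revenue / 1e6:.0f}M")
print(f"Price-to-sales at top:   {top_multiple:.1f}x")
```

The top-of-range multiple works out to just under 96 times sales, which matches the upper end of the range quoted above.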
The most interesting IPO of the year is not automatically the best-priced one. Let's take it apart.
One Chip the Size of a Dinner Plate
The idea Cerebras is going public on is older than most of its engineers - and it has a body count.
In 1980, Gene Amdahl, the legendary IBM mainframe architect, founded Trilogy Systems to build computers on one whole silicon wafer instead of cutting the wafer into hundreds of small chips. Trilogy raised roughly $230 million, making it the most lavishly funded startup of its era. At a test in December 1983, two microscopic wires crossed and the giant circuit glowed red before dying. By mid-1984 the company had abandoned wafer-scale integration; Trilogy became one of Silicon Valley's biggest financial failures of the pre-dotcom age.
The physics problem is brutal. Silicon wafers always carry defects, so for forty years the industry's answer was to dice each wafer into hundreds of small chips and throw away the broken ones. Make the whole wafer one chip, and a single flaw can kill the entire thing.
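To see how brutal, here is a rough sketch using the textbook Poisson yield model, in which the chance of a defect-free die falls exponentially with its area. The defect density is a hypothetical value chosen for illustration, not a foundry figure, and the wafer-scale area is the WSE-3 figure quoted later in this article:

```python
import math

def zero_defect_probability(area_mm2: float, defects_per_mm2: float) -> float:
    """Poisson yield model: probability that a die of this area has no defects."""
    return math.exp(-area_mm2 * defects_per_mm2)

# Hypothetical defect density (0.1 defects per cm^2), for illustration only.
DEFECTS_PER_MM2 = 0.001

small_die = zero_defect_probability(100, DEFECTS_PER_MM2)        # a 100 mm^2 chip
whole_wafer = zero_defect_probability(46_225, DEFECTS_PER_MM2)   # wafer-scale area

print(f"100 mm^2 die defect-free:     {small_die:.1%}")
print(f"Whole wafer defect-free:      {whole_wafer:.1e}")
```

Under these assumptions about nine in ten small dies come out clean, while a perfect wafer-sized chip is effectively impossible. Any workable wafer-scale design has to tolerate defects rather than avoid them.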
Cerebras - founded in 2016 by five alumni of SeaMicro, the server startup Andrew Feldman sold to AMD for $334 million - designed around the defects instead: tile the wafer with hundreds of thousands of tiny identical cores, expect some to arrive dead, and route around them. Not thousands of chips talking to each other. One chip. Three generations later, this is the WSE-3 next to NVIDIA's workhorse GPU:
| Spec | Cerebras WSE-3 | NVIDIA H100 |
|---|---|---|
| Silicon area | 46,225 mm² | 826 mm² |
| Transistors | 4 trillion | 80 billion |
| Compute cores | 900,000 | 16,896 |
| On-chip memory | 44 GB SRAM | 50 MB |
| Memory bandwidth | 21 PB/s | 3.35 TB/s |
| Peak FP16 compute | 125 petaflops | ~2 petaflops |
Divide the last row and you get the marketing claim: one wafer is theoretically equivalent to about 62 H100s. The operative word is theoretically - that is peak arithmetic against NVIDIA's previous-generation part, not delivered performance.
The row that actually matters is memory bandwidth. Generating text is a memory-bound problem: for every single token, the chip has to touch the model's weights. A GPU keeps those weights in memory stacked next to the processor and pays a toll on every trip - and a cluster of GPUs pays a second toll every time data hops between chips. Cerebras keeps the weights in 44 GB of memory on the silicon itself, with more than 6,000 times the bandwidth of an H100's memory system. That is the entire trick. One chip, no hops.
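The ratios, and what memory-bound means in practice, can be sketched in a few lines. This is a back-of-envelope estimate, not a benchmark: it assumes a single stream in which every generated token reads all the weights once, and it ignores compute limits, caching, and batching. The 8-billion-parameter model is a hypothetical example picked because it fits on either chip:

```python
# Peak figures from the spec table above; names are illustrative.
WSE3 = {"flops": 125e15, "bandwidth": 21e15}    # FP16 FLOP/s, bytes/s
H100 = {"flops": 2e15, "bandwidth": 3.35e12}

compute_ratio = WSE3["flops"] / H100["flops"]
bandwidth_ratio = WSE3["bandwidth"] / H100["bandwidth"]

def decode_ceiling(bandwidth: float, params: float, bytes_per_param: int = 2) -> float:
    """Upper bound on tokens/sec for one stream if each token reads all weights once."""
    return bandwidth / (params * bytes_per_param)

# Hypothetical 8B-parameter model at 16-bit precision (16 GB of weights).
h100_ceiling = decode_ceiling(H100["bandwidth"], 8e9)

print(f"Peak compute ratio:  {compute_ratio:.1f}x")
print(f"Bandwidth ratio:     {bandwidth_ratio:,.0f}x")
print(f"H100 decode ceiling: {h100_ceiling:.0f} tokens/sec")
```

On these assumptions memory bandwidth alone caps a single H100 stream at roughly 200 tokens per second for this model. The same formula applied to the wafer gives a number so large that something other than memory traffic becomes the limit, which is the point of the design.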
Speed Became the Product
For years that trick looked like a science project, because the money was in training - a batch job where throughput beats latency. Two things changed.
First, the economics. Training is a capital expense for a handful of frontier labs. Inference is an operating expense for everyone, forever. The buyer pool goes from five labs to every enterprise on the planet.
Second, agents. An AI agent - software that books the flight, writes the code, runs the workflow - chains together dozens of model calls per task. A chatbot streaming at 60 tokens per second feels fine because you read along. An agent waiting on itself forty times in a row lives or dies by whether each step takes seconds or milliseconds. Latency compounds.
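A quick sketch of how that compounding plays out. The forty sequential steps come from the paragraph above; the 500 tokens per step and the 2,500 tokens-per-second fast tier are hypothetical round numbers, and network and tool-call overhead is left at zero:

```python
def task_seconds(steps: int, tokens_per_step: int, tokens_per_sec: float,
                 overhead_s: float = 0.0) -> float:
    """Wall-clock time for an agent whose model calls run one after another."""
    return steps * (tokens_per_step / tokens_per_sec + overhead_s)

STEPS = 40               # "forty times in a row"
TOKENS_PER_STEP = 500    # hypothetical output length per call

chat_speed = task_seconds(STEPS, TOKENS_PER_STEP, 60)      # chatbot-grade streaming
fast_speed = task_seconds(STEPS, TOKENS_PER_STEP, 2_500)   # hypothetical fast tier

print(f"At 60 tokens/sec:    {chat_speed / 60:.1f} minutes")
print(f"At 2,500 tokens/sec: {fast_speed:.0f} seconds")
```

Under these assumptions the same task takes more than five minutes at chatbot speed and about eight seconds at the faster rate. A human reading along never notices the difference per call; an agent pays it forty times.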
That's the demand curve Cerebras spent 2024 and 2025 benchmarking itself onto, with independent verification from Artificial Analysis:
| Benchmark | Cerebras speed | Context |
|---|---|---|
| Llama 3.1 405B (Nov 2024) | 969 tokens/sec | A frontier-scale model at interactive speed |
| Llama 4 Scout (Apr 2025) | 2,600+ tokens/sec | 19x the fastest GPU provider at the time |
| Llama 4 Maverick (May 2025) | 2,522 tokens/sec | vs 1,038 for NVIDIA's tuned Blackwell submission |
Benchmarks are marketing until somebody pays for them. Then, inside about six months, the three biggest names in AI infrastructure all paid.
Three Deals in Six Months
| When | Deal | Terms |
|---|---|---|
| Late 2025 | NVIDIA + Groq | $20B licensing-and-acquihire: Groq's inference IP plus its engineering team |
| Early 2026 | OpenAI + Cerebras | $20B+ multi-year purchase of 750 MW of compute; warrants for a minority stake; ~$1B from OpenAI toward data centers |
| Mar 2026 | AWS + Cerebras | CS-3 systems deployed inside AWS data centers, sold through Bedrock; Amazon reportedly bought ~$270M of stock |
Start with NVIDIA, because it's the tell. NVIDIA paid $20 billion for Groq - the other speed-first inference challenger - in a deal structured as a license-plus-acquihire of its technology and people rather than a purchase of the company. Jensen Huang explained the logic in writing:
"We plan to integrate Groq's low-latency processors into the NVIDIA AI factory architecture."
Incumbents do not spend $20 billion on categories they consider irrelevant. In one stroke, the deal validated the inference-speed thesis Cerebras is selling - and armed the only competitor that matters with dedicated low-latency silicon of its own. By GTC in March, Groq-derived accelerators were already on NVIDIA's roadmap slides.
OpenAI went the other way. Its agreement, reported by The Information in early 2026, commits more than $20 billion over multiple years for 750 megawatts of Cerebras capacity, hands OpenAI warrants for a minority stake, and includes about $1 billion from OpenAI toward data centers. One detail the louder headlines garbled: Sam Altman and Greg Brockman were early personal investors in Cerebras. That raises a fair governance question for OpenAI - but the stakes have been disclosed in the IPO filings since 2024, and were never a secret.
Amazon's deal may be the most telling. AWS will put Cerebras systems in its own data centers and resell them through Bedrock, with Amazon's Trainium 3 chips handling prefill - reading your prompt, a compute-bound job - and the Cerebras wafer handling decode, the memory-bound work of generating the answer. AWS says the pairing speeds up output by roughly 5x. Note what Amazon kept for itself: the half of inference its own silicon is good at.
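Why prefill and decode land on opposite sides of that line can be sketched with a simple roofline-style comparison. The sketch assumes about 2 FLOPs per parameter per token and 16-bit weights read once per pass, and it ignores attention, caching, and batching. The article gives no Trainium 3 figures, so the H100 peak numbers from the spec table earlier stand in for a generic GPU-class chip:

```python
# H100 peak figures from the spec table, used only as a stand-in accelerator.
PEAK_FLOPS = 2e15       # FLOP/s
BANDWIDTH = 3.35e12     # bytes/s

def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: int = 2) -> float:
    """FLOPs done per byte of weights read (assumes ~2 FLOPs per parameter per token)."""
    return 2 * tokens_per_pass / bytes_per_param

def bottleneck(tokens_per_pass: int) -> str:
    """'compute' if arithmetic runs out first, otherwise 'memory'."""
    machine_balance = PEAK_FLOPS / BANDWIDTH   # FLOPs the chip can do per byte moved
    if arithmetic_intensity(tokens_per_pass) > machine_balance:
        return "compute"
    return "memory"

prefill = bottleneck(1_000)   # hypothetical 1,000-token prompt read in one pass
decode = bottleneck(1)        # one new token per pass

print(f"Prefill is {prefill}-bound, decode is {decode}-bound")
```

Prefill amortizes each weight read across the whole prompt, so arithmetic is the constraint. Decode reads every weight to produce a single token, so bandwidth is. Splitting the two jobs across different silicon follows from that asymmetry.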
None of this is revenue yet. Which brings us to the filing.