✏️ Relearning LLMs: How Inference Works, Inference Speed, and Inference Cost — Starting with “Moving Stuff vs. Computing Stuff”
A non-technical explainer. After reading, you’ll understand: why ChatGPT spits out characters one at a time, why it can serve millions of users simultaneously, why ultra-long context suddenly costs more, and why no amount of money can make it “answer instantly.”
I. First, Build a Mental Picture
When you press Enter in ChatGPT, what’s actually happening inside the machine?
To answer that, we first need to meet three characters: parameters, VRAM, and compute units.
1. Parameters: A Model Is Really Just a Pile of Numbers
You may have heard things like “GPT-5 has hundreds of billions of parameters” or “DeepSeek is a 700B model.” Parameters sound profound, but physically they’re quite mundane — just a pile of floating-point numbers arranged into many matrices.
A 700B (700-billion-parameter) model, at FP8 precision (1 byte per parameter), comes out to roughly 700 GB.
You can think of it as a giant cookbook with 700 billion rules. But you can’t flip through this cookbook casually — it has to be placed somewhere the chef can flip through fast before any cooking can happen.
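If you want to check the 700 GB figure yourself, the arithmetic is just parameters × bytes per parameter. A minimal sketch (the 700-billion count and the precision labels are illustrative round figures):

```python
# Back-of-the-envelope: how big is "a pile of numbers"?
num_params = 700e9                      # 700 billion parameters (illustrative round figure)

bytes_per_param = {"FP16": 2, "FP8": 1, "INT4": 0.5}
for precision, nbytes in bytes_per_param.items():
    print(f"{precision}: ~{num_params * nbytes / 1e9:,.0f} GB")
# FP16: ~1,400 GB    FP8: ~700 GB    INT4: ~350 GB
```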
2. VRAM (HBM): The “Workbench” Where the Chef Can Flip Through Fast
That “place where you can flip through fast” is the GPU’s VRAM (HBM).
🔑 Key fact 1: At runtime, all parameters must sit in VRAM.
A single H100 GPU only has 80 GB of VRAM, nowhere near enough for a 700 GB model. So in reality you need many GPUs stitched together — that’s where terms like “rack” and “cluster” come from.
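A rough sense of scale, counting nothing but the weights themselves (a real deployment also needs headroom for the per-user cache we’ll meet later):

```python
# How many 80 GB H100s just to *hold* a 700 GB model?
import math

model_gb = 700          # weights alone, at FP8
vram_per_gpu_gb = 80    # one H100
print(math.ceil(model_gb / vram_per_gpu_gb))   # 9, and that leaves no room for anything else
```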
3. Compute Units: The Workers Who Actually Do the Work
Inside the GPU there’s also a swarm of compute units that do multiplications and additions. You can picture the whole GPU as a factory:
- Compute units = workers: blazingly fast
- VRAM = warehouse: stores 700 GB of parameters
- Conveyor belt (memory bandwidth): hauls parameters from the warehouse to the workers
The awkward truth about modern GPUs is right here: the workers are absurdly fast, but the conveyor belt from the warehouse to the workers isn’t fast enough.
II. To Generate One Character, What Actually Happens?
Suppose you ask the model: “How’s the weather today?” And it answers, “Sunny today.”
The model doesn’t think up all those words at once; it spits them out one at a time:
Step 1: Sees "How's the weather today?" → outputs "Sun"
Step 2: Sees "How's the weather today? Sun" → outputs "ny"
Step 3: Sees "How's the weather today? Sunny" → outputs " to"
...
This “depend on all the previous tokens, then generate the next” approach is called autoregressive generation.
And to generate each token, the model’s entire parameter set has to be read in full.
🔑 Key fact 2: Every output token requires “moving” all parameters once.
Output 100 tokens, and you have to move 700 GB of parameters 100 times. The cost of this dominates everything.
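To make the loop concrete, here is a toy sketch of the generate-one-token-at-a-time process. The next_token function is a hypothetical stand-in, not a real model; the point is the shape of the loop and why it can’t be parallelized for a single user:

```python
# Toy sketch of autoregressive generation. `next_token` is a hypothetical stand-in:
# in a real LLM this call runs all 700 GB of weights, which is why every single
# loop iteration has to "move" the full parameter set again.
CANNED_REPLY = ["Sun", "ny", " to", "day", ".", "<end>"]

def next_token(context):
    # a real model would look at *everything* in `context` to pick the next token
    return CANNED_REPLY[len(context) - 1]

tokens = ["How's the weather today?"]
while tokens[-1] != "<end>":
    tokens.append(next_token(tokens))   # token N depends on tokens 1..N-1: strictly serial
print("".join(tokens[1:-1]))            # -> Sunny today.
```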
III. So Why Is It Slow? — A Counterintuitive Fact
You might assume the model is slow because computing is slow.
It isn’t. What’s actually slow is moving parameters from VRAM to the compute units.
Some numbers for intuition: an H100’s memory bandwidth is around 3 TB/s, which sounds like a lot, but moving 700 GB of parameters once takes at minimum:
700 GB ÷ 3 TB/s ≈ 0.23 seconds
That’s roughly 230 milliseconds per generated token: the physical floor for “running a 700B model on a single H100” (setting aside that 700 GB wouldn’t even fit in one card’s 80 GB of VRAM). Roughly 4 tokens per second.
🔑 Key fact 3: LLM inference is usually bottlenecked not by compute, but by memory bandwidth.
This state has a specific name: memory-bound (limited by memory bandwidth). Its counterpart is compute-bound (limited by computation). The whole story of inference optimization is about tugging back and forth between these two.
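To see just how lopsided the two sides are, here is the same arithmetic written out. The ~3 TB/s bandwidth is the figure used above; the ~2×10¹⁵ FP8 FLOP/s and the rule of thumb of roughly 2 FLOPs per parameter per generated token are ballpark assumptions:

```python
# Moving vs computing, per generated token, at batch size 1 (ballpark, illustrative).
params       = 700e9
weight_bytes = params * 1            # FP8: 1 byte per parameter
bandwidth    = 3e12                  # bytes per second
flops        = 2e15                  # FLOP per second (assumed ballpark for FP8)

move_time    = weight_bytes / bandwidth   # time to haul the weights past the workers
compute_time = 2 * params / flops         # time the workers actually spend multiplying
print(f"move: ~{move_time*1e3:.0f} ms, compute: ~{compute_time*1e3:.1f} ms")
# move: ~233 ms, compute: ~0.7 ms  ->  the workers sit idle ~99.7% of the time
```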
IV. Batch: Letting Multiple Users Share One “Move”
If every output token requires moving the parameters once, what’s the most wasteful thing?
Alice and Bob are both using ChatGPT. If the machine first computes a token for Alice (move parameters once), then computes a token for Bob (move parameters again), then the 700 GB of parameters gets moved twice.
But if Alice and Bob share one move — load the parameters once, then compute the next token for both at the same time — you only move once.
This is the core motivation behind batching:
🔑 The essence of batch: let one “parameter move” be shared by as many users as possible.
This also directly explains a phenomenon you may have wondered about: how can ChatGPT serve so many people at once, with everyone seeing roughly the same speed? Because everyone is sharing the same parameter move.
Note: batch only amortizes across “the same generation step from different users” — it cannot amortize across different tokens of the same user. Because tokens for the same user must be serial: token N depends on token N-1.
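A small sketch of what that sharing buys, using the same illustrative numbers as before and ignoring everything except the weight move: each user’s speed barely changes, but the GPU’s total output scales with the batch.

```python
# What sharing one weight move buys (memory-bound regime, illustrative numbers).
weight_bytes = 700e9
bandwidth    = 3e12
step_time    = weight_bytes / bandwidth     # ~0.23 s to move the weights once per step

for batch in (1, 8, 64):
    print(f"batch {batch:>2}: each user still waits ~{step_time:.2f} s per token, "
          f"total output ~{batch / step_time:.0f} tokens/s")
# batch  1: total ~4 tokens/s
# batch 64: total ~274 tokens/s, at no extra cost to any individual user
```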
The Ceiling of Batching
Batch size can’t be infinitely large. Back to the factory metaphor:
- Small batch: workers finish quickly and wait for the next batch of materials → bottleneck is the conveyor belt (memory-bound)
- Large batch: materials pile up and workers can’t keep up → bottleneck is the workers (compute-bound)
There’s a sweet spot in the middle: the workers finish just before the next batch of materials arrives. Around this point, system efficiency peaks. For modern GPUs, this value sits somewhere around 300 × sparsity.
🔑 The core tension of inference optimization: keep the conveyor belt and workers as balanced as possible — neither side waiting on the other.
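You can estimate where that sweet spot sits with the same back-of-the-envelope style: find the batch size at which the workers’ total compute time catches up with the conveyor belt’s move time. A minimal sketch for a dense FP8 model, using the same assumed figures as above (the result shifts with precision, hardware, and sparsity):

```python
# Estimating the sweet spot: the batch size at which compute time catches up
# with the weight-move time (dense FP8 model, same assumed figures as above).
params       = 700e9
weight_bytes = params * 1
bandwidth    = 3e12
flops        = 2e15

move_time             = weight_bytes / bandwidth   # fixed, regardless of batch size
compute_time_per_user = 2 * params / flops         # total compute time = this x batch size

critical_batch = move_time / compute_time_per_user # batch at which the two times are equal
print(f"~{critical_batch:.0f}")                    # ~333 with these assumed numbers
```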
V. KV Cache: Each User’s “Personal Notebook”
There’s a hidden problem we haven’t solved yet.
When the model computes Alice’s 51st token, it needs to “see” the first 50 tokens (Bob, who just started, has no history to look back at yet). How does the model “see” Alice’s previous 50 tokens?
The dumb way: feed those 50 tokens through the model again. But if Alice has already chatted for 10,000 tokens, re-running the previous 10,000 every time you generate a new one means the total work grows quadratically with the length of the conversation. Pure waste.
The engineering answer is KV cache:
🔑 The KV cache is the model’s internal notebook of “tokens already read.”
A metaphor: you’re reading a very long novel and taking notes as you go. By page 1000 —
- No cache: every page flip means re-reading all previous 999 pages → insane
- With cache: you’ve already condensed pages 1–999 into notes, just glance at the notes → normal
To generate each new token, the model only needs to: read its own historical notes → compute the new token → append the new token to the notes.
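To put a rough number on what the notebook saves, count how many tokens have to be pushed through the full model in total. This is a deliberately crude tally, and glancing at the notes isn’t free either; that remaining cost is exactly what the next subsection is about:

```python
# Counting what the notebook saves: how many tokens must be pushed through the
# full model in total, in order to work through n tokens of conversation.
def full_passes_without_cache(n):
    # token t means re-running the model over all t tokens so far
    return sum(range(1, n + 1))

def full_passes_with_cache(n):
    # token t only runs the model over the single new token; history lives in the notes
    return n

for n in (100, 1_000, 10_000):
    print(n, full_passes_without_cache(n), full_passes_with_cache(n))
# at 10,000 tokens: ~50 million token-passes without the notebook vs 10,000 with it
```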
The Critical Property of KV Cache: Private, Non-Sharable
This is the most important point in the whole story:
🔑 Each user’s KV cache is private and can’t be shared across the batch.
Alice’s notes are about Alice’s conversation; they have nothing to do with Bob. So reading Alice’s notes only serves Alice alone — adding more users won’t amortize the cost.
This means:
| Item | Scaling with batch size | Scaling with context length | Amortized by batching? |
|---|---|---|---|
| Moving the model parameters | Fixed | Fixed | Yes ✅ |
| Moving the KV cache | Grows linearly | Grows linearly | No ❌ |
| Compute | Grows linearly | Almost unchanged | (this is the useful work itself) |
Understand this table and you’ve understood 80% of LLM inference.
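Here is the table as a tiny cost sketch. Every number in it is made up purely for illustration: the ~100 KB of KV per token and the batch of 32 are assumptions, and real values vary a lot by architecture. The shape of the result is the point:

```python
# Bytes moved per generation step: the weights are shared by the whole batch,
# but every user drags along a private notebook. All figures are illustrative.
weight_bytes     = 700e9
kv_bytes_per_tok = 100e3     # assumed ~100 KB of KV per token per user
bandwidth        = 3e12

def step_time(batch, context_len):
    bytes_moved = weight_bytes + batch * context_len * kv_bytes_per_tok
    return bytes_moved / bandwidth

for ctx in (4_000, 50_000, 200_000):
    print(f"context {ctx:>7}: ~{step_time(32, ctx)*1e3:.0f} ms per step")
# 4K context:   ~238 ms (the weights dominate, amortized over 32 users)
# 200K context: ~447 ms (the 32 private notebooks dominate, and nobody shares that cost)
```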
VI. Using This Intuition to Explain Three Real Phenomena
Phenomenon 1: Why Does Ultra-Long Context Suddenly Cost More?
Google Gemini’s API price jumps 50% past 200K context. Why?
- Short context (a few thousand tokens): KV notes are tiny, the main cost is moving parameters, fully amortized by batch → cheap
- Past a critical point (≈ 200K): KV-move time starts to exceed parameter-move time → the system flips from weight-bound to KV-bound, and KV cost can’t be amortized → price must rise
200K isn’t a number pulled out of thin air — it’s a physically computed critical point.
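With the same made-up figures as the sketch above, you can even locate the flip point: the context length at which moving everyone’s notebooks costs as much as moving the shared weights. That it lands near 200K here is an artifact of the assumed numbers, but it shows the kind of calculation a provider can do with its real ones:

```python
# The flip point: the context length where moving the batch's KV caches costs
# as much as moving the shared weights. Same made-up figures as the sketch above.
weight_bytes, kv_bytes_per_tok, batch = 700e9, 100e3, 32

print(f"~{weight_bytes / (batch * kv_bytes_per_tok):,.0f} tokens")   # ~218,750
```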
Phenomenon 2: Why Are Output Tokens Several Times More Expensive Than Input Tokens?
- Processing input: the entire input can be computed in parallel in one shot, workers fully utilized, compute-bound, cheap
- Generating output: must spit out one token at a time, every token requires moving the parameters + own KV, memory-bound, expensive
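The asymmetry is easy to put numbers on if you simply count how many times the weights must be moved. This sketch is illustrative and ignores the batching that providers use to spread the output cost across users:

```python
# Count weight moves: the dominant cost in the memory-bound regime (illustrative).
weight_bytes = 700e9
bandwidth    = 3e12
n_tokens     = 1_000

prefill_passes = 1          # the whole prompt goes through the weights in one parallel pass
decode_passes  = n_tokens   # every generated token needs its own pass over the weights

print(f"reading 1,000 input tokens : ~{prefill_passes * weight_bytes / bandwidth:.2f} s of weight-moving")
print(f"writing 1,000 output tokens: ~{decode_passes  * weight_bytes / bandwidth:.0f} s of weight-moving")
# ~0.23 s vs ~233 s worth of weight moves
```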
Phenomenon 3: Why Can No Amount of Money Buy “Instant Answers”?
Even if you reserve the entire GPU for yourself alone, the model parameters still have to be moved from VRAM to the compute units. The speed of that is locked by physics (HBM bandwidth), and money can’t bypass it.
So “100× more money for 100× more speed” is impossible. Things like Claude’s FastMode work by spreading the same batch across more GPUs, giving an effectively higher bandwidth → faster for individual users; but past a certain point, the inter-GPU communication itself takes time, and the physical wall is still there.
VII. One-Picture Summary
Inference = “moving stuff” + “computing stuff”
What’s being moved:
- Model parameters (large, fixed, amortizable by batch)
- KV cache (per-user private, scales with users and context length, non-amortizable)
Time per token ≈ max(move time, compute time)
The core of optimization: keep the conveyor belt and workers balanced — neither side waiting on the other.
VIII. A Few Questions to Chew On
- If a company claims its “inference service is 10× faster than others,” what tactics might it be using? Which are real, and which just swap batching for dedicated capacity?
- Why does an MoE (sparse-expert) model actually push the optimal batch smaller at inference time? Hint: think about the word “sparsity.”
- If HBM bandwidth grows 10× in the future, what happens to LLM API prices? Which cost component will dominate then?