✏️ The Math Behind LLM Pricing 01: How Inference Actually Works — Starting with “Moving Stuff vs. Computing Stuff”

📖

This is the opening post in a series on LLM inference. Across the series we’ll use a bit of math to dig into how LLM inference services get priced — and the structural quirks that come with it: AI’s strong scale economies, the small-player trap, and why some pricing tricks just don’t work. This first post stays away from formulas and just builds intuition — let’s get a feel for things first.

I. First, Meet the Three Players

Have you ever wondered what’s actually happening inside the machine when you press Enter in ChatGPT?

To answer that, we need to put three players on the table first: model parameters, VRAM, and compute units.

1. Model Parameters: A Model Is Really Just a Pile of Numbers

You’ve probably heard things like “GPT-5 has hundreds of billions of parameters” or “DeepSeek is a 700B model.” Parameters sound mysterious, but they’re really just a pile of floating-point numbers, arranged into a bunch of matrices.

A 700B (700-billion-parameter) model, at FP8 precision (1 byte per parameter), comes out to roughly 700 GB.
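The size figure is just parameter count times bytes per parameter. Here is a minimal sketch of that arithmetic; the function name and the small precision table are illustrative and not taken from any library.

```python
# Bytes per parameter for a few common precisions (illustrative table).
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "fp8": 1}

def weight_size_gb(num_params: float, precision: str) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

print(weight_size_gb(700e9, "fp8"))   # 700.0
print(weight_size_gb(700e9, "fp16"))  # 1400.0
```

The same model at FP16 would need twice the space, which is one reason lower-precision formats matter for serving.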

You can picture it as a giant cookbook with 700 billion recipes in it. But this isn’t a cookbook you can flip through whenever you like — the whole thing has to sit open on the chef’s workbench, ready to consult at any moment, before any cooking can start.

2. VRAM (HBM): The “Workbench” Where the Chef Can Flip Through Fast

That “place where you can flip through fast” is the GPU’s VRAM (HBM).

🔑 Key fact 1: At runtime, all parameters must sit in VRAM.

A single H100 GPU only has 80 GB of VRAM, nowhere near enough for a 700 GB model. So in reality you need many GPUs stitched together — that’s where terms like “rack” and “cluster” come from.
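As a rough check on "many GPUs", the sketch below computes a lower bound on the GPU count from the weights alone. It deliberately ignores KV cache and activations (an assumption of this sketch), so a real deployment would need more headroom.

```python
import math

def min_gpus_for_weights(weight_gb: float, vram_per_gpu_gb: float) -> int:
    """Lower bound on GPU count: weights only, no KV cache or activations."""
    return math.ceil(weight_gb / vram_per_gpu_gb)

print(min_gpus_for_weights(700, 80))  # 9
```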

3. Compute Units: The Workers Who Actually Do the Work

Inside the GPU there’s also a swarm of compute units that do multiplications and additions. You can picture the whole GPU as a factory:

Compute units = workers: blazingly fast
VRAM = warehouse: stores 700 GB of parameters
Conveyor belt (memory bandwidth): hauls parameters from the warehouse to the workers

The awkward truth about modern GPUs is right here: the workers are absurdly fast, but the conveyor belt from the warehouse to the workers isn’t fast enough.

II. To Generate One Token, What Actually Happens?

Suppose you ask the model “How’s the weather today?” and the model answers “Sunny today.”

The model doesn’t think up all those words at once — it spits them out one at a time:


Step 1: Sees "How's the weather today?"           → outputs "Sun"
Step 2: Sees "How's the weather today? Sun"       → outputs "ny"
Step 3: Sees "How's the weather today? Sunny"     → outputs " to"
...

This “lean on the previous token to generate the next” trick has a formal name: autoregressive generation.
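The loop can be sketched in a few lines. Everything here is a toy: `generate`, `toy_model`, and the canned token list are hypothetical stand-ins, and a real model would return a probability distribution rather than a fixed string.

```python
from typing import Callable, List

def generate(prompt: List[str],
             next_token: Callable[[List[str]], str],
             max_new_tokens: int,
             stop: str = "<eos>") -> List[str]:
    """Autoregressive loop: each new token is appended and fed back in."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tok = next_token(tokens)   # one full pass over the model per token
        if tok == stop:
            break
        tokens.append(tok)
    return tokens[len(prompt):]

# Toy stand-in for a model: replays a canned answer, one token per call.
canned = ["Sun", "ny", " to", "day", ".", "<eos>"]

def toy_model(tokens: List[str]) -> str:
    return canned[len(tokens) - 1]   # the prompt counts as a single token here

print("".join(generate(["How's the weather today?"], toy_model, 10)))
# Sunny today.
```

Note that `next_token` is called once per output token, and each call sees everything generated so far. That per-token call is the unit of cost the rest of this post reasons about.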

And every time it generates one token, the model’s entire parameter set has to be moved in full.

🔑 Key fact 2: Every output token requires “moving” all parameters once.

Output 100 tokens and you’ve moved 700 GB of parameters 100 times. The cost of doing that is at the root of nearly every issue in LLM inference.

III. So Why Is Inference Slow? — A Counterintuitive Fact

You might assume the model is slow because computing is slow.

It isn’t. What’s actually slow is moving the parameters from VRAM to the compute units.

Just run the numbers: an H100’s memory bandwidth is about 3 TB/s, which sounds like a lot — but moving 700 GB of parameters once still takes at minimum:

$$
\frac{700 \text{ GB}}{3 \text{ TB/s}} \approx 0.23 \text{ s}
$$

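That’s about 230 milliseconds per generated token, or roughly 4 tokens/second. Treat it as an idealized floor: 700 GB doesn’t actually fit in one H100’s 80 GB, so the figure describes what a single H100’s bandwidth could deliver if the whole model did fit.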
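The same arithmetic as a small function, in case you want to plug in other model sizes or bandwidths. The function name is made up for this sketch, and the model assumes every weight byte is read exactly once per token.

```python
def decode_floor_seconds(weight_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Minimum time to stream all weights through the memory bus once."""
    return weight_bytes / bandwidth_bytes_per_s

t = decode_floor_seconds(700e9, 3e12)
print(f"{t * 1000:.0f} ms per token, {1 / t:.1f} tokens/s")
# 233 ms per token, 4.3 tokens/s
```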

🔑 Key fact 3: LLM inference is usually bottlenecked not by compute, but by memory bandwidth.

This state has a specific name: memory-bound (limited by memory bandwidth). Its counterpart is compute-bound (limited by computation). All of inference optimization, in the end, is just tugging back and forth between these two.

IV. Batching: Let Many Users Share One Move

If every output token requires moving the parameters once, what’s the most wasteful thing?

Alice and Bob are both using ChatGPT. If the machine first computes a token for Alice (move parameters once), then computes a token for Bob (move parameters again), then 700 GB of parameters got moved twice.

But if Alice and Bob share one move — load the parameters once, then compute the next token for both at the same time — you only move once.

This is what batching is fundamentally trying to do:

🔑 The essence of batching: let one “parameter move” be shared by as many users as possible.

This also incidentally explains something you may have wondered about: why can ChatGPT serve so many people at once, with everyone feeling roughly the same speed? Because everyone is sharing the same parameter move.
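To put a number on the sharing, the sketch below divides one weight load across the users in a batch. It is a simplified accounting that only looks at weight traffic, and the function name is hypothetical.

```python
def weight_traffic_per_user_gb(weight_gb: float, batch_size: int) -> float:
    """Weight GB moved per generated token, per user, when one load serves the whole batch."""
    return weight_gb / batch_size

for b in (1, 2, 64):
    print(b, weight_traffic_per_user_gb(700, b))
# 1 700.0
# 2 350.0
# 64 10.9375
```

Alice alone pays for the full 700 GB move; Alice and Bob together pay for half each; a batch of 64 pays for about 11 GB each.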

⚠️

Note: batching only amortizes across “the same generation step from different users”; it cannot amortize across different tokens of the same user. Tokens for the same user must be generated serially, because token N depends on token N-1.

The Ceiling of Batching

Batch size can’t be infinitely large. Back to the factory metaphor:

Small batch: workers finish quickly and wait for the next batch of materials → bottleneck is the conveyor belt (memory-bound)
Large batch: materials pile up and workers can’t keep up → bottleneck is the workers (compute-bound)

There’s a sweet spot in the middle: the workers finish just before the next batch of materials arrives. Around this point, system efficiency peaks. For modern GPUs, the balanced batch size sits somewhere around 300 × inverse sparsity (we’ll derive it in the next post).
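A back-of-envelope way to see where the balance point lands for a dense model. The assumptions are this sketch's, not the post's derivation: about 2 FLOPs per parameter per token, a peak of roughly 2e15 FP8 FLOP/s for an H100, 3 TB/s of bandwidth, and no KV cache traffic.

```python
def step_times(batch: int, params: float = 700e9, bytes_per_param: float = 1.0,
               bandwidth: float = 3e12, flops: float = 2e15):
    """Return (move_seconds, compute_seconds) for one decode step."""
    move = params * bytes_per_param / bandwidth   # independent of batch size
    compute = 2 * params * batch / flops          # grows linearly with batch size
    return move, compute

def balance_batch(bytes_per_param: float = 1.0,
                  bandwidth: float = 3e12, flops: float = 2e15) -> float:
    """Batch size at which move time equals compute time."""
    return flops * bytes_per_param / (2 * bandwidth)

print(round(balance_batch()))  # 333

for b in (8, 2048):
    move, compute = step_times(b)
    print(b, "memory-bound" if move > compute else "compute-bound")
# 8 memory-bound
# 2048 compute-bound
```

Under these assumed numbers the balance point comes out in the low hundreds, the same ballpark as the figure quoted above. Note that the model size cancels out: the balance point depends only on the ratio of compute throughput to bandwidth.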

🔑 The core tension of inference optimization: keep the conveyor belt and workers as balanced as possible — neither side waiting on the other.

V. KV Cache: Each User’s “Personal Notebook”

There’s a hidden problem we haven’t solved yet.

When the model computes Alice’s 51st token, it needs to “see” the first 50 tokens. Bob just started, so he sees nothing. How does the model “see” Alice’s previous 50 tokens?

The dumb way: feed those 50 tokens through again. But if Alice has already chatted for 10,000 tokens, re-running the previous 10,000 every time you generate a new one adds up to quadratic waste (generating N tokens this way piles up to roughly N²/2 redundant work).
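To make the waste concrete, the sketch below counts how many token positions pass through the model with and without a cache. It is a counting exercise only; `forward_token_count` is a hypothetical helper, and it treats every position as equal cost.

```python
def forward_token_count(n_new: int, prompt_len: int = 1, use_cache: bool = False) -> int:
    """Total token positions pushed through the model while generating n_new tokens."""
    total = 0
    seq_len = prompt_len
    for step in range(n_new):
        if use_cache and step > 0:
            total += 1          # only the newest token goes through the model
        else:
            total += seq_len    # the whole sequence goes through the model
        seq_len += 1
    return total

print(forward_token_count(10_000))                  # 50005000
print(forward_token_count(10_000, use_cache=True))  # 10000
```

For N = 10,000 the no-cache count is about N²/2 = 50 million positions, versus 10,000 with a cache.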

The engineering answer is KV cache:

🔑 The KV cache is the model’s internal notebook of “tokens already read.”

A metaphor: you’re reading a very long novel and taking notes as you go. By page 1000 —

No cache: every page flip means re-reading all previous 999 pages → insane
With cache: you’ve already condensed pages 1–999 into notes, just glance at the notes → normal

To generate each new token, the model only needs to: read its own historical notes → compute the new token → append the new token to the notes.
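Here is a toy version of that notebook for a single attention head, in plain Python. It is a sketch under heavy simplifications: real models keep one cache per layer and per head and compute separate query, key, and value projections, whereas this example reuses one toy vector for all three. In this sketch the current token's key and value are written before the read, so the token can attend to itself.

```python
import math
from typing import List

Vec = List[float]

def attend(query: Vec, keys: List[Vec], values: List[Vec]) -> Vec:
    """Softmax-weighted average of values, weighted by query-key dot products."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    m = max(scores)                                   # subtract max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / z for d in range(dim)]

class KVCache:
    """One user's notebook: keys and values of every token seen so far."""
    def __init__(self) -> None:
        self.keys: List[Vec] = []
        self.values: List[Vec] = []

    def step(self, query: Vec, key: Vec, value: Vec) -> Vec:
        self.keys.append(key)                          # append the new token to the notes
        self.values.append(value)
        return attend(query, self.keys, self.values)   # read the whole notebook

cache = KVCache()
toks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy vectors standing in for q = k = v
for t in toks:
    out = cache.step(t, t, t)

# Same result as recomputing attention over the full history from scratch.
assert out == attend(toks[-1], toks, toks)
print(len(cache.keys))  # 3
```

Each step does work only for the new token, but it still has to read every stored key and value. That read is the "move KV cache" cost discussed next.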

The Critical Property of KV Cache: Private, Non-Sharable

This is the most important point of all:

🔑 Each user’s KV cache is private and can’t be shared across the batch.

Alice’s notes only contain Alice’s conversation; they have nothing to do with Bob. So “reading Alice’s notes” only serves Alice alone — adding more users won’t amortize the cost.

This means:

Item	Scaling with batch size	Scaling with context length	Amortized by batching?
Move model parameters	Fixed	Unchanged	Yes ✅
Move KV cache	Linear growth	Linear growth	No ❌
Compute	Linear growth	Almost unchanged	(this is the work itself)

Understand this table and you’ve understood 80% of LLM inference.
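The table can be restated as a tiny cost model. All defaults are illustrative assumptions, in particular `kv_bytes_per_token` (set here to 100 KB, a made-up round number that varies a lot between architectures), and the compute term ignores the attention FLOPs that make it "almost" rather than exactly unchanged with context.

```python
def decode_step_costs(batch: int, context_len: int,
                      weight_bytes: float = 700e9,
                      kv_bytes_per_token: float = 100e3,   # assumed; model dependent
                      params: float = 700e9) -> dict:
    """Per-step totals for the whole batch."""
    return {
        "weight_bytes_moved": weight_bytes,                          # fixed
        "kv_bytes_moved": batch * context_len * kv_bytes_per_token,  # batch x context
        "flops": 2 * params * batch,                                 # batch only
    }

base = decode_step_costs(batch=64, context_len=4_000)
more_users = decode_step_costs(batch=128, context_len=4_000)
longer_ctx = decode_step_costs(batch=64, context_len=8_000)

# Doubling the batch: weights unchanged, KV traffic and compute double.
print(more_users["weight_bytes_moved"] == base["weight_bytes_moved"])  # True
print(more_users["kv_bytes_moved"] / base["kv_bytes_moved"])           # 2.0
# Doubling the context: only KV traffic doubles.
print(longer_ctx["kv_bytes_moved"] / base["kv_bytes_moved"])           # 2.0
print(longer_ctx["flops"] == base["flops"])                            # True
```

Dividing each entry by `batch` gives the per-user view: the weight term shrinks as the batch grows, while the KV term per user stays put.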

VI. Three Real-World Phenomena, Explained

Phenomenon 1: Why Does Ultra-Long Context Suddenly Cost More?

Historically, Google Gemini 1.5 Pro applied tier pricing past long-context thresholds (typically doubling the unit price). Why?

Short context (a few thousand tokens): KV notes are tiny, the main cost is moving parameters, fully amortized by batch → cheap
Past a critical point: KV-move time starts to exceed parameter-move time → the system flips from weight-bound to KV-bound, and KV cost can’t be amortized → price has to rise

That critical point isn’t pulled out of thin air; it’s a physically computed threshold. Later pricing has shifted: Gemini 2.5 Pro still tiers its price but at a higher threshold (200K tokens), and Gemini 2.5 Flash lists a single flat rate, which suggests architectural improvements that compress KV further.
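Setting KV traffic equal to weight traffic gives a rough formula for where the flip happens. The inputs below (batch of 256, 100 KB of KV per token) are invented for illustration and are not a reconstruction of any provider's actual threshold.

```python
def kv_crossover_context(weight_bytes: float, batch: int, kv_bytes_per_token: float) -> float:
    """Context length at which the batch's KV traffic equals one weight load."""
    return weight_bytes / (batch * kv_bytes_per_token)

print(int(kv_crossover_context(700e9, 256, 100e3)))  # 27343
```

The formula also shows the levers: a smaller KV footprint per token or a smaller batch pushes the crossover to longer contexts.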

Phenomenon 2: Why Are Output Tokens Several Times More Expensive Than Input Tokens?

Processing input (prefill): the entire input can be computed in parallel in one shot, workers fully utilized, closer to compute-bound, cheap (assuming the prompt isn’t too short)
Generating output (decode): the model must spit out one token at a time, and every token requires moving the parameters + the user’s own KV, memory-bound, expensive
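A rough comparison of GPU time per token in the two phases, using the same idealized model as before. The assumptions are this sketch's: 2 FLOPs per parameter per token, roughly 2e15 FLOP/s and 3 TB/s per GPU, and no KV cache traffic (which would make decode look even worse).

```python
def prefill_seconds_per_token(prompt_len: int, params: float = 700e9,
                              bytes_per_param: float = 1.0,
                              bandwidth: float = 3e12, flops: float = 2e15) -> float:
    move = params * bytes_per_param / bandwidth    # weights loaded once for the whole prompt
    compute = 2 * params * prompt_len / flops      # all prompt tokens processed in one pass
    return max(move, compute) / prompt_len

def decode_seconds_per_token(batch: int, params: float = 700e9,
                             bytes_per_param: float = 1.0,
                             bandwidth: float = 3e12, flops: float = 2e15) -> float:
    move = params * bytes_per_param / bandwidth    # weights loaded once per decode step
    compute = 2 * params * batch / flops           # one new token per user in the batch
    return max(move, compute) / batch

print(f"{prefill_seconds_per_token(4000) * 1e3:.2f} ms")  # 0.70 ms
print(f"{decode_seconds_per_token(64) * 1e3:.2f} ms")     # 3.65 ms
```

With a 4,000-token prompt, prefill is compute-bound and cheap per token. Decode at a batch of 64 is still waiting on the weight load, so each output token costs several times more GPU time. A very short prompt, say 16 tokens, would leave prefill memory-bound as well.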

Phenomenon 3: Why Can No Amount of Money Buy “Instant Answers”?

Even if you reserve the entire GPU for yourself alone, the model parameters still have to be moved from VRAM to the compute units. The speed of that is locked by physics (HBM bandwidth), and money can’t buy you out of it.

So “100× more money for 100× more speed” simply isn’t possible. The “fast tiers” some providers offer typically work by giving paying users a bigger share of the batch slots in an existing deployment, or by spreading the model across more GPUs to raise effective bandwidth — single users do get faster. But the trade-offs add up: batch amortization gets worse (cost goes up), and inter-GPU communication isn’t free either. The physical wall — moving 700 GB of parameters out of VRAM — doesn’t go away.
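The "more GPUs" route can be sketched with the same bandwidth arithmetic. This assumes the weights are split evenly and every GPU streams its shard in parallel; `comm_overhead_s` is a hypothetical placeholder for inter-GPU communication, which this sketch does not model.

```python
def sharded_decode_floor(weight_bytes: float, n_gpus: int,
                         bandwidth_per_gpu: float = 3e12,
                         comm_overhead_s: float = 0.0) -> float:
    """Per-token floor when weights are split evenly across n_gpus."""
    return weight_bytes / (n_gpus * bandwidth_per_gpu) + comm_overhead_s

for n in (9, 72):
    print(n, f"{sharded_decode_floor(700e9, n) * 1e3:.1f} ms")
# 9 25.9 ms
# 72 3.2 ms
```

The floor drops as GPUs are added, but it never reaches zero, and any nonzero `comm_overhead_s` puts a hard limit on how far the approach can go.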

VII. One-Picture Summary

🗺️

Inference = “moving stuff” + “computing stuff”

What’s being moved:

Model parameters (large, fixed, amortizable by batch)
KV cache (per-user private, scales with users and context length, non-amortizable)

Time per token ≈ max(move time, compute time)

The core of optimization: keep the conveyor belt and workers balanced — neither side waiting on the other.

VIII. A Few Questions to Chew On

If a company claims its “inference service is 10× faster than others,” what tactics might it be using? Which are real, and which just swap batching for dedicated capacity?
Why does an MoE (sparse-expert) model actually push the optimal batch larger at inference time? Hint: think about “inverse sparsity.”
If HBM bandwidth grows 10× in the future, what happens to LLM API prices? Which cost component would dominate after that change?