✏️ The Math Behind LLM Pricing 02: Writing Inference as Equations

📖

Last post we just got “a feel” for how LLM inference works at the hardware level. This time, let’s pick up the pen and turn those intuitions into equations — and you’ll see why every modern LLM inference service runs batch sizes squarely in the “thousands” range, and why long conversations push that optimal value even larger. These numbers aren’t engineers’ rules of thumb; they fall straight out of the hardware specs.

🧭

Recap: in the previous post, How Inference Actually Works — Starting with “Moving Stuff vs. Computing Stuff”, we built the core intuition: inference time = move time + compute time, whichever is slower wins; batching amortizes the cost of moving parameters, but it cannot amortize the KV cache.

I. Naming “Move” and “Compute”

Let’s give names to the two things we kept repeating in the last post:

Time spent moving stuff → call it T_memory
Time spent computing stuff → call it T_compute

So how long does it actually take to generate one token?

T \geq \max(T_{memory},\ T_{compute})

It’s max, not a sum. This detail matters a lot — Section VI is dedicated to it. For now, just remember: whichever is slower wins.

🔑 This is roofline analysis. The name sounds academic; in essence it’s just the line above. All of inference performance analysis starts from this one equation.

II. Writing Out T_compute

Recall: the work to compute, divided by worker speed, equals compute time.

Worker speed

How many floating-point operations per second can a GPU do? That number is called FLOPS (floating-point operations per second). For the H100 it is around 2 × 10¹⁵ at FP8 precision, i.e. 2 quadrillion operations per second. It’s a hardware constant: fixed at purchase, software can’t change it.

Workload

How many multiplications need doing? Two things determine that:

batch size (how many users served at once): each extra user adds another batch of compute. Denote this as B.
multiplications per user per token: proportional to “how many parameters actually participate in the computation.”

The second variable needs unpacking, because it pulls out a key concept.

A new concept: active parameters vs. total parameters

Most modern LLMs use the MoE (Mixture of Experts) architecture: even if the total parameter count is, say, 700B, only a small subset of those parameters actually participates in generating each token.

Analogy: you have a 700-page cookbook at home, but tonight you’re only making tomato-egg stir-fry, so you only flip 37 of those pages.

Total parameters N_total: total pages in the cookbook (700)
Active parameters N_active: pages actually flipped tonight (37)

For example DeepSeek V3: N_total = 671B, N_active = 37B.

📎

A quick technical aside: MoE’s sparse activation only happens in the FFN (feed-forward) layers — the attention layers’ Q/K/V/O projections still participate fully for every token. So N_active isn’t “the whole model scaled down by some ratio”; it’s “the full attention layers + only the experts the router selects in each FFN.” The 37B figure DeepSeek V3 reports already accounts for this.
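To make the aside concrete, here is a minimal sketch that counts N_total and N_active for a toy MoE. Every number in it is made up for illustration (the `ToyMoE` class and its fields are hypothetical, not DeepSeek V3's real layout), and it ignores embeddings and shared experts.

```python
from dataclasses import dataclass


@dataclass
class ToyMoE:
    """Hypothetical MoE layout; every number used below is illustrative only."""
    num_layers: int
    attn_params_per_layer: float  # Q/K/V/O projections, used by every token
    params_per_expert: float      # one FFN expert
    experts_per_layer: int        # experts stored in each layer
    experts_per_token: int        # experts the router selects per token

    def n_total(self) -> float:
        per_layer = self.attn_params_per_layer + self.experts_per_layer * self.params_per_expert
        return self.num_layers * per_layer

    def n_active(self) -> float:
        per_layer = self.attn_params_per_layer + self.experts_per_token * self.params_per_expert
        return self.num_layers * per_layer


toy = ToyMoE(num_layers=4, attn_params_per_layer=10e6,
             params_per_expert=5e6, experts_per_layer=16, experts_per_token=2)

print(f"N_total  = {toy.n_total() / 1e6:.0f}M")
print(f"N_active = {toy.n_active() / 1e6:.0f}M")
print(f"N_total / N_active = {toy.n_total() / toy.n_active():.1f}")
```

In this toy, scaling the whole model by the expert ratio (2 of 16) would suggest 45M active parameters, but the attention layers are always on, so the real count is 80M.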

🔑 Why split the two? Because they map to different costs:

Compute cost depends on N_active (only the pages you flipped open)

Move cost depends on N_total (the whole cookbook has to sit in VRAM, and gets read in full each time)

That’s the elegance of MoE — you trade sparse activation for cheaper compute, but the moving cost stays the same. This becomes crucial in Section V.

Writing T_compute

Each token does roughly 2 × N_active floating-point operations (each weight contributes one multiply + one add, which counts as 2 FLOPs), so when batch=B:

T_{compute} = \frac{B \times N_{active}}{C}

where C = FLOPS / 2, which you can read as “effective worker speed.”

🔑 Intuition: more users → more compute; more active params → more compute per user; faster workers → less time.
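As a quick sanity check, the formula can be evaluated directly. This is a minimal sketch using the DeepSeek V3 and H100 figures quoted above; the function name `t_compute` and the chosen batch sizes are just for illustration.

```python
def t_compute(batch_size: int, n_active: float, peak_flops: float) -> float:
    """Seconds of compute per decode step: B * N_active / C, with C = FLOPS / 2."""
    effective_speed = peak_flops / 2  # C: multiply-add pairs per second
    return batch_size * n_active / effective_speed


N_ACTIVE = 37e9    # DeepSeek V3 active parameters
H100_FLOPS = 2e15  # the H100 figure used in this post

for batch in (1, 128, 1024):
    ms = t_compute(batch, N_ACTIVE, H100_FLOPS) * 1e3
    print(f"B={batch:>5}: T_compute = {ms:.3f} ms")
```

Doubling B doubles the result, which is exactly the straight line through the origin drawn in Section IV.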

III. Writing Out T_memory

The previous post said: two things get moved — model parameters + KV cache.

Moving parameters

No matter how big the batch, the entire cookbook has to be read once per decoding step (every token uses every layer).

How much to move: N_total
Conveyor belt speed: memory bandwidth, denoted as BW

H100’s BW ≈ 3 TB/s, also a hardware constant.

T_{\text{move-params}} = \frac{N_{total}}{BW}

💡

Note: there’s no B in this term. No matter how many users, the time to move parameters is the same — that’s the fundamental reason batching can amortize parameter movement.

Moving the KV cache

Each user has their own KV “personal notebook,” and it cannot be amortized across the batch.

Bytes per token in the KV cache: denoted as b (lowercase, don’t confuse with the uppercase B for batch). In practice b = 2 × num_layers × num_kv_heads × head_dim × bytes_per_element, because every token stores one K and one V at every layer. For a typical large model (tens of layers, dozens of KV heads), b lands somewhere between tens and hundreds of KB per token. Lower numeric precision (FP16 → FP8) and aggressive KV-head sharing (GQA / MLA) both pull b down.
Conversation length per user: denoted as L (context length)
Total KV size for B users: proportional to B × L × b

T_{\text{move-KV}} = \frac{B \times L \times b}{BW}

⚠️

This term has B — bigger batch means more KV to move, non-amortizable. This difference decides everything.
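To get a feel for b, here is a small sketch of the formula above. The model shapes are hypothetical (80 layers, head_dim 128, with 64 or 8 KV heads) and only serve to show how precision and KV-head sharing move the number.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bytes_per_element: int) -> int:
    """b = 2 * layers * kv_heads * head_dim * bytes_per_element (one K and one V per layer)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


# Hypothetical shapes, chosen only for illustration.
configs = {
    "64 KV heads, FP16": (80, 64, 128, 2),
    " 8 KV heads, FP16": (80, 8, 128, 2),
    " 8 KV heads, FP8 ": (80, 8, 128, 1),
}

for name, shape in configs.items():
    b = kv_bytes_per_token(*shape)
    print(f"{name}: b = {b // 1024} KiB per token")
```

Sharing KV heads 8× cuts b by 8×, and halving the precision halves it again.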

Combining the move parts

T_{memory} = \underbrace{\frac{N_{total}}{BW}}_{\text{move-params}} + \underbrace{\frac{B \times L \times b}{BW}}_{\text{move-KV}}

It’s a sum here, because both use the same conveyor belt — one belt can’t move two things at once. This will matter again in Section VI.
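Here is a minimal sketch of T_memory with its two parts kept separate. It assumes one byte per parameter (FP8 weights) so that N_total can be divided by a bandwidth in bytes per second; the `bytes_per_param` argument is my own addition to make that assumption explicit, and the ~70 KB value of b is the MLA figure for DeepSeek V3.

```python
def t_memory(batch: int, ctx_len: int, n_total: float, kv_bytes: float,
             bw: float, bytes_per_param: float = 1.0):
    """Return (total, move_params, move_kv) in seconds for one decode step."""
    move_params = n_total * bytes_per_param / bw  # no B here: shared by the whole batch
    move_kv = batch * ctx_len * kv_bytes / bw     # grows with B: one notebook per user
    return move_params + move_kv, move_params, move_kv


N_TOTAL = 671e9       # DeepSeek V3 total parameters
KV_BYTES = 70 * 1024  # roughly 70 KB per token
H100_BW = 3e12        # bytes per second

for batch in (1, 128, 4096):
    total, params, kv = t_memory(batch, 1024, N_TOTAL, KV_BYTES, H100_BW)
    print(f"B={batch:>5}: params {params * 1e3:6.1f} ms + KV {kv * 1e3:7.2f} ms = {total * 1e3:6.1f} ms")
```

The params column stays fixed while the KV column scales with B, which is the whole difference between the two terms.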

IV. One Picture for Everything

Now we have two functions of time vs. batch size. Horizontal axis is B, vertical axis is time.

T_compute: a line that starts at zero


T_compute
   │           ╱
   │         ╱
   │       ╱
   │     ╱
   │   ╱
   │ ╱
   └───────────► B
   0

T_memory: a line that starts at a positive height


T_memory
   │              ╱
   │            ╱
   │          ╱     ← slope = L × b / BW (KV part)
   │        ╱
   │      ╱
   │    ╱
   │  ╱
   │╱  ← y-intercept = N_total / BW (params part)
   │
   └───────────► B
   0

Actual time: take the upper one

Drawn together, T is their max — whichever line is on top wins:


time
   │                 ╱  ← T_compute (B × N_active/C)
   │                ╱     starts at zero
   │               ╱
   │              ╱
   │             ╱  ╱  ← T_memory (N_total/BW + B×L×b/BW)
   │            ╱ ╱      meets y-axis at N_total/BW
   │           ╱╱
   │          ╳   ← crossover = sweet spot
   │        ╱╱
   │     ╱  ╱
   │   ╱   ╱
   │ ╱    ╱
   │    ╱
   │   ╱
   │ ╱
   └──────────────────────► B
        crossover

Three regions, physical meaning

Region	Dominant term	State	Meaning
Left (small B)	T_memory	memory-bound	workers waiting on the belt → more batch is a free lunch, latency barely changes
Crossover	T_memory = T_compute	perfect balance	belt and worker speeds match, peak efficiency
Right (large B)	T_compute	compute-bound	workers can’t keep up → more batch doesn’t help, latency rises linearly

🔑 The core goal of inference optimization is to pin the system near the crossover. That’s why every inference service does batching, continuous batching, dynamic scheduling, etc. — all to keep B at the sweet spot.
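The three regions can be seen by sweeping B through the two equations. This is a sketch under assumptions: DeepSeek V3 numbers, the H100 values used in this post (C ≈ 10¹⁵, BW ≈ 3 × 10¹²), one byte per parameter, and a deliberately short 128-token context so the KV term stays small.

```python
def step_time(batch, ctx_len, n_total, n_active, kv_bytes, bw, c):
    """Return (seconds per decode step, regime) under T = max(T_memory, T_compute)."""
    t_memory = n_total / bw + batch * ctx_len * kv_bytes / bw
    t_compute = batch * n_active / c
    regime = "memory-bound" if t_memory >= t_compute else "compute-bound"
    return max(t_memory, t_compute), regime


MODEL = dict(n_total=671e9, n_active=37e9, kv_bytes=70 * 1024)
H100 = dict(bw=3e12, c=1e15)
SHORT_CONTEXT = 128  # illustrative choice

for batch in (1, 256, 4096, 8192, 16384):
    seconds, regime = step_time(batch, SHORT_CONTEXT, **MODEL, **H100)
    print(f"B={batch:>6}: {seconds * 1e3:7.1f} ms  ({regime})")
```

Going from B=1 to B=256 changes the step time by well under 1%, while doubling B on the compute-bound side roughly doubles it.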

V. Solving for the Sweet Spot: Where the Mysterious 300 Comes From

The sweet spot is where the two lines cross, i.e. T_memory = T_compute.

To keep the algebra clean, simplify one step first: assume the context isn’t too long and the KV term is negligible compared to the parameter term (we’ll come back to check this assumption shortly). That is:

T_{memory} \approx \frac{N_{total}}{BW}

Set the two sides equal:

\frac{N_{total}}{BW} = \frac{B \times N_{active}}{C}

Solve for B:

B = \underbrace{\frac{N_{total}}{N_{active}}}_{\text{model side}} \times \underbrace{\frac{C}{BW}}_{\text{hardware side}}

🔑 And here’s the elegant part: the formula falls cleanly into two halves — the left depends only on the model, the right only on the hardware. The two worlds don’t interfere.

The model term: inverse of sparsity

N_total / N_active is “total cookbook / pages flipped tonight” — i.e. the inverse of sparsity.

Dense model: every parameter used, N_total = N_active, ratio = 1
DeepSeek V3: 671B / 37B ≈ 18
Sparser MoEs: can reach 30+

The hardware term: the mysterious 300

C / BW: worker speed divided by belt speed. Plug in H100’s numbers (C ≈ 10¹⁵ operations/s, BW ≈ 3 × 10¹² bytes/s). Assuming one byte per parameter, as with FP8 weights, the units line up:

\frac{C}{BW} \approx 300

That’s the mysterious 300 engineers talk about. It’s a physical constant of contemporary GPUs — A100, H100, B100 all have this ratio between 200 and 400.

🔑 Why is this ratio so large? Over the last decade, GPU compute has grown much faster than memory bandwidth. Compute went up tens of fold, bandwidth only a few — the ratio stretched out to ~300. This is the physical root cause of “modern GPUs being memory-bound.”

Final result

🎯

Putting it together, the sweet-spot formula (when KV is negligible):

B_{\text{optimal}} \approx 300 \times \frac{N_{total}}{N_{active}}

The left is a hardware constant (≈ 300), the right is the inverse of model sparsity.

Plug in DeepSeek V3:

B_{\text{optimal}} \approx 300 \times 18 \approx 5400

That’s why every major inference service runs batches in the “thousands” range — it’s not engineers’ rule of thumb; it’s a physical conclusion that falls out of the hardware specs.
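The same arithmetic in code, as a rough sketch with the H100 and DeepSeek V3 figures from this post (the function name `optimal_batch` is mine, and one byte per parameter is assumed). Without rounding 333 down to 300 and 18.1 down to 18, the result lands near 6000 rather than 5400: same order of magnitude, same conclusion.

```python
def optimal_batch(n_total: float, n_active: float,
                  peak_flops: float, bandwidth: float) -> float:
    """Sweet spot with the KV term ignored: (N_total / N_active) * (C / BW)."""
    c = peak_flops / 2  # effective worker speed
    model_side = n_total / n_active
    hardware_side = c / bandwidth
    return model_side * hardware_side


H100 = dict(peak_flops=2e15, bandwidth=3e12)

moe = optimal_batch(n_total=671e9, n_active=37e9, **H100)
dense = optimal_batch(n_total=70e9, n_active=70e9, **H100)  # any dense model: ratio = 1

print(f"DeepSeek V3 on H100: B_optimal ≈ {moe:.0f}")
print(f"Dense model on H100: B_optimal ≈ {dense:.0f}")
```

For a dense model the model-side ratio is 1, so the sweet spot is just the hardware ratio.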

VI. Two Counterintuitive Details

There are two places in this framework that are surprising at first but click once you see them.

Detail 1: Why T = max(move, compute), not T = move + compute?

The answer hides the core idea of GPU design —

🔑 Inside a GPU, the moving circuitry and the compute circuitry are two independent pieces of hardware that can work in parallel.

Analogy: in a factory, conveyor-belt workers (moving circuitry) and assembly workers (compute circuitry) are two different teams that can work simultaneously — while the belt is moving the next batch of materials, the assemblers are processing the previous one.


Case A: slow move, fast compute (memory-bound)
  move:    ████████████████████  (20ms)
  compute: ████  (4ms, then idle)
  total:   20ms ← move wins

Case B: fast move, slow compute (compute-bound)
  move:    ████  (4ms, then idle)
  compute: ████████████████████  (20ms)
  total:   20ms ← compute wins

The two workers overlap rather than relay, so it’s max not sum. All modern GPUs/TPUs/AI accelerators do this kind of overlap — that’s one of the core reasons they’re fast.
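A toy timeline makes the overlap tangible. The sketch below is an idealized model, not how any particular GPU schedules work: it assumes the belt streams data back to back with unlimited buffering, and that compute on a step can start as soon as its data has arrived and the previous step's compute is done.

```python
def pipelined_total(move_ms: float, compute_ms: float, steps: int) -> float:
    """Total time when moving and computing overlap (idealized pipeline)."""
    move_done = 0.0
    compute_done = 0.0
    for _ in range(steps):
        move_done += move_ms  # the belt never waits
        compute_done = max(move_done, compute_done) + compute_ms
    return compute_done


def serial_total(move_ms: float, compute_ms: float, steps: int) -> float:
    """Total time if each step moved first and only then computed."""
    return steps * (move_ms + compute_ms)


STEPS = 100
for label, move_ms, compute_ms in (("Case A", 20.0, 4.0), ("Case B", 4.0, 20.0)):
    overlapped = pipelined_total(move_ms, compute_ms, STEPS) / STEPS
    relay = serial_total(move_ms, compute_ms, STEPS) / STEPS
    print(f"{label}: overlapped {overlapped:.2f} ms/step, relay {relay:.2f} ms/step")
```

Over many steps the overlapped cost per step approaches max(move, compute), while the relay version pays the sum every time.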

Detail 2: But why is T_memory itself a sum?

You might immediately spot a contradiction: above we said T = max, then why is T_memory = T_move-params + T_move-KV not also written as max?

The answer is simple:

🔑 Rule: same resource → sum; different resources → max.

Inside T_memory (move params + move KV): the same conveyor belt (HBM bandwidth), can’t run in parallel → sum
T_memory vs T_compute: two different pieces of hardware (belt vs workers), can run in parallel → max

Once you internalize this rule, the entire inference performance analysis logic becomes solid.

VII. Re-checking: What About Long Context?

In Section V we cheated — we threw away the KV term. What happens if a user holds a really long conversation (say, analyzing a 200K-token document) and KV grows too large to ignore?

Geometrically:

The KV term makes T_memory’s slope steeper
T_compute is unchanged
The two lines’ crossover shifts to the right

🔑 Conclusion: long-context scenarios need a larger batch to reach balance.
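The shift can also be written out. Keeping the KV term and setting T_memory = T_compute, a short derivation (my own algebra from the two equations in Sections II and III, with B* denoting the crossover) gives:

```latex
\frac{N_{total}}{BW} + \frac{B \cdot L \cdot b}{BW} = \frac{B \cdot N_{active}}{C}
\quad\Longrightarrow\quad
B^{*} = \frac{N_{total} / BW}{\dfrac{N_{active}}{C} - \dfrac{L \cdot b}{BW}}
      = \frac{N_{total}}{N_{active}} \cdot \frac{C}{BW} \cdot
        \frac{1}{1 - \dfrac{L \cdot b}{N_{active}} \cdot \dfrac{C}{BW}}
```

With L = 0 the last factor is 1 and the Section V formula comes back. As L grows the factor grows, so B* moves right, and once L · b / BW reaches N_active / C the two lines no longer cross at all.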

This also explains a real phenomenon: services handling long documents/long conversations rely more heavily on user volume to operate. KV pressure shifts the sweet spot rightward, requiring more concurrent users to fill the GPU. Without enough users, machines wobble in the inefficient region — that’s why long-context APIs are both expensive and hard to operate.

Going further: at some critical context length, KV-move time fully overtakes parameter-move time, and the system flips from “weight-bound” to “KV-bound.” This is the physical root of the tier pricing Gemini 1.5 Pro historically applied past long-context thresholds — the threshold isn’t arbitrary, it’s computed.
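The flip point is easy to estimate: KV-move time equals parameter-move time when B × L × b = N_total (in bytes). Below is a rough sketch assuming one byte per parameter and the ~70 KB per token MLA figure for DeepSeek V3; the function name is hypothetical, and the numbers illustrate the mechanism only, not any provider's actual pricing threshold.

```python
def kv_overtakes_weights_at(weight_bytes: float, batch: int, kv_bytes: float) -> float:
    """Context length L at which B * L * b equals the bytes of weights moved per step."""
    return weight_bytes / (batch * kv_bytes)


WEIGHT_BYTES = 671e9  # DeepSeek V3 at an assumed 1 byte per parameter
KV_BYTES = 70 * 1024  # roughly 70 KB per token

for batch in (128, 1024, 4096):
    flip = kv_overtakes_weights_at(WEIGHT_BYTES, batch, KV_BYTES)
    print(f"B={batch:>5}: KV traffic overtakes weight traffic past L ≈ {flip:,.0f} tokens")
```

The larger the batch, the earlier the flip: at thousands of concurrent users, a few thousand tokens of context per user is already enough for KV traffic to dominate.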

Below is an interactive simulator. Drag L, b, BW, N_active and watch how the crossover B* between T_compute and T_memory shifts:

T_compute / T_memory Batch Size Simulator

T_compute = B × N_active / C, T_memory = N_total / BW + B × L × b / BW

Defaults: DeepSeek V3 (671B total / 37B active, MLA brings KV down to ~70 KB / token) on H200 (4.8 TB/s HBM3e, ~1 PFLOPS effective FP8), with a 1K context (single-turn-query scale). The sweet spot sits at B* ≈ 6400, clearly visible. Drag L up to 8K to enter the long-context regime and you'll see the sweet spot slide off the right edge into KV-dominated land. Bigger GPUs or multi-GPU parallelism don't save you, because the C/BW ratio (~300) barely changes across hardware generations.

Adjustable parameters

Control	Meaning	Range
Current B	current batch size, marked as a vertical line on the chart	1 to 8K
Max B	x-axis range (common values: 512 / 2048 / 4096)	128 to 8K
N_active	parameters activated per token; MoE models are typically smaller	1M to 500G params
C	effective compute throughput	10G to 5P ops/s
N_total	total params (or data) that must be read from memory	1M to 1T params
BW	effective memory bandwidth	1G to 10T B/s
L	average sequence / KV length per user in the batch	128 to 1M tokens
b	total KV bytes per token (= 2 × layers × KV heads × head_dim × bytes_per_element); typically 4 KB to several hundred KB	1.02K to 1.05M B/token

Readouts at the defaults (B = 128)

Readout	Value	Formula
Bottleneck	Memory-bound	
Crossover B*	≈ 6.44K	
T_compute	4.736 ms	B × N_active / C
T_memory	141.7 ms	weight + KV
Estimated latency	141.7 ms	max(T_compute, T_memory)
Memory weight term	139.8 ms	N_total / BW, sets the intercept
Memory KV term	1.957 ms	B × L × b / BW, 1.38% of total
Memory slope	0.01529 ms	T_memory increase per +1 B
Compute / Memory ratio	3.34e-2	T_compute / T_memory

Reading the chart

T_compute slope is N_active / C; T_memory slope is L × b / BW, intercept N_total / BW. b is the total KV bytes per token. It already bakes in layers, KV heads, and head_dim, landing in the few-KB to few-hundred-KB range, so increasing L, increasing b, or lowering BW will all visibly steepen T_memory.

Current parameters

N_active=37G, C=1P ops/s, N_total=671G, BW=4.8T B/s, L=1.02K, b=71.68K B/token.
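The simulator's readouts follow directly from the two equations, so they can be reproduced offline. This is a minimal sketch rather than the simulator's actual code: the function and dictionary keys are my own naming, and it takes the listed defaults at face value (L = 1024, b = 71,680 B/token, one byte per parameter).

```python
def simulate(batch, ctx_len, n_total, n_active, kv_bytes, bw, c):
    """Evaluate T_compute, T_memory and the crossover B* for one parameter set."""
    t_compute = batch * n_active / c
    weight_term = n_total / bw
    kv_term = batch * ctx_len * kv_bytes / bw
    t_memory = weight_term + kv_term
    slope_gap = n_active / c - ctx_len * kv_bytes / bw  # T_compute slope minus T_memory slope
    crossover = weight_term / slope_gap if slope_gap > 0 else None  # None: lines never cross
    return {
        "t_compute_ms": t_compute * 1e3,
        "t_memory_ms": t_memory * 1e3,
        "weight_ms": weight_term * 1e3,
        "kv_ms": kv_term * 1e3,
        "latency_ms": max(t_compute, t_memory) * 1e3,
        "bottleneck": "Memory-bound" if t_memory >= t_compute else "Compute-bound",
        "crossover": crossover,
    }


DEFAULTS = dict(n_total=671e9, n_active=37e9, kv_bytes=71_680, bw=4.8e12, c=1e15)

short = simulate(batch=128, ctx_len=1024, **DEFAULTS)
long_ctx = simulate(batch=128, ctx_len=8192, **DEFAULTS)

for key, value in short.items():
    print(f"{key:>13}: {value:.4g}" if isinstance(value, float) else f"{key:>13}: {value}")
print("crossover at L=8K:", long_ctx["crossover"])
```

At L = 8K the T_memory slope exceeds the T_compute slope, so the function reports no crossover, which matches the sweet spot sliding off the right edge of the chart.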

VIII. Wrapping Up with One Table

Symbol	Meaning	Determined by
`B`	batch size, concurrent users	scheduling policy
`L`	context length	user behavior
`N_total`	total model parameters	model design
`N_active`	parameters activated per token	model design (sparsity)
`b`	bytes per token in the KV cache (= 2 × num_layers × num_kv_heads × head_dim × bytes_per_element)	model architecture + numeric precision
`C`	effective compute speed (≈ FLOPS / 2)	hardware constant
`BW`	memory bandwidth	hardware constant
`C/BW`	≈ 300, the hardware scale	hardware constant

Core equation:

T = \max\left(\frac{N_{total}}{BW} + \frac{B \cdot L \cdot b}{BW},\ \frac{B \cdot N_{active}}{C}\right)

Sweet spot (when KV is negligible):

B_{\text{optimal}} \approx 300 \times \frac{N_{total}}{N_{active}}

IX. A Few Questions to Chew On

If a company claims a “new algorithm” makes inference 10× faster, what should you ask? Hint: did they change N_active, b, or C/BW?
What’s the optimal batch for a dense model (N_total = N_active)? Why do dense models suffer more in long-context scenarios?
If a future GPU raises BW by 10× while keeping C fixed, what does C/BW become? How does the sweet spot move? Which model class benefits the most?
If a service has only a few users but every user holds super-long conversations, where does its batch get stuck? What happens to its margin?

🚀

Up next: we can compute latency now, but a latency chart isn’t a cost chart — same equations, different Y-axis, completely different story. Next post we switch to the operator’s perspective: why does running without batching cost thousands of times more? Where do AI services’ famous scale economies actually come from? And why does “expensive + fast” work, while “cheap + slow” runs straight into a physical wall?