
✏️ The Math Behind LLM Pricing 03: From Inference Latency to Inference Cost

📖

Last post we wrote inference as equations and solved for the latency-optimal “sweet spot.” In this post, we’ll look at the same equations from a different angle, switching from the user’s perspective to the operator’s. Same underlying formulas; the Y-axis changes, and the whole story changes with it. By the end, you’ll understand why running without batching costs thousands of times more than running at the sweet spot, where AI services’ famous scale economies actually come from, what the real business logic of third-party inference providers is, and why “expensive + fast” is doable while “cheap + slow” hits a physical wall.

🧭

Recap: in the previous post, Writing Inference as Equations, we derived “sweet spot ≈ 300 × inverse sparsity” — the latency-optimal batch size. But that chart answers a user-side question — “how long do I wait?” This post switches to the operator side — “how much do I spend per token produced?”

I. Latency vs. Cost: Two Different Things

Let’s separate two things first: latency and cost look at the same inference process, but from completely different angles.

Latency (the chart from the last post)

  • I’m the user, staring at the screen
  • I care about: how long do I wait for a token?
  • Y-axis = T (total time to generate one token)

Cost (the chart we’ll draw here)

  • I’m the operator (OpenAI, Anthropic, some provider)
  • I care about: how much does each token cost me to produce?
  • Y-axis = ?

So what should the Y-axis actually be? Let’s think it through.


II. Deriving the Per-Token Cost Formula

Where does the money go?

When you operate an inference service, what you pay for is “GPU rental time.” An H100 runs roughly $2/hour. So the total cost over a span of time is:

$$\text{Total cost} = \text{GPU price} \times \text{duration}$$

To simplify, set the GPU price to 1 (any monetary unit). Then the total cost of one inference equals T:

$$\text{Total cost of one inference} = T$$

The unit: cost per token

But operators don’t care about “cost per inference” — they care about “cost per token” — because that’s what they bill users by.

So how many tokens did this inference produce?

🔑 Here’s the key: one “forward pass” (moving all parameters once + computing once) produces B tokens — every user in the batch gets one new token.

So:

$$\text{Cost per token} = \frac{\text{Total cost}}{\text{Tokens produced}} = \frac{T}{B}$$

🔑 This is the T/B engineers keep writing on whiteboards. That’s the Y-axis.
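
As a quick sanity check on units, here is a minimal sketch of how GPU-seconds per token turn into dollars per token. It assumes the $2/hour H100 figure mentioned above; the forward-pass time and batch size are purely hypothetical placeholders.

```python
# Minimal sketch: converting forward-pass time into dollars per token.
# The $2/hr figure comes from the text; T and B below are illustrative placeholders.

GPU_PRICE_PER_HOUR = 2.0                      # H100 rental, $/hour
GPU_PRICE_PER_SECOND = GPU_PRICE_PER_HOUR / 3600

def dollars_per_token(T: float, B: int) -> float:
    """One forward pass costs T seconds of GPU time and yields B tokens."""
    return GPU_PRICE_PER_SECOND * T / B

# Hypothetical numbers: a 50 ms forward pass serving a batch of 1000 users.
print(dollars_per_token(T=0.050, B=1000))     # ~2.8e-8 $/token ≈ $0.028 per 1M tokens
```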


III. Writing T/B Out: Amortizable vs. Non-Amortizable

Recall the core equations from the last post:

$$T = \max(T_{memory},\ T_{compute})$$

$$T_{memory} = \frac{N_{total}}{BW} + \frac{B \times L \times b}{BW}, \quad T_{compute} = \frac{B \times N_{active}}{C}$$

Cost per token = T/B. Divide every term by B, and the three pieces split into surprisingly distinct shapes:

T_compute / B: a compute constant

$$\frac{T_{compute}}{B} = \frac{B \times N_{active} / C}{B} = \frac{N_{active}}{C}$$

🔑 The compute cost is a constant — independent of B. Makes sense: every user has to “do their share of compute,” so the average compute time per user is fixed.

T_memory / B: a hyperbola plus a constant

$$\frac{T_{memory}}{B} = \underbrace{\frac{N_{total}}{B \times BW}}_{\text{hyperbola}} + \underbrace{\frac{L \times b}{BW}}_{\text{constant}}$$

🔑 The two terms have completely different shapes:

  • The parameter-moving term divided by B becomes a hyperbola — bigger B means smaller value. This is the economic value of batching!
  • The KV-moving term divided by B becomes a constant — because B appears in both numerator (more users → more KV to move) and denominator (split across more users), it cancels exactly.

The latter is exactly what we kept saying — “KV doesn’t amortize.” No matter how big the batch, the per-user KV-movement cost stays the same.

Putting it together: cost per token

🎯
$$\text{Cost/token} = \max\left(\underbrace{\frac{N_{total}}{B \times BW}}_{\text{move-params (hyperbola)}} + \underbrace{\frac{L \times b}{BW}}_{\text{move-KV (constant)}},\ \underbrace{\frac{N_{active}}{C}}_{\text{compute (constant)}}\right)$$

Of the three terms, only the first is amortized by B; the other two are not.
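
To make the shape concrete, here is a small sketch of the formula in relative units: BW is normalized to 1, C/BW ≈ 300 as for an H100-class chip, inverse sparsity 18 as for DeepSeek V3, and a deliberately small illustrative KV term. The absolute numbers are placeholders; only the ratios matter.

```python
# Sketch of cost/token = max(N_total/(B*BW) + L*b/BW, N_active/C) in relative units.
# BW is normalized to 1; C/BW ≈ 300 (H100-like); N_total/N_active = 18 (DeepSeek-V3-like).
# The KV term L*b is a small illustrative placeholder, not a measured value.

BW = 1.0
C = 300.0          # so C/BW = 300
N_ACTIVE = 1.0
N_TOTAL = 18.0     # inverse sparsity = 18
L_TIMES_B = 1e-4   # per-user KV traffic, illustrative

def cost_per_token(B: float) -> float:
    move_params = N_TOTAL / (B * BW)   # amortized by B (hyperbola)
    move_kv     = L_TIMES_B / BW       # constant, does not amortize
    compute     = N_ACTIVE / C         # constant, does not amortize
    return max(move_params + move_kv, compute)

for B in [1, 10, 100, 1000, 5400, 20000]:
    print(f"B={B:>6}  cost/token={cost_per_token(B):.6f}")

# Output shape: cost falls like 1/B, then flattens at the compute floor
# N_active/C ≈ 0.00333 once B passes roughly 300 × inverse sparsity ≈ 5400.
```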


IV. Drawing the Cost Chart

Horizontal axis is still B, but Y-axis is now cost per token. Let’s plot each term.

Term 1: move-params / B (hyperbola)

N_total / (B × BW) — bigger B, smaller value:

cost
 │╲
 │ ╲
 │  ╲
 │   ╲
 │    ╲
 │     ╲___
 │         ─────────────
 └─────────────────────► B
 0

🔑 The closer B gets to 0, the more outrageous this term becomes — a single user shoulders all the parameter-movement cost alone, with zero amortization, and ends up insanely expensive. That’s why “running without batching” simply doesn’t make economic sense.

Term 2: move-KV / B (constant, the KV line)

L × b / BW — a horizontal line, height set by context length.

Term 3: compute / B (constant, the compute line)

N_active / C — another horizontal line.

Stacking them: be careful with the geometry

When you stack the three terms, here’s the key question: which is higher, the compute line or the KV line?

For a sweet spot to exist (where the descending hyperbola intersects the horizontal line), the compute-constant line must sit above the KV-constant line:

$$\frac{N_{active}}{C} > \frac{L \times b}{BW}$$

🔑 Physical meaning: in “normal” model + context configurations, compute is a more “expensive” cost than KV — so beyond the sweet spot, the bottleneck is compute, not KV. This is consistent with the previous post’s assumption that “in short-context scenarios the memory term is negligible.”

So the cost chart actually looks like this:

cost
 │╲
 │ ╲      ← hyperbola segment (memory-bound)
 │  ╲       hyperbola + KV constant
 │   ╲
 │    ╲
 │     ╲___
 │  ────────╳──────────── ← compute constant N_active/C
 │          ↑ sweet spot
 │  ──────────────────────  ← KV constant L×b/BW (below compute)
 └──────────────────────► B

The final cost curve = take whichever is on top:

cost
 │╲
 │ ╲      ← memory-bound region (hyperbola dominates)
 │  ╲
 │   ╲
 │    ╲___
 │  ────────╳──────────── ← compute-bound region (compute dominates)
 │          ↑
 │      sweet spot
 └──────────────────────► B
 0
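
Under the same relative-unit assumptions as the sketch above, you can solve for where the descending curve meets the compute line. As long as the condition N_active/C > L×b/BW holds, the crossover exists and sits near 300 × inverse sparsity; the numbers below are illustrative placeholders.

```python
# Sketch: solve N_total/(B*BW) + L*b/BW = N_active/C for the crossover batch size.
# Same illustrative relative units as before (placeholders, not measurements).

BW, C = 1.0, 300.0
N_ACTIVE, N_TOTAL = 1.0, 18.0
L_TIMES_B = 1e-4

compute_line = N_ACTIVE / C          # ≈ 0.00333
kv_line      = L_TIMES_B / BW        # 0.0001

assert compute_line > kv_line, "no crossover: KV line sits above the compute line"

# hyperbola + KV = compute  =>  B* = (N_total/BW) / (N_active/C - L*b/BW)
B_star = (N_TOTAL / BW) / (compute_line - kv_line)
print(B_star)  # ≈ 5567 here; with a negligible KV term it tends to 18 × 300 = 5400
```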

V. Reading the Chart: Three Regions

| Region | Dominator | State | Operator’s situation |
|---|---|---|---|
| B very small | hyperbola | costs explode | operator hemorrhages money |
| Sweet spot | hyperbola = compute | near minimum cost | optimal operating point |
| B very large | compute constant | flat at compute floor | saturated, no further gain |

The cost floor

Notice the height of the right-side horizontal asymptote. Under normal configurations (compute line above KV line):

$$\text{Cost/token floor} \approx \frac{N_{active}}{C}$$

🔑 This is the physical lower bound on per-token cost. It’s “the cost of each user’s share of compute, with workers running at full speed” — no batch size can reduce it further, because the compute work is per-user-specific and batching can’t help.
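
If you want the floor in billing units rather than time units, the conversion is just the GPU price again. Here is a minimal sketch; the per-token floor time is a hypothetical input, not a value derived from any specific chip or model.

```python
# Sketch: convert a per-token floor time (N_active/C, in seconds) into $/1M tokens.
# The 74-microsecond floor below is a hypothetical placeholder, not a measured value.

GPU_PRICE_PER_HOUR = 2.0                       # from the text
floor_seconds_per_token = 74e-6                # hypothetical N_active/C

dollars_per_million = floor_seconds_per_token * (GPU_PRICE_PER_HOUR / 3600) * 1e6
print(f"${dollars_per_million:.3f} per 1M output tokens at the floor")  # ≈ $0.041
```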


VI. Three Common Questions, Answered

With this chart, several questions that came up in earlier posts get clean geometric answers.

Q1: Where does “non-batched is 1000× worse” come from?

Plug in B=1 (a single user, sole occupancy):

$$\text{Cost/token}(B=1) \approx \frac{N_{total}}{BW}$$

Plug in the sweet spot (B ≈ thousands):

$$\text{Cost/token}(\text{sweet spot}) \approx \frac{N_{active}}{C}$$

The ratio:

$$\frac{N_{total}/BW}{N_{active}/C} = \frac{N_{total}}{N_{active}} \times \frac{C}{BW} = \text{inverse sparsity} \times 300$$

For DeepSeek V3: 18 × 300 ≈ 5400×.

🔑 That’s where the “1000× difference” actually comes from — it’s 5400×. Running without batching just doesn’t pencil out; this isn’t engineering laziness, it’s physics.

Of course nobody actually runs at B=1 in production — continuous batching, request queueing, and other mechanisms almost always pull B up. The 5400× is a theoretical upper bound; its real meaning isn’t “you’d really lose this much,” it’s: batching isn’t a performance tweak, it’s an economic necessity.

Q2: Why do AI services have such strong scale economies?

Picture the cost curve as a ladder:

  • Big players (OpenAI / Anthropic): plenty of users, batch always pinned at the sweet spot → per-user cost near the floor
  • Small players: not enough users to reach the sweet spot, e.g. batch maxes at 500 while optimal is 5000 → stuck mid-way down the hyperbola
  • Dedicated mode: one enterprise customer monopolizes a GPU → stuck at the leftmost edge, costs scaling as 1/B

🔑 “User volume directly converts to margin” is a structural property of AI services — as long as you’re still on the hyperbola, every additional user is decreasing marginal cost. That’s why big providers have a structural cost advantage over small ones — small providers can’t fill batches, and pay a permanent per-user cost premium.
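
To put rough numbers on the small-player penalty, here is a sketch using the same illustrative relative units as earlier; the 500 vs 5400 batch sizes echo the example in the list above.

```python
# Sketch: cost penalty for a provider that can only partially fill the batch.
# Same illustrative relative units as the earlier sketches (placeholders).

BW, C, N_ACTIVE, N_TOTAL, L_TIMES_B = 1.0, 300.0, 1.0, 18.0, 1e-4

def cost_per_token(B: float) -> float:
    return max(N_TOTAL / (B * BW) + L_TIMES_B / BW, N_ACTIVE / C)

small_player = cost_per_token(500)     # stuck mid-way down the hyperbola
big_player   = cost_per_token(5400)    # pinned near the sweet spot
print(f"small/big cost ratio ≈ {small_player / big_player:.1f}x")  # roughly 10x here
```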

Q3: What’s the business logic of third-party inference providers?

For shops like Together AI, Fireworks, DeepInfra, and other third-party inference providers, the business model is essentially monetizing the curve above:

📌 The essence: bundle many small customers’ traffic into one big batch — combining loads no single customer could fill into a batch big enough to hit the sweet spot. The provider’s margin is the difference batching unlocks.

📎 Quick clarification: aggregator API gateways like OpenRouter don’t run inference themselves — they route requests to upstream providers and take a billing cut. The actual batch pooling happens at the upstream inference providers. OpenRouter’s value is more about a unified interface and model selection, not batch economics.

Aggregation has a ceiling either way: different models can’t share a batch, so an inference provider’s effective batch is still constrained by the model distribution it serves.
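
A rough way to see that ceiling is to divide aggregated traffic across the models served; the request counts and model count below are purely hypothetical.

```python
# Sketch: aggregated traffic still splits per model, since models can't share a batch.
# All numbers below are hypothetical.

total_concurrent_requests = 8000
models_served = 20                      # traffic spread across 20 different models
sweet_spot = 5400                       # 300 × inverse sparsity for a DeepSeek-V3-like model

per_model_batch = total_concurrent_requests / models_served   # = 400
print(per_model_batch, "per model vs a sweet spot of", sweet_spot)
# Even a large aggregator can sit far below the sweet spot on its long tail of models.
```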


VII. Back to the “Expensive + Fast” vs. “Cheap + Slow” Mystery

The first post asked a question: can you get “100× price for 100× speed,” or “100× slower for 100× cheaper”? Now we can answer it with the chart ourselves.

Overlay the latency chart and the cost chart

Latency chart (user view)               Cost chart (operator view)

 T                                       Cost/token
 │           ╱                           │╲
 │          ╱                            │ ╲
 │         ╱                             │  ╲___
 │ ───────╳  ← latency optimum           │      ╳ ← cost optimum at crossover
 │            at crossover               │       ─────── flat afterwards
 └──────► B                              └──────────► B

 Before crossover:                       Before crossover:
   more batch doesn't hurt latency         more batch significantly cuts cost
 After crossover:                        After crossover:
   more batch directly raises latency      more batch barely cuts cost

🔑 Key geometric fact: the crossover B is the same on both charts — both equal B = 300 × inverse sparsity, because they share the same underlying equations.

What this means

The operator faces a clear choice:

  • B = sweet spot: cost near floor ✅, latency near floor ✅, perfect
  • B > sweet spot: cost barely drops further, but latency rises linearly
  • B < sweet spot: latency unchanged (still at physical floor), cost scales as 1/B — pure loss

🔑 A rational operator should pin the batch size at the sweet spot. This isn’t “an agonizing trade-off between user experience and cost” — both are simultaneously optimal at the same point. This is one of the most elegant properties of LLM inference economics.
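
You can check the “both optimal at the same point” claim numerically. The sketch below reuses the illustrative relative units from earlier and evaluates both latency T and cost T/B around the sweet spot; when the KV term is small, the two crossovers essentially coincide.

```python
# Sketch: latency T and cost T/B around the sweet spot, in the same relative units.
# Illustrative placeholders only; only the shapes matter.

BW, C, N_ACTIVE, N_TOTAL, L_TIMES_B = 1.0, 300.0, 1.0, 18.0, 1e-4

def latency(B: float) -> float:          # user view: time per generated token
    t_memory = N_TOTAL / BW + B * L_TIMES_B / BW
    t_compute = B * N_ACTIVE / C
    return max(t_memory, t_compute)

def cost(B: float) -> float:             # operator view: GPU-time per token
    return latency(B) / B

for B in [500, 5400, 20000]:
    print(f"B={B:>6}  latency={latency(B):8.2f}  cost={cost(B):.6f}")

# B below the sweet spot: latency barely changes, cost blows up.
# B above the sweet spot: cost barely changes, latency climbs linearly.
```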

What “fast tiers” are actually doing

The “fast tiers” various providers offer (some charge a few times the price for two or three times the speed) are doing something simple: they let you take up a bigger share of the batch, dragging the system from the sweet spot toward “smaller B.”

  • Normal mode: you share a machine with thousands of other users
  • Fast tier: you “occupy N user slots,” effectively shrinking batch → moving leftward on the latency chart
  • But the latency chart’s left side is flat (you’ve already hit the physical floor), so speed only goes up by a bounded multiple
  • Meanwhile the cost curve gets pushed onto the hyperbola segment, costs surge → hence the multi-x price tag

🔑 Why can’t you get “100× price for 100× speed”? Because the latency floor — parameter-movement time — is a physical wall. The most a fast tier can do is pull you from the compute-bound region back to the memory-bound region, hitting that latency floor — no amount of money breaks the wall.

What about “cheap + slow”?

It doesn’t work either, because the cost curve is flat beyond B = sweet spot — being willing to wait longer doesn’t save you any money.

🔑 Conclusion:

  • ✅ “Expensive + fast” works (until you hit the wall)
  • ❌ “Cheap + slow” doesn’t (the cost floor is a hard physical constant)

VIII. Wrapping Up with One Table

| Quantity | Formula | Geometry | Dependence on B |
|---|---|---|---|
| Param-move cost / token | N_total / (B × BW) | hyperbola | amortizable (cheaper as B grows) |
| KV-move cost / token | L × b / BW | horizontal line | non-amortizable |
| Compute cost / token | N_active / C | horizontal line | non-amortizable |
| Total cost / token | max(hyper + KV, compute) | falls then flattens | minimum at sweet spot |
| Cost floor | ≈ N_active / C | right-side asymptote | limit as B → ∞ |

Core equation:

$$\boxed{\ \text{Cost/token} = \max\left(\frac{N_{total}}{B \cdot BW} + \frac{L \cdot b}{BW},\ \frac{N_{active}}{C}\right)\ }$$

Where “non-batched is X× worse” comes from:

$$\boxed{\ \frac{\text{Cost}(B=1)}{\text{Cost}(\text{sweet spot})} \approx 300 \times \frac{N_{total}}{N_{active}}\ }$$

IX. A Few Questions to Chew On

  1. A provider quotes you a GPT-4 class model at $0.5/1M tokens (output). You estimate the compute floor corresponds to roughly $0.3/1M tokens. Where on the cost curve is this provider? If their user volume suddenly drops 50%, what happens?
  2. If the KV horizontal line rises above the compute line, what changes geometrically? Does the sweet spot still exist? What does the cost floor become? When would this happen?
  3. Provider A serves only “short conversations + high frequency” users (small KV + big B). Provider B serves “very long conversations + low frequency” (big KV + small B). How do their cost curves differ? Which is more fragile?
  4. If a future GPU 10×s BW while keeping C fixed (a more balanced piece of hardware), how does the cost curve change shape? Where does the sweet spot move? What does “non-batched is X× worse” become?

🚀

Up next: that “troublemaker” we’ve been hinting at in the background — the KV cache — finally gets unpacked head-on. What exactly is that mysterious b (bytes per token)? Why do different models differ so wildly? What problem does each of GQA, MLA, and cross-layer sharing actually solve? The most fun part: we’ll reverse-engineer per-token bytes from Gemini 1.5 Pro’s historical long-context tier pricing — any AI API’s pricing structure works like an X-ray, letting you see the internal architecture from the outside.
