01: The First Principle of LLMs: Predicting the Next Token

Abstract token tiles flowing through a neural network and branching into candidate next tokens


This is the first post in the “Understanding LLMs from First Principles” series.

This post starts from the everyday experience of using an LLM and works through it step by step, to answer why the mechanism of “given context, predict the next token” is the starting point for understanding large language models.

Have you ever wondered how large language models can seem to “answer anything,” and appear fluent at everything from poetry and prose to strategy, code, and creative work?

This series uses first-principles thinking to take LLMs apart and analyze the principles behind them.

We invoke “first principles” not to trace the idea back to Aristotle’s discussion of it, nor to chant a slogan. In a product and engineering context, we usually treat “first principles” as a method of decomposition:

Return to the most basic conditions of a thing, break it into elements that are hard to reduce further, then reason upward from those elements toward the path that can achieve the goal.

Following this line of thinking, we need to figure out two things:

What is the goal? Why do we need large language models at all?
What are the key elements? What are the most basic, irreducible elements for realizing a large language model?

At a high level, humanity keeps pursuing artificial general intelligence (AGI) as a way to raise productivity further: machines that can understand and handle open-ended human tasks such as answering questions, explaining concepts, writing code, summarizing articles, translating sentences, planning steps, and continuing a conversation based on context. These tasks look different on the surface, but they share the same underlying form: they can all be written as a language input, and we expect the system to produce a suitable language output.

That gives us a computable goal:


Given the context so far, generate the most suitable continuation.

This also reveals the two most critical elements of a large language model:


1. Context
2. The ability to generate what comes next

Large language models differ from search engines, knowledge bases, databases, and expert systems: they are not built to store one fixed answer, but to learn the regularities of how context leads to what comes next in human language — so they can take on more general tasks and answer questions they have never seen before.

From here, “first principles” in this series is not a mysterious slogan. It is a chain of reasoning we can derive:


Goal: handle open-ended human language tasks
Path: unify those tasks as "given context, generate continuation"
Training: learn next-token probability distributions from massive text corpora
Generation: sample and unfold token by token through an autoregressive process
Result: compress the distribution of language, indirectly learning structures behind language

Next, we will look inside the core mechanism of large language models and gradually unpack the technical principles behind it.

1. The Model Sees Tokens, Not Sentences

A continuous stream of content split into token tiles of different sizes before entering a model input layer

Let’s start from the way we use these models today. You are probably already used to asking things like:


Summarize this article for me.
Write a piece of code.
Help me think through a product plan.
Make this sentence sound more natural.

When you send these prompts to a model, the answer often feels fluent and natural, almost as if it really “knows” the answer.

If you watch closely, though, you will notice that the model’s answer appears word by word. Technically, this is called streaming output.

This word-by-word output is not only a product feature. It also corresponds to the model’s basic mechanism:

At its core, the model is trained to predict token probabilities.

For now, you can think of a token as the smallest chunk of content the model treats as a unit. It may be a character, a word, a punctuation mark, a code symbol, or some other piece of text.

After the user enters a prompt, a large language model does not first think up a complete answer in its head and then type it out. Instead, it keeps repeating this loop:


Given what has already been seen, compute a probability distribution over the next small chunk;
select one candidate chunk;
append it to the context;
predict the next chunk again;
repeat until the answer is finished.
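The loop above can be sketched in a few lines of Python. This is a minimal sketch, not how any real model is implemented: TOY_MODEL is a hypothetical hand-written lookup table that only looks at the last token, whereas a real LLM conditions on the whole context and computes the distribution with a neural network. The names next_token_distribution and generate are made up for illustration.

```python
import random

# Hypothetical toy "model": maps the last token to a distribution over
# next tokens. A real LLM conditions on the whole context instead.
TOY_MODEL = {
    "<start>": {"I": 1.0},
    "I": {"ate": 0.7, "saw": 0.3},
    "ate": {"an": 1.0},
    "saw": {"an": 1.0},
    "an": {"apple": 0.8, "orange": 0.2},
    "apple": {"<end>": 1.0},
    "orange": {"<end>": 1.0},
}


def next_token_distribution(context):
    """Return {token: probability} for the next token, given the context."""
    return TOY_MODEL[context[-1]]


def generate(prompt, rng, max_tokens=10):
    context = list(prompt)
    for _ in range(max_tokens):
        dist = next_token_distribution(context)   # compute the distribution
        tokens, weights = zip(*dist.items())
        chosen = rng.choices(tokens, weights=weights)[0]  # select one candidate
        if chosen == "<end>":                     # stop when the answer is finished
            break
        context.append(chosen)                    # append it to the context
    return context                                # each loop pass predicts again


print(generate(["<start>"], random.Random(0)))
```

Passing in a seeded random.Random makes a run repeatable. With a different seed, the same prompt can follow a different path through the table.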

Sounds like autocomplete, right?

That intuition is not wrong, but it is not enough. Ordinary autocomplete fills in a few common words. A large language model is trained across massive amounts of text, code, papers, web pages, conversations, and mathematical reasoning, until it becomes an extremely complex prediction system. Once this prediction task is scaled far enough, it is no longer just guessing words. In order to predict well, the model is forced to learn the structure behind language: facts, grammar, logic, style, code patterns, the relationship between questions and answers, and the common ways humans complete tasks.

Coming back to tokens, a token is the basic unit a model uses to process the world.

When a person reads this sentence:


I ate an apple today

we naturally understand it as one complete meaning: someone ate an apple today.

But the model does not directly receive that complete meaning. Before the sentence enters the model, it is first split into a sequence of tokens. It might look like this, with each word becoming one token:


I / ate / an / apple / today

Or it may be finer, with some words split into smaller subword pieces:


I / ate / an / app / le / today

The exact split depends on the tokenizer, the component that turns text into tokens.
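To make the splitting step concrete, here is a minimal sketch of a toy tokenizer. It assumes a tiny hand-written vocabulary (VOCAB is hypothetical) and uses greedy longest-match splitting, which is a simple stand-in chosen for readability. Real tokenizers learn their vocabulary from data and use more involved algorithms, so treat this only as an illustration of "text in, token pieces out."

```python
# Hypothetical vocabulary; real tokenizers learn theirs from large corpora.
VOCAB = {"I", "ate", "an", "app", "le", "today", " "}


def tokenize(text, vocab=VOCAB):
    """Greedy longest-match split. Unknown characters become 1-char tokens."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible piece first, then shrink.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # fallback: emit a single character
            i += 1
    return tokens


# Drop the space tokens only to make the printed result easier to read.
pieces = [t for t in tokenize("I ate an apple today") if t != " "]
print(" / ".join(pieces))  # I / ate / an / app / le / today
```

Because "apple" is not in this vocabulary but "app" and "le" are, the word is split into two pieces. Change the vocabulary and the same sentence splits differently.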

English, Chinese, code, and formulas all go through this process. For example, a piece of code like this:


function add(a, b) { return

does not appear to the model as one complete “function.” It appears as a sequence of symbolic fragments that can be continued.

This is an important shift in perspective. From the model’s point of view, the world is not a continuous photograph, and it is not a ready-made knowledge base. It is a stream of token sequences.

You can imagine it this way: humans write knowledge, experience, institutions, emotions, code, and mathematics into language. The model cuts that language into tokens, then learns how those tokens are arranged and how they relate to one another. After learning from massive amounts of content, something that looks like “intelligence” begins to emerge.

So when we say the model “understands a sentence,” it does not grasp the sentence’s intended meaning in one step. It first turns the sentence into a sequence of tokens, then builds relationships among those tokens.

This explains many things that may otherwise seem strange:


Why do model APIs charge by token?
Why can Chinese and English have different token costs?
Why do long articles consume more context?
Why do code completion, mathematical symbols, and multilingual text affect model behavior?

Because tokens are the model’s basic unit for processing the world.

2. The Training Task Is Simple: Predict the Next Token

The model uses the true next token as the training target and updates parameters through prediction error feedback

Once we understand tokens, we can look at training. The most basic training task of a large language model is called next-token prediction.

More precisely, it is not trying to predict one uniquely correct answer. It is learning:


Given the current context, what is the probability distribution over the next token?

For example, given “Happy birthday,” it should assign a high probability to “to,” with “you” likely to follow; given a function that starts with function add(a, b), it should expect the body to be something like return a + b.
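A small sketch can show what the training signal looks like. It assumes word-level tokens and two hand-written predictions (confident and unsure are hypothetical, not output from any model). The penalty used here is cross-entropy, the negative log of the probability the model gave to the true next token, which is the standard loss for this kind of task. The function names are made up for illustration.

```python
import math


def training_pairs(tokens):
    """Each prefix is a context; the token that follows it is the target."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def cross_entropy(predicted, target):
    """Penalty for one prediction: -log(probability of the true next token)."""
    return -math.log(predicted.get(target, 1e-12))


tokens = ["Happy", "birthday", "to", "you"]
pairs = training_pairs(tokens)
context, target = pairs[1]  # (["Happy", "birthday"], "to")

# Two hypothetical predictions for that context.
confident = {"to": 0.9, "cake": 0.05, "party": 0.05}
unsure = {"to": 0.2, "cake": 0.4, "party": 0.4}

print(cross_entropy(confident, target))  # small penalty
print(cross_entropy(unsure, target))     # larger penalty
```

One sentence yields several training examples, one per position. Training nudges the parameters so that the penalty shrinks across all of them.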

“Guess the next token” is not a difficult task by itself. The difficult part is guessing well across many different kinds of context. If the model has only seen a few thousand sentences, it can mostly learn fixed phrases, such as “nice weather,” “happy birthday to you,” or “Beijing is the capital of China” — that is closer to memorization. But if the model has seen massive amounts of text, the situation changes. It will encounter novels, news articles, legal documents, code repositories, research papers, encyclopedias, forum answers, product docs, math problems, and chat logs.

At that scale, simply memorizing which words often appear together is not enough. To predict better, the model has to gradually learn:


how sentences are organized
how concepts relate to one another
how facts are usually expressed
how variables flow through code
what kind of answer usually fits a question
what step usually follows a reasoning step
how a task is broken into smaller steps

For example, to complete “Beijing is the capital of China,” the model needs to learn the relationship between Beijing and China. To complete a function, it needs to learn parameters, return values, and syntax. To continue an analysis, it needs to learn how one point connects to the next.

So on the surface, the model is learning “what token comes next.” At a deeper level, it is learning:


the structure behind language.

That is why a task as small as “predict the next token” can grow into such large capabilities.

3. The Model Does Not Give an Answer; It Gives Probabilities

A model turns context into many candidate next tokens with different probabilities, from narrow low-temperature sampling to wider high-temperature sampling

So does the model really know the correct answer? Not exactly. At every step, what it produces is a set of probabilities.

We can think of a large language model as a giant function:


f(context) = probability distribution over the next token

For example, given:


Newton proposed

the model might assign probabilities like:


gravity: 70%
the laws of motion: 20%
calculus: 5%
other: 5%

The real situation is more complex: the model calculates a probability for every possible token in its vocabulary. Then the system chooses one token according to certain rules, appends it back into the context, and continues predicting the next one:


you enter a question
→ the model calculates probabilities for the next token
→ the system chooses one token
→ that token is added to the context
→ the model predicts the next token
→ a complete answer is built step by step
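How does a model end up with a probability for every token? In typical implementations the network outputs one raw score per vocabulary entry, and a softmax turns those scores into probabilities. The sketch below assumes a four-entry vocabulary and made-up scores (logits is hypothetical), so the numbers illustrate the mechanism rather than any real model.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that are positive and sum to 1."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}


# Hypothetical raw scores for the context "Newton proposed".
logits = {"gravity": 3.0, "the": 1.7, "calculus": 0.4, "a": 0.4}
probs = softmax(logits)

for token, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{token}: {p:.1%}")
```

A higher score always means a higher probability, and equal scores get equal probabilities. A real vocabulary has tens of thousands of entries or more, but the step is the same.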

This is also why the same question can sometimes produce different answers. If the system always chooses the highest-probability token, the answer will be more stable, but it may also become stiff. If the system is allowed to choose from tokens with slightly lower probabilities, the answer becomes richer and more varied.

Parameters such as temperature and top-p, which you may have seen in model APIs, are ways to tune this process. To put it simply:


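lower temperature: steadier, more conservative
higher temperature: more varied, more exploratory
lower top-p: choose only among the most likely tokens
higher top-p: let less likely tokens stay in the running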
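Here is a minimal sketch of how these two knobs are commonly implemented, assuming made-up scores for four candidate tokens (logits is hypothetical). Temperature divides the raw scores before they become probabilities. Top-p keeps the most likely tokens until their combined probability reaches the threshold, then renormalizes. The function names are illustrative, and real APIs may differ in the details.

```python
import math
import random


def apply_temperature(logits, temperature):
    """Divide the scores by the temperature, then apply softmax."""
    scaled = {t: s / temperature for t, s in logits.items()}
    peak = max(scaled.values())
    exps = {t: math.exp(s - peak) for t, s in scaled.items()}
    total = sum(exps.values())
    return {t: e / total for t, e in exps.items()}


def top_p_filter(probs, top_p):
    """Keep the likeliest tokens until they cover top_p, then renormalize."""
    kept, cumulative = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {t: p / total for t, p in kept.items()}


def sample(logits, temperature=1.0, top_p=1.0, rng=random):
    probs = top_p_filter(apply_temperature(logits, temperature), top_p)
    tokens, weights = zip(*probs.items())
    return rng.choices(tokens, weights=weights)[0]


# Hypothetical scores for four candidate tokens.
logits = {"gravity": 3.0, "motion": 1.7, "calculus": 0.4, "apples": 0.4}
cold = apply_temperature(logits, 0.5)  # sharper: the top token dominates
hot = apply_temperature(logits, 2.0)   # flatter: alternatives gain weight
print(cold["gravity"], hot["gravity"])
```

With these scores, a top-p of 0.5 at temperature 1.0 leaves only the top token, so the choice becomes deterministic. Raising either setting brings the other candidates back into play.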

So during inference, a large language model does not first “think through the whole answer” or “create an outline and then fill it in.” It generates part by part, puts what it just generated back into the context, and builds the answer one step at a time. This is the foundation for understanding hallucination, context engineering, reasoning, and agents.

4. Transformer and Attention: “Where Should I Look Now?”

Attention lines of different strengths point from earlier context tokens to the current token being generated

Every time the model predicts a token, it has to refer to everything that came before. But the context holds so much, and its pieces vary in relevance — so where exactly should the current prediction look?

For example, this sentence:


Tom put the backpack on the chair because it was too heavy.

What does “it” refer to?

Humans quickly infer that it probably refers to the backpack, because “too heavy” sounds more like a description of the backpack than the chair.

For the model, this cannot be solved by simply looking at the previous word. It needs to find relationships inside the context.

This is one major reason the Transformer architecture matters. The Attention mechanism inside the Transformer handles exactly this problem:


When generating the current token, which tokens in the context should matter most?

You can think of Attention as the model’s temporary focusing ability.

It does not look at every word in the context equally. At each step, it reallocates attention: should this step focus on the subject, or on the earlier object? Should it focus on a variable in the code, or on a constraint the user just gave? Should it focus on the beginning of the paragraph, or the last instruction?

This is why the Transformer can handle long documents, code, multi-turn conversations, and complex prompts. It is not simply mixing all the text together. It is building relationships between pieces of text.
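The focusing idea can be sketched as scaled dot-product attention for a single query. This is a minimal sketch under loud assumptions: the two-dimensional vectors below are hand-picked so that the query for “it” lines up with “backpack.” Real models learn these vectors, use many more dimensions, and run many attention heads in parallel, so nothing here shows how a real model resolves the pronoun.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    # How well does the query match each context token's key?
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # The output is the weighted average of the value vectors.
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output


# Hypothetical 2-dimensional vectors, chosen by hand for illustration.
context = ["backpack", "chair", "heavy"]
keys = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query_for_it = [2.0, 0.0]  # made-up query that happens to align with "backpack"

weights, output = attend(query_for_it, keys, values)
for token, w in zip(context, weights):
    print(f"{token}: {w:.2f}")
```

The weights sum to 1, so attention is a budget: giving more weight to one token means giving less to the others.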

Of course, Attention is not omnipotent.

The longer the context, the more information the model has to process, and the higher the cost. If key information is buried too deeply, the model may still miss it. If a prompt is messy, the model may focus on the wrong thing.

Later, when we talk about context engineering, RAG, and tool use, they are all, in a sense, trying to solve the same problem: how to put the right information, at the right time, in the place where the model can use it most easily.

5. How Prediction Starts to Look Like Understanding

Many language traces are compressed into reusable structure, making prediction look like understanding at the output layer

At this point, we reach the central question:

If the model is only guessing the next token, why does it look like it understands, reasons, and creates?

The answer is compression. Through language as the medium, the model compresses a huge amount of regularity into its parameters during training. It keeps reading text, predicting, discovering where it guessed wrong, and adjusting its internal parameters, until word combinations, concept relationships, narrative structures, code patterns, and other regularities are all packed in. In the end, model parameters become something like a giant compressed package of patterns. This is what it means to say the model is a compression of the world.
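One way to make the link between prediction and compression concrete comes from information theory, which is an outside lens rather than something this post has derived: a token predicted with probability p can be encoded in about -log2(p) bits. The sketch below compares a model that knows nothing with a simple bigram model on a made-up twelve-word text. It fits the bigram counts on the same text it then encodes and ignores the cost of storing the counts, so it overstates the gain. It is an illustration of the idea only.

```python
import math
from collections import Counter, defaultdict

# Made-up example text.
text = "the cat sat on the mat the cat sat on the hat".split()
vocab = sorted(set(text))


def bits_uniform(tokens):
    """Cost when every token is equally likely: log2(vocab size) bits each."""
    return len(tokens) * math.log2(len(vocab))


def bits_bigram(tokens):
    """Cost of tokens[1:] when each is predicted from the token before it."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    total_bits = 0.0
    for prev, nxt in zip(tokens, tokens[1:]):
        p = counts[prev][nxt] / sum(counts[prev].values())
        total_bits += -math.log2(p)
    return total_bits


uniform_cost = bits_uniform(text[1:])  # the same 11 tokens, no prediction
bigram_cost = bits_bigram(text)
print(f"uniform: {uniform_cost:.1f} bits, bigram: {bigram_cost:.1f} bits")
```

On this text the bigram model needs 6 bits where the uniform model needs about 28, because most words are fully determined by the word before them. Better prediction and shorter encoding are the same thing seen from two sides.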

Many parts of human society leave traces in language. Scientific knowledge is written into papers. Business experience is written into case studies. Program logic is written into code. Legal systems are written into statutes. Common sense is written into conversations and stories. When a model learns the distribution of language, it also indirectly learns the structure behind that language.

This does not mean it experiences the world like a person. It has no childhood, no body, and no subjective feelings. Its “understanding” is closer to an ability:


modeling relationships among symbols in a high-dimensional space, then transferring those relationships into new contexts.

That sounds technical. A plainer way to say it is:


It has not seen the world with its own eyes. It has learned many recurring patterns from the ways humans describe the world.

This also explains why large language models are both powerful and fallible.

They are powerful because language really does contain a huge amount of human civilization. They fail because their objective is still to generate a next token that looks reasonable, not to check whether a statement is true.

6. Scaling Law: Pushing a Simple Task Into Complex Capability

As model scale, data, and compute grow, the simple prediction task crosses increasingly complex capability thresholds

If we only look at “guess the next token,” large language models should not seem this capable. But when several dimensions scale together, things change:


more data
more parameters
more compute
longer context
better training methods
higher-quality feedback

The model begins to show abilities that no longer feel like autocomplete:


summarization
translation
coding
style imitation
multi-step reasoning
role play
tool use
product planning
complex task decomposition

Engineers did not write these abilities into the model one by one. Nobody manually programmed “how to summarize an article,” “how to write a Python function,” or “how to play the role of an interviewer” into the weights.

More accurately, the training objective and the scale of the data push the model further: in order to keep improving its predictions, it must form richer and richer internal representations.

You can picture it as a puzzle.

At first, the model is only guessing where the next piece should go. With only a few pieces, it can remember some local shapes. As more and more pieces appear, it begins to see boundaries, colors, and structure. Eventually, it can even infer what the whole picture might be.

This is what people call emergent capability.

But emergence is not magic. It does not mean a tiny person who truly understands the world suddenly appears inside the model. It is closer to this: a sufficiently complex prediction system, trained on enough data and feedback, learns structures that can generalize.
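The term “scaling law” usually refers to an empirical pattern in which prediction loss falls roughly as a power law as scale grows. The sketch below shows only the shape of such a curve for parameter count. All three constants are made up for illustration and are not fitted to any model, and predicted_loss is a hypothetical name.

```python
def predicted_loss(n_params, irreducible=2.0, coefficient=100.0, exponent=0.3):
    """Power-law shape: loss falls toward a floor as parameter count grows.

    All three constants are invented for illustration only.
    """
    return irreducible + coefficient / (n_params ** exponent)


for n in (1e8, 1e9, 1e10, 1e11):
    print(f"{n:.0e} parameters -> loss {predicted_loss(n):.3f}")
```

The curve keeps improving with scale but never drops below its floor, and each tenfold increase buys a smaller absolute gain than the last. Note that this describes loss, not abilities: how a smooth drop in loss relates to the abilities listed above is a separate question.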

7. Today’s AI Products Are Not Just One Model

A modern AI product system composed of a model, retrieval, tools, safety boundaries, interface, and feedback loop

One more point matters.

The ChatGPT, Claude, Gemini, and Copilot products we use today are usually not bare models directly facing users.

A modern large-language-model product often stacks many layers together:


pretrained model
+ instruction tuning
+ RLHF / RLAIF
+ system prompts
+ context engineering
+ RAG retrieval
+ tool use
+ agent frameworks
+ permission and safety policies
+ product interaction design

It is fine if some of these terms are unfamiliar for now. We will unpack them later. For the moment, build this intuition:


pretraining teaches the model language and world structure;
instruction tuning makes it better at following tasks;
alignment makes it fit human preferences better;
RAG puts external material into the context;
tool use lets the model search, run code, and operate software;
agent frameworks let the model act step by step toward a goal.
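To show what “puts external material into the context” can mean in practice, here is a minimal sketch of the RAG step. It assumes two made-up documents and a crude word-overlap score in place of a real retriever, which would typically use embeddings and a search index. The names score and build_prompt are hypothetical, and the prompt wording is only one possible choice.

```python
def score(question, document):
    """Count shared lowercase words: a crude stand-in for a real retriever."""
    return len(set(question.lower().split()) & set(document.lower().split()))


def build_prompt(question, documents, top_k=1):
    """Pick the best-matching documents and place them ahead of the question."""
    ranked = sorted(documents, key=lambda d: score(question, d), reverse=True)
    material = "\n".join(f"- {d}" for d in ranked[:top_k])
    return (
        "Answer using only the material below.\n"
        f"Material:\n{material}\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Made-up documents for illustration.
documents = [
    "The refund window is 30 days from delivery.",
    "Support is available on weekdays from 9 to 5.",
]
prompt = build_prompt("How long is the refund window?", documents)
print(prompt)
```

The model itself is unchanged here. What changes is the context it sees, so the next-token predictions can draw on the retrieved sentence instead of relying only on what was compressed into the parameters.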

So when we evaluate an AI product, we should not only ask, “Is the model smart?” We should also ask:


What information can it see?
What tools can it use?
Does it verify results?
Can it recover from failure?
Does it have permission boundaries?
Does the interaction let users stay in control?

The model itself provides powerful language and reasoning ability. The product capability we can actually use comes from the combination of the model and the surrounding engineering system.

8. So What Is a Large Language Model?

A large language model summarized as probabilistic prediction, compressed world structure, and a surrounding product system

Now we can return to the opening question. What is a large language model?

First, it is a probabilistic prediction model: given a context, it predicts a probability distribution over the next token.
Second, it gains capability through compression: model parameters are not knowledge cards, but the compressed result of massive patterns in language.
Third, its “understanding” comes from the pressure to predict well: to predict well enough, it has to learn grammar, facts, logic, style, code patterns, and task structures.
Fourth, its reasoning is generated step by step: the model does not first obtain a perfect answer and then output it, but chooses a more reasonable next token at each step, gradually completing the whole response.
Fifth, the capability we actually experience today comes from system composition: product capability = model × data × compute × alignment × context × tools × product design

If we compress all of this into one sentence:


The first principle of LLMs: compress the distribution of human knowledge by predicting the next token.

Once you understand this, many later questions become clearer:

Why do tokens affect cost? Because tokens are the model’s basic unit for processing the world.
Why is context so important? Because every step of generation depends on the current context.
Why do hallucinations happen? Because generating plausible text is not the same as guaranteeing truth.
Why does RAG help? Because it puts traceable new material into the context.
Why are agents an important direction? Because the model itself only generates tokens, while tools and action loops can turn tokens into real tasks.

9. Common Misconceptions

Misconception 1: An LLM is looking things up in a database.

No. A bare model does not open an internal fact table when answering. It uses the current context and patterns compressed into its parameters to compute a probability distribution over the next token. Search, RAG, databases, and tool use are external system capabilities, not the bare model itself.

Misconception 2: An LLM is just advanced autocomplete.

The analogy helps explain token-by-token generation, but it underestimates the capability. Ordinary autocomplete mainly completes common phrases. To predict massive and complex text, an LLM is forced to learn concept relationships, code structure, reasoning steps, and task patterns.

Misconception 3: The model first thinks of the full answer, then prints it word by word.

No. It puts the tokens it has already generated back into the context, then predicts the next step. This is also why a small early error can be amplified later in the generation.

Misconception 4: If the model is large enough, it will automatically be reliable.

Scale can move capability boundaries, but it does not automatically provide factual grounding, permission control, business validation, or product accountability. Reliable AI products come from the combination of model, context, tools, evaluation, and interaction design.

10. Three Questions You Should Be Able to Answer

After reading this post, try answering these in your own words:

Why is “predict the next token” not a shallow task?
Why is model generation a probabilistic path rather than a fixed answer retrieved from a database?
Why should modern AI products be evaluated by more than whether the model is “smart”?

In the next post, we will continue downward: what exactly are tokens and embeddings? And how does language become numbers a model can process?