Hello, World!
This is a casual blog about AI, product design, and travel.
- Built with Nextra
- Deployed via Vercel
- Hosted on GitHub
- On PC, site development is done with VSCode, and Markdown content is edited with Typora
- On iPad, Working Copy is used to connect with the GitHub repository for content management, and Tiao is used for Markdown editing
The solution is basically free (I bought a domain for a better experience), stable in service, and offers good access speed both domestically and internationally. It's enjoyable to read on both PC and mobile. With git, multi-device synchronization and version management are possible. Writing on iPad is also a pleasure. Overall, I've found a solution I'm quite satisfied with.
This site will be updated from time to time with insights from a product manager's work, tool/product experience sharing, travel stories, and more. I hope it can be helpful or inspiring to you ❤️
Recently Published
- 02: Token and Embedding: How Language Becomes Numbers
A beginner-friendly explanation of how LLMs turn text into tokens, token ids, and high-dimensional embeddings so language can enter neural network computation.
- 01: The First Principle of LLMs: Token Prediction
A beginner-friendly explanation of why large language models are not knowledge databases, but probabilistic systems that compress patterns in language by predicting the next token.
- The Math Behind LLM Pricing 05: From One GPU to a Cluster - Parallelism and Interconnect
The first four posts stayed inside a single GPU. This one goes out to the cluster β how models are sliced (pipeline / expert / tensor), what a rack actually is, why MoE strongly prefers a single rack, why scale-up solves bandwidth rather than capacity, and why 1T-scale models only became economically feasible with Blackwell.
- The Math Behind LLM Pricing 03: From Inference Latency to Inference Cost
From the latency chart to the cost chart β splitting per-token cost into "amortizable parameter movement + non-amortizable KV + non-amortizable compute." Geometrically shows why running without batching is thousands of times worse, where AI scale economies actually come from, and why "cheap + slow" is physically impossible.
- The Math Behind LLM Pricing 04: Cracking Open the KV Cache, the Villain
Cracking open the KV cache β what those bytes actually contain, what GQA / MLA / cross-layer sharing each solve, and a reverse-engineering of Gemini's internal architecture from its public long-context pricing. Plus a three-way framework for telling apart "faster compute" vs "amortization fix" vs "hardware shift" innovations.
- The Math Behind LLM Pricing 01: How Inference Actually Works - Starting with "Moving Stuff vs. Computing Stuff"
A non-technical primer on LLM inference β moving stuff vs. computing stuff. Why ChatGPT outputs one token at a time, why long contexts suddenly cost more, and why no money buys instant answers. Builds intuition for memory-bound, batching, and KV cache.

