Tokens and context windows, explained

2026-06-30 · 5 min read

Two ideas unlock most of the confusing behaviour of language models: tokens (how they read) and the context window (how much they can hold at once). Get these and things like "why did it forget what I said?" suddenly make sense.

Models read tokens, not words

Before a model sees your text, it's chopped into tokens: chunks that are often a whole short word, sometimes a piece of a longer one. A rough rule for English text: one token ≈ 4 characters ≈ ¾ of a word. Exact splits depend on the tokenizer, but "Understanding" might split into under + stand + ing. The model only ever works in these units, reading them and predicting the next one.
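To make the rule of thumb concrete, here is a minimal sketch of a token estimator. It only applies the "about 4 characters per token" approximation from above; it is not a real tokenizer, and the names `CHARS_PER_TOKEN` and `estimate_tokens` are made up for illustration.

```python
import math

# Rule of thumb from the text: roughly 4 characters per token.
# Real tokenizers vary by model and language, so treat this as an estimate.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Return a rough token count based on character length alone."""
    if not text:
        return 0
    return math.ceil(len(text) / CHARS_PER_TOKEN)


if __name__ == "__main__":
    prompt = "Two ideas unlock most of the confusing behaviour of language models."
    print(f"{len(prompt)} characters, roughly {estimate_tokens(prompt)} tokens")
```

An estimate like this is good enough for sizing a prompt. For billing or hard limits, you would want the count from the tokenizer your model actually uses.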

The context window is a fixed-size desk

The window holds a fixed number of tokens. As new text arrives, the oldest scrolls off the edge.

The context window is the maximum number of tokens the model can consider at once, and it has to fit everything: the system instructions, the whole conversation so far, your latest message, and room for its reply. It's one desk of a fixed size. Pile on more paper and the sheets at the back fall off.
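The "it has to fit everything" part is just addition. The sketch below adds up the four pieces named above and checks them against the window. The class name and all the token figures are hypothetical, chosen only to show the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class ContextBudget:
    """Token budget for one request. All figures are illustrative."""
    window: int         # maximum tokens the model can consider at once
    system: int         # system instructions
    history: int        # the conversation so far
    message: int        # your latest message
    reply_reserve: int  # room kept free for the reply

    def used(self) -> int:
        return self.system + self.history + self.message + self.reply_reserve

    def remaining(self) -> int:
        return self.window - self.used()

    def fits(self) -> bool:
        return self.remaining() >= 0


if __name__ == "__main__":
    budget = ContextBudget(window=8000, system=500, history=6000,
                           message=300, reply_reserve=1000)
    print(budget.used(), budget.remaining(), budget.fits())
```

The reply is easy to forget: a prompt that fills the window exactly leaves the model no space to answer.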

The gist

The model reads in tokens and can only hold a fixed number at a time. When a chat outgrows that limit, the oldest parts drop out of view — which is exactly why it seems to "forget."
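Here is a sketch of how that "forgetting" can happen, assuming the chat application trims oldest-first, as the desk picture suggests. Real applications may summarise or reject the request instead. The function names are hypothetical, and token counts use the rough 4-characters-per-token estimate rather than a real tokenizer.

```python
import math
from collections import deque


def estimate_tokens(text: str) -> int:
    """Rough estimate: about 4 characters per token (not a real tokenizer)."""
    return math.ceil(len(text) / 4) if text else 0


def trim_history(messages: list[str], max_tokens: int) -> list[str]:
    """Keep the most recent messages that fit in max_tokens, dropping the oldest."""
    kept: deque[str] = deque()
    total = 0
    # Walk backwards from the newest message and stop at the first one that
    # would overflow the budget; everything older is dropped with it.
    for msg in reversed(messages):
        cost = estimate_tokens(msg)
        if total + cost > max_tokens:
            break
        kept.appendleft(msg)
        total += cost
    return list(kept)


if __name__ == "__main__":
    chat = ["My name is Sam.", "I like hiking.", "What should I pack for a day trip?"]
    print(trim_history(chat, max_tokens=14))
```

With a tight enough budget the first message, the one carrying the name, is the first to go. The model never sees it, so it cannot use it.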

Why this matters in practice

Cost and speed scale with tokens. Hosted APIs typically bill per token, so a bloated prompt is a slower, pricier one.
Long chats lose the start. If early details matter, restate them or summarise rather than assuming they're still "in the room."
Chunking exists because of this limit: big documents are split into pieces, and only the most relevant ones are retrieved by semantic search and fed in. This is the retrieval step behind RAG (retrieval-augmented generation).
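The chunking idea can be sketched in a few lines. This is a toy version under loud assumptions: chunks are cut by word count, and relevance is scored by plain word overlap, which stands in for the embedding-based semantic search a real system would use. All function names are hypothetical.

```python
def chunk_words(text: str, max_words: int = 150) -> list[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]


def overlap_score(chunk: str, query: str) -> int:
    """Toy relevance score: how many distinct query words appear in the chunk.

    A stand-in for semantic search, not an implementation of it.
    """
    chunk_vocab = set(chunk.lower().split())
    return sum(1 for word in set(query.lower().split()) if word in chunk_vocab)


def top_chunks(text: str, query: str, k: int = 2, max_words: int = 150) -> list[str]:
    """Return the k chunks that best match the query under the toy score."""
    chunks = chunk_words(text, max_words)
    ranked = sorted(chunks, key=lambda c: overlap_score(c, query), reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    doc = "cats purr softly dogs bark loudly fish swim quietly"
    print(top_chunks(doc, "do dogs bark", k=1, max_words=3))
```

Only the chunks returned by `top_chunks` would go into the prompt, so the document's size stops mattering and the window only has to hold the few pieces that answer the question.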

Once you picture that fixed-size desk, prompt engineering stops feeling like magic and starts feeling like packing a bag: bring what matters, leave the rest.

