Tokens and context windows, explained
Two ideas unlock most of the confusing behaviour of language models: tokens (how they read) and the context window (how much they can hold at once). Get these and things like "why did it forget what I said?" suddenly make sense.
Models read tokens, not words
Before a model sees your text, it's chopped into tokens: chunks that are often a whole short word, sometimes a piece of a longer one. A rough rule for English text: one token ≈ 4 characters ≈ ¾ of a word. "Understanding" might split into under + stand + ing, though the exact split depends on the model's tokenizer. The model only ever works in these units, reading them and predicting the next one.
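To make the rule of thumb concrete, here is a minimal sketch that estimates a token count from character length alone. It is not a real tokenizer: the name estimate_tokens and the 4-characters-per-token constant are assumptions for illustration, and actual counts vary by model and language.

```python
import math

# Rough rule of thumb for English text; real tokenizers will differ.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Estimate how many tokens a string uses, from its length alone."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


print(estimate_tokens("Tokens and context windows, explained"))
```

An estimate like this is good enough for budgeting a prompt. For exact numbers, use the tokenizer that ships with the model you are calling.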
The context window is a fixed-size desk
The context window is the maximum number of tokens the model can consider at once, and it has to fit everything: the system instructions, the whole conversation so far, your latest message, and room for its reply. It's one desk of a fixed size. Pile on more paper and the oldest sheets fall off.
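The desk picture comes down to simple arithmetic: everything that goes in, plus the space reserved for the reply, has to fit inside the window. The sketch below shows that check. The function names and the 8,000-token window are hypothetical, chosen only for the example.

```python
def remaining_reply_budget(window: int, system: int, history: int, message: int) -> int:
    """Tokens left for the reply once everything else is on the desk."""
    return window - (system + history + message)


def fits(window: int, system: int, history: int, message: int, reply_reserve: int) -> bool:
    """True if the inputs leave at least reply_reserve tokens for the answer."""
    return remaining_reply_budget(window, system, history, message) >= reply_reserve


# Hypothetical numbers: an 8,000-token window with a long conversation in it.
print(remaining_reply_budget(8000, 500, 6000, 300))  # 1200
print(fits(8000, 500, 6000, 300, reply_reserve=1500))  # False
```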
The model reads in tokens and can only hold a fixed number at a time. When a chat outgrows that limit, the oldest parts are typically trimmed out of view, which is exactly why it seems to "forget."
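One simple way a chat application might handle an overgrown conversation is to keep only the most recent messages that fit. The sketch below is an assumption about how such trimming could work, not a description of any particular product. trim_history is a hypothetical helper, and it reuses the rough 4-characters-per-token estimate.

```python
import math


def estimate_tokens(text: str) -> int:
    """Rough estimate: about 4 characters per token (not a real tokenizer)."""
    return math.ceil(len(text) / 4)


def trim_history(messages: list[str], budget: int) -> list[str]:
    """Keep the newest messages whose estimated total fits within budget."""
    kept: list[str] = []
    used = 0
    # Walk from newest to oldest and stop at the first message that won't fit.
    for message in reversed(messages):
        cost = estimate_tokens(message)
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = ["a" * 40, "b" * 40, "c" * 40]  # about 10 tokens each
print(len(trim_history(history, budget=25)))  # 2: the oldest message is dropped
```

Real applications often summarise the dropped messages instead of discarding them, so that early details survive in compressed form.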
Why this matters in practice
- Cost and speed scale with tokens. Hosted models are typically billed per token, and more tokens take longer to process, so a bloated prompt is a slower, pricier one.
- Long chats lose the start. If early details matter, restate them or summarise rather than assuming they're still "in the room."
- Chunking exists because of this limit: big documents are split into pieces, and only the relevant pieces are retrieved and fed in. This is how RAG (retrieval-augmented generation) works, usually with semantic search doing the retrieval.
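Two of the points above can be sketched in a few lines: estimating what a prompt costs, and splitting a long document into chunks that each fit a token budget. Everything here is illustrative. The helper names are hypothetical, the price is a made-up figure rather than any provider's rate, and the token counts use the rough 4-characters-per-token estimate instead of a real tokenizer.

```python
import math


def estimate_tokens(text: str) -> int:
    """Rough estimate: about 4 characters per token (not a real tokenizer)."""
    return math.ceil(len(text) / 4)


def estimate_cost(tokens: int, price_per_million: float) -> float:
    """Cost of processing `tokens` at a given price per million tokens."""
    return tokens / 1_000_000 * price_per_million


def chunk_text(text: str, max_tokens: int) -> list[str]:
    """Greedily pack whole words into chunks of at most max_tokens each.

    A single word longer than the budget still becomes its own chunk.
    """
    chunks: list[str] = []
    current: list[str] = []
    for word in text.split():
        candidate = " ".join(current + [word])
        if current and estimate_tokens(candidate) > max_tokens:
            # Adding this word would overflow, so close the current chunk.
            chunks.append(" ".join(current))
            current = [word]
        else:
            current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks


print(chunk_text("aaaa bbbb cccc dddd", max_tokens=3))  # ['aaaa bbbb', 'cccc dddd']
print(estimate_cost(1_000_000, price_per_million=2.0))  # 2.0 (hypothetical price)
```

In a real retrieval setup, each chunk would then be indexed so that only the few most relevant ones are placed in the prompt. Splitting on sentence or paragraph boundaries usually gives better results than splitting on words alone.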
Once you picture that fixed-size desk, prompt engineering stops feeling like magic and starts feeling like packing a bag: bring what matters, leave the rest.