Word by Word

At their core, large language models are simply predicting the next word. But how do they do it so well? Let's explore the fundamental mechanics of word prediction.

Autoregression: Next Word Prediction

Language models learn patterns in text by predicting the next word given a sequence of previous words. At generation time, each predicted word is fed back into the input so the model can predict the one after it, building text word by word. This simple task, when done at scale with billions of parameters, enables them to generate coherent and contextually appropriate text.
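To make the loop concrete, here is a minimal sketch of autoregressive generation. It uses a toy bigram model (word-pair counts from a tiny corpus) as a stand-in for a neural network; the corpus, the `next_word_probs` helper, and greedy decoding are illustrative choices, not how any particular LLM is implemented. The point is the loop itself: predict a distribution over the next word, pick one, append it, and repeat.

```python
# A minimal sketch of autoregressive generation using a toy bigram model.
# Real LLMs replace `next_word_probs` with a neural network over tokens,
# but the generation loop is the same: predict, append, repeat.
from collections import Counter, defaultdict

corpus = (
    "large language models are just predicting the next word "
    "language models are trained on large amounts of text"
).split()

# Count which word follows which (a crude stand-in for learned parameters).
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def next_word_probs(context):
    """Return a probability distribution over the next word given the context."""
    counts = follows[context[-1]]          # toy model: condition on the last word only
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}

def generate(prompt, num_words=5):
    words = prompt.split()
    for _ in range(num_words):
        probs = next_word_probs(words)
        if not probs:                      # no known continuation
            break
        best = max(probs, key=probs.get)   # greedy decoding: take the most likely word
        words.append(best)                 # feed the prediction back in (autoregression)
    return " ".join(words)

print(generate("language models"))
```

Running this prints "language models are just predicting the next": each word is produced one at a time, conditioned on everything generated so far.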

Given the input "Large language models are just", the model predicts the next word through a sequence of stages:

Tokenization: splits the text into individual tokens the model can understand.
Embedding: converts each token into a numerical vector.
Position: adds information about each token's position in the sequence.
Transformer (×12): processes the tokens through 12 layers of attention mechanisms.
MLP: predicts probabilities for the next token.

For this input, the predicted next word is "infrastructure", as the sketch below illustrates.
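The following is a shape-level sketch of those stages in NumPy with random, untrained weights. The toy vocabulary, layer sizes, and the crude "uniform attention plus feed-forward" stand-in for a transformer layer are all assumptions made for illustration; a real model learns its weights and uses multi-head self-attention inside each of the 12 layers.

```python
# A shape-level sketch of the prediction pipeline, with random (untrained) weights.
import numpy as np

rng = np.random.default_rng(0)

vocab = ["large", "language", "models", "are", "just", "infrastructure"]  # toy vocabulary
d_model, n_layers, max_len = 16, 12, 32

# Tokenization: split the text and map each token to an integer id.
text = "large language models are just"
token_ids = [vocab.index(w) for w in text.split()]

# Embedding: look up a numerical vector for every token id.
embedding = rng.normal(size=(len(vocab), d_model))
x = embedding[token_ids]                      # shape: (seq_len, d_model)

# Position: add a vector encoding each token's position in the sequence.
positions = rng.normal(size=(max_len, d_model))
x = x + positions[: len(token_ids)]

# Transformer x12: each layer mixes information across tokens. Here a crude
# stand-in: uniform attention followed by a small feed-forward transform.
for _ in range(n_layers):
    attended = x.mean(axis=0, keepdims=True) + x          # every token sees the others
    w = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
    x = np.tanh(attended @ w)                             # feed-forward nonlinearity

# Output: turn the final token's vector into next-token probabilities.
logits = x[-1] @ embedding.T                              # score every word in the vocabulary
probs = np.exp(logits - logits.max())
probs /= probs.sum()

for word, p in sorted(zip(vocab, probs), key=lambda t: -t[1]):
    print(f"{word:15s} {p:.3f}")
```

With random weights the output distribution is meaningless; training adjusts the embedding, positional, and layer weights so that, for a prompt like this one, a plausible continuation such as "infrastructure" receives high probability.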