Word by Word

At their core, large language models are simply predicting the next word. But how do they do it so well? Let's explore the fundamental mechanics of word prediction.

Autoregression: Next Word Prediction

Language models learn patterns in text by predicting the next word given a sequence of previous words. At generation time, each predicted word is fed back into the input so the model can predict the one after it, building text word by word. This simple task, when done at scale with billions of parameters, enables them to generate coherent and contextually appropriate text.
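To make the loop concrete, here is a minimal sketch of autoregressive generation. It uses a toy bigram model (word-pair counts from a tiny corpus) as a stand-in for a neural network; the corpus, the `next_word_probs` helper, and greedy decoding are illustrative choices, not how any particular LLM is implemented. The point is the loop itself: predict a distribution over the next word, pick one, append it, and repeat.

```python
# A minimal sketch of autoregressive generation using a toy bigram model.
# Real LLMs replace `next_word_probs` with a neural network over tokens,
# but the generation loop is the same: predict, append, repeat.
from collections import Counter, defaultdict

corpus = (
    "large language models are just predicting the next word "
    "language models are trained on large amounts of text"
).split()

# Count which word follows which (a crude stand-in for learned parameters).
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def next_word_probs(context):
    """Return a probability distribution over the next word given the context."""
    counts = follows[context[-1]]          # toy model: condition on the last word only
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}

def generate(prompt, num_words=5):
    words = prompt.split()
    for _ in range(num_words):
        probs = next_word_probs(words)
        if not probs:                      # no known continuation
            break
        best = max(probs, key=probs.get)   # greedy decoding: take the most likely word
        words.append(best)                 # feed the prediction back in (autoregression)
    return " ".join(words)

print(generate("language models"))
```

Running this prints "language models are just predicting the next": each word is produced one at a time, conditioned on everything generated so far.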

Given the input "Large language models are just", the model predicts the next word through a sequence of stages:

Tokenization: splits the text into individual tokens the model can understand.
Embedding: converts each token into a numerical vector.
Position: adds information about each token's position in the sequence.
Transformer (×12): processes the tokens through 12 layers of attention mechanisms.
MLP: predicts probabilities for the next token.

For this input, the predicted next word is "infrastructure", as the sketch below illustrates.
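The following is a shape-level sketch of those stages in NumPy with random, untrained weights. The toy vocabulary, layer sizes, and the crude "uniform attention plus feed-forward" stand-in for a transformer layer are all assumptions made for illustration; a real model learns its weights and uses multi-head self-attention inside each of the 12 layers.

```python
# A shape-level sketch of the prediction pipeline, with random (untrained) weights.
import numpy as np

rng = np.random.default_rng(0)

vocab = ["large", "language", "models", "are", "just", "infrastructure"]  # toy vocabulary
d_model, n_layers, max_len = 16, 12, 32

# Tokenization: split the text and map each token to an integer id.
text = "large language models are just"
token_ids = [vocab.index(w) for w in text.split()]

# Embedding: look up a numerical vector for every token id.
embedding = rng.normal(size=(len(vocab), d_model))
x = embedding[token_ids]                      # shape: (seq_len, d_model)

# Position: add a vector encoding each token's position in the sequence.
positions = rng.normal(size=(max_len, d_model))
x = x + positions[: len(token_ids)]

# Transformer x12: each layer mixes information across tokens. Here a crude
# stand-in: uniform attention followed by a small feed-forward transform.
for _ in range(n_layers):
    attended = x.mean(axis=0, keepdims=True) + x          # every token sees the others
    w = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
    x = np.tanh(attended @ w)                             # feed-forward nonlinearity

# Output: turn the final token's vector into next-token probabilities.
logits = x[-1] @ embedding.T                              # score every word in the vocabulary
probs = np.exp(logits - logits.max())
probs /= probs.sum()

for word, p in sorted(zip(vocab, probs), key=lambda t: -t[1]):
    print(f"{word:15s} {p:.3f}")
```

With random weights the output distribution is meaningless; training adjusts the embedding, positional, and layer weights so that, for a prompt like this one, a plausible continuation such as "infrastructure" receives high probability.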