Understanding Attention

How do AI models focus on what's important? Through Query-Key-Value attention, each word learns which other words in its context are relevant to it.

Query-Key-Value Attention

Each token is mapped to Query, Key, and Value vectors. The model computes attention scores by comparing a token's Query vector to the Key vectors of every token in the sequence. After a softmax normalization, these scores become weights that determine how much focus the token places on each other token's Value vector; the token's new representation is the weighted sum of those Values.
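
As a concrete illustration, here is a minimal NumPy sketch of that computation. The function names, shapes, and random toy projection matrices are illustrative assumptions, not GPT-2's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, W_q, W_k, W_v):
    """Scaled dot-product attention over a sequence of token embeddings.

    X: (seq_len, d_model) token embeddings
    W_q, W_k, W_v: (d_model, d_head) learned projection matrices (toy values here)
    """
    Q = X @ W_q                                # queries: what each token is looking for
    K = X @ W_k                                # keys: what each token offers
    V = X @ W_v                                # values: the content that gets mixed in
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # compare every query to every key
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ V, weights                # new representations, plus the attention matrix

# Toy usage: 5 tokens ("The cat chased the mouse"), d_model = 8, d_head = 4
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
out, weights = attention(X, W_q, W_k, W_v)
print(weights.round(2))                        # 5x5 matrix of attention weights
```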

[Interactive visualization: the sentence "The cat chased the mouse" with attention lines drawn between words; thicker, brighter lines show stronger attention. Select a word to explore its connections.]

Mathematical Representation

The attention mechanism can be visualized as a matrix in which each cell shows how strongly one token attends to another; brighter (more yellow) cells indicate larger attention weights. In GPT-2, each transformer block has several attention heads, each producing its own attention matrix. Different heads can capture different relationships, and combining them lets the model represent more complex dependencies.
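
The zeros above the diagonal in the matrix below reflect the causal mask used in GPT-2-style decoders: a token can only attend to itself and to earlier tokens. A rough sketch of how such a mask can be applied (names and shapes are illustrative assumptions):

```python
import numpy as np

def causal_attention_weights(Q, K):
    """Attention weights with a causal mask (decoder-style, as in GPT-2).

    Q, K: (seq_len, d_head). Returns a (seq_len, seq_len) lower-triangular
    matrix: row i is token i's attention over tokens 0..i only.
    """
    seq_len = Q.shape[0]
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(future, -np.inf, scores)             # future positions get -inf ...
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return weights / weights.sum(axis=-1, keepdims=True)   # ... so their softmax weight is 0
```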

Note: Each attention head in GPT-2 can specialize in different types of relationships. For example, one head might focus on subject-verb agreement, while another might track semantic relationships between words. The model learns these specializations during training.
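
One way to picture this: each head has its own Query/Key/Value projections, so each head computes its own attention matrix over the same tokens, and their outputs are concatenated. A minimal sketch follows (illustrative names and shapes; real GPT-2 also applies a final output projection after concatenation):

```python
import numpy as np

def head(X, W_q, W_k, W_v):
    # One attention head: project, score, softmax, then mix the Value vectors.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def multi_head_attention(X, heads):
    """X: (seq_len, d_model); heads: list of (W_q, W_k, W_v) projection triples.

    Each head has its own projections, so each can learn to track a different
    relationship; the heads' outputs are concatenated along the feature axis.
    """
    return np.concatenate([head(X, *w) for w in heads], axis=-1)

# Toy usage: 4 heads over 9 tokens, matching the example below
rng = np.random.default_rng(0)
X = rng.normal(size=(9, 16))                     # 9 tokens, d_model = 16
heads = [tuple(rng.normal(size=(16, 4)) for _ in range(3)) for _ in range(4)]
print(multi_head_attention(X, heads).shape)      # (9, 16)
```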

Note the strong semantic connection between "American" and "flag": in this head, "flag" attends to "American" with a weight of 0.85.

Attention Head 1 of 4: attention weights for "The American flag is red, white, and ..." (rows attend to columns).

            The    American  flag   is     red    ,      white  ,      and
The         0.65   0.00      0.00   0.00   0.00   0.00   0.00   0.00   0.00
American    0.35   0.55      0.00   0.00   0.00   0.00   0.00   0.00   0.00
flag        0.25   0.85      0.70   0.00   0.00   0.00   0.00   0.00   0.00
is          0.25   0.35      0.25   0.60   0.00   0.00   0.00   0.00   0.00
red         0.20   0.25      0.30   0.35   0.65   0.00   0.00   0.00   0.00
,           0.25   0.20      0.25   0.30   0.35   0.55   0.00   0.00   0.00
white       0.20   0.25      0.20   0.25   0.30   0.35   0.60   0.00   0.00
,           0.25   0.20      0.25   0.20   0.25   0.30   0.35   0.65   0.00
and         0.20   0.25      0.20   0.25   0.20   0.25   0.30   0.35   0.70