Overview

The “Attention is All You Need” paper by Vaswani et al. (2017) introduced the Transformer architecture, which revolutionized natural language processing by dispensing with recurrence and convolutions entirely, relying solely on attention mechanisms. This paper laid the foundation for modern large language models like GPT, BERT, and T5.

Key Contributions:

  • Introduced the Transformer architecture based entirely on attention mechanisms
  • Achieved state-of-the-art results on machine translation tasks
  • Enabled parallelization during training, making it much faster than RNNs
  • Established the foundation for modern pre-trained language models

Core Ideas

Self-Attention Mechanism

The core innovation is the scaled dot-product attention mechanism: \[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]

Where:

  • $Q$ = Query matrix (what we’re looking for)
  • $K$ = Key matrix (what we’re comparing against)
  • $V$ = Value matrix (what we actually return)
  • $d_k$ = dimension of the key vectors (used for scaling)

Why scaling by $\sqrt{d_k}$?

  • Prevents the dot products from growing too large
  • Keeps the softmax function in regions where it has useful gradients
  • For large $d_k$, dot products grow large in magnitude, pushing softmax into regions with extremely small gradients
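
As a concrete illustration, here is a minimal NumPy sketch of the formula above for a single head; the function name, shapes, and toy inputs are illustrative rather than taken from the paper:

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for a single attention head."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (n_q, n_k) similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # (n_q, d_v) weighted sum of values

# Toy example: 3 query positions, 4 key/value positions, d_k = d_v = 8
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8)), rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)    # (3, 8)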

Multi-Head Attention

Instead of using a single attention function, the model uses multiple “attention heads”: \[\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, ..., \text{head}_h)W^O\]

Where each head is: \(\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)\)

Parameters:

  • $W_i^Q \in \mathbb{R}^{d_{model} \times d_k}$ (Query projection)
  • $W_i^K \in \mathbb{R}^{d_{model} \times d_k}$ (Key projection)
  • $W_i^V \in \mathbb{R}^{d_{model} \times d_v}$ (Value projection)
  • $W^O \in \mathbb{R}^{hd_v \times d_{model}}$ (Output projection)

Typical dimensions:

  • $h = 8$ (number of heads)
  • $d_k = d_v = d_{model}/h = 64$ (when $d_{model} = 512$)
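
Below is a compact NumPy sketch of the split, attend, and concatenate computation with these typical dimensions; the random weight matrices are stand-ins, and slicing one $d_{model} \times d_{model}$ matrix into $h$ column blocks is just one convenient way to realize the per-head projections $W_i^Q$, $W_i^K$, $W_i^V$:

import numpy as np

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h=8):
    """Split d_model into h heads, attend per head, concatenate, project with W_O."""
    n, d_model = X.shape
    d_k = d_model // h
    heads = []
    for i in range(h):
        # Column block i of each full matrix plays the role of W_i^Q, W_i^K, W_i^V
        Q = X @ W_Q[:, i*d_k:(i+1)*d_k]
        K = X @ W_K[:, i*d_k:(i+1)*d_k]
        V = X @ W_V[:, i*d_k:(i+1)*d_k]
        scores = Q @ K.T / np.sqrt(d_k)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        heads.append(weights @ V)                      # (n, d_k) per head
    return np.concatenate(heads, axis=-1) @ W_O        # (n, h*d_v) -> (n, d_model)

n, d_model, h = 10, 512, 8
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V, W_O = (rng.normal(size=(d_model, d_model)) * 0.02 for _ in range(4))
print(multi_head_attention(X, W_Q, W_K, W_V, W_O, h).shape)   # (10, 512)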

Transformer Blocks

Encoder Block

Each encoder layer consists of:

  1. Multi-Head Self-Attention
  2. Add & Norm (Residual connection + Layer Normalization)
  3. Feed-Forward Network
  4. Add & Norm

Mathematical representation:

x' = LayerNorm(x + MultiHeadAttention(x, x, x))
output = LayerNorm(x' + FFN(x'))
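
A minimal PyTorch sketch of one post-norm encoder layer implementing the two update equations above; this is an illustrative reimplementation using standard library modules (nn.MultiheadAttention, nn.LayerNorm), not the paper's original code, and dropout is omitted for brevity:

import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder block: self-attention -> Add & Norm -> FFN -> Add & Norm (post-norm)."""
    def __init__(self, d_model=512, h=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, h, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)      # self-attention: Q = K = V = x
        x = self.norm1(x + attn_out)          # x' = LayerNorm(x + MultiHeadAttention(x, x, x))
        return self.norm2(x + self.ffn(x))    # output = LayerNorm(x' + FFN(x'))

x = torch.randn(2, 16, 512)                   # (batch, sequence length, d_model)
print(EncoderLayer()(x).shape)                # torch.Size([2, 16, 512])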

Feed-Forward Network (FFN)

\(\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2\)

  • Inner dimension: $d_{ff} = 2048$ (typically 4× the model dimension)
  • Uses ReLU activation
  • Applied to each position separately

Decoder Block

Each decoder layer consists of:

  1. Masked Multi-Head Self-Attention (prevents looking at future tokens; see the mask sketch after this list)
  2. Add & Norm
  3. Multi-Head Cross-Attention (attends to encoder output)
  4. Add & Norm
  5. Feed-Forward Network
  6. Add & Norm
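
The masking in step 1 is usually realized with an upper-triangular (causal) mask. A small sketch using PyTorch's boolean attn_mask convention, where True marks positions that may not be attended to:

import torch

n = 5  # sequence length
causal_mask = torch.triu(torch.ones(n, n), diagonal=1).bool()
print(causal_mask)
# Row i has False (attend) in columns 0..i and True (blocked) in columns i+1..n-1,
# so position i can never attend to later positions.
# Passed to the self-attention sub-layer as, e.g., attn(x, x, x, attn_mask=causal_mask)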

Attention Heads

Different attention heads learn to focus on different types of relationships:

Examples of what heads learn:

  • Syntactic relationships: Subject-verb agreement, noun-adjective dependencies
  • Semantic relationships: Coreference resolution, entity relationships
  • Positional patterns: Local vs. long-range dependencies
  • Task-specific patterns: Different heads specialize for different aspects of the task

Head specialization visualization:

  • Some heads focus on the previous token
  • Some heads focus on rare words
  • Some heads focus on delimiter tokens
  • Some heads attend to semantically similar words

Positional Encoding

Since Transformers have no inherent notion of sequence order, positional encodings are added to input embeddings: \[PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)\] \[PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)\]

Where:

  • $pos$ = position in the sequence
  • $i$ = dimension index
  • $d_{model}$ = model dimension
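
A short NumPy sketch that fills a positional-encoding matrix using the two formulas above (function and variable names are illustrative, and $d_{model}$ is assumed even):

import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(max_len)[:, None]                   # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]                # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)   # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                        # even dimensions
    pe[:, 1::2] = np.cos(angles)                        # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(max_len=100, d_model=512)
print(pe.shape)   # (100, 512); added element-wise to the input embeddings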

Properties of sinusoidal encoding:

  • Each dimension corresponds to a sinusoid with different wavelengths
  • Allows the model to learn relative positions
  • Can extrapolate to sequence lengths longer than those seen during training
  • The encoding for position $pos + k$ can be represented as a linear function of the encoding for position $pos$

Layer Normalization

The original paper applies it after each residual connection (Post-LN, as in the encoder equations above); many modern implementations instead apply it before each sub-layer (Pre-LN) for more stable training: \[\text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sigma} + \beta\]

Where:

  • $\mu$ = mean across the feature dimension
  • $\sigma$ = standard deviation across the feature dimension
  • $\gamma, \beta$ = learnable parameters
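
A few lines of NumPy matching this formula; the small epsilon added to the denominator is standard practice for numerical stability, though it does not appear in the formula above:

import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each token's feature vector to zero mean / unit variance, then scale and shift."""
    mu = x.mean(axis=-1, keepdims=True)      # mean over the feature dimension
    sigma = x.std(axis=-1, keepdims=True)    # standard deviation over the feature dimension
    return gamma * (x - mu) / (sigma + eps) + beta

x = np.random.default_rng(0).normal(size=(4, 512))            # 4 tokens, d_model = 512
out = layer_norm(x, gamma=np.ones(512), beta=np.zeros(512))
print(out.mean(axis=-1).round(6), out.std(axis=-1).round(6))  # ~0 and ~1 per token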

Why Transformers over Recurrent Architectures

Computational Advantages

  1. Parallelization:
    • RNNs process sequences sequentially: $h_t = f(h_{t-1}, x_t)$
    • Transformers process all positions simultaneously
    • Massive speedup during training
  2. Sequential Computation:
    • RNNs: $O(n)$ sequential steps (one per token)
    • Transformers: $O(1)$ sequential operations per layer
    • Better GPU utilization

Modeling Advantages

  1. Long-Range Dependencies:
    • RNNs suffer from vanishing gradients over long sequences
    • Transformers have constant path length between any two positions
    • Direct connections via attention mechanism
  2. Information Flow:
    • RNNs create information bottlenecks at each timestep
    • Transformers allow information to flow directly between any positions
    • No information loss through sequential processing

Performance Comparison

Aspect                   RNN/LSTM   Transformer
Sequential Operations    $O(n)$     $O(1)$
Parallel Processing      No         Yes
Path Length              $O(n)$     $O(1)$
Memory per Layer         $O(d)$     $O(n^2)$
Training Speed           Slow       Fast
Long Dependencies        Weak       Strong

Mathematical Complexity Analysis

Self-Attention Complexity:

  • Time: $O(n^2 \cdot d)$ where $n$ = sequence length, $d$ = model dimension
  • Space: $O(n^2)$ for attention weights

RNN Complexity:

  • Time: $O(n \cdot d^2)$
  • Space: $O(d)$ for hidden state

Trade-off: Self-attention is more efficient per layer when $n < d$, which holds for typical sentence lengths at $d_{model} = 512$; for very long sequences, the quadratic $O(n^2)$ term dominates.
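
As a rough worked example, take $n = 50$ and $d = 512$: self-attention costs on the order of $n^2 \cdot d = 50^2 \times 512 \approx 1.3 \times 10^6$ operations per layer, while a recurrent layer costs $n \cdot d^2 = 50 \times 512^2 \approx 1.3 \times 10^7$, roughly ten times more (constant factors ignored).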

Model Architecture Details

Full Architecture

  • Encoder: 6 identical layers
  • Decoder: 6 identical layers
  • Model dimension: $d_{model} = 512$
  • Attention heads: $h = 8$
  • Feed-forward dimension: $d_{ff} = 2048$
  • Dropout: 0.1
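
For reference, these base-model hyperparameters can be collected into a small configuration dictionary (an illustrative convention; the dictionary and its key names are not from the paper):

base_transformer_config = {
    "num_encoder_layers": 6,   # encoder stack depth
    "num_decoder_layers": 6,   # decoder stack depth
    "d_model": 512,            # model (embedding) dimension
    "num_heads": 8,            # attention heads per layer
    "d_ff": 2048,              # feed-forward inner dimension
    "dropout": 0.1,            # dropout rate
}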

Training Details

  • Optimizer: Adam with custom learning rate schedule
  • Learning rate: $lr = d_{model}^{-0.5} \cdot \min(\text{step}^{-0.5},\ \text{step} \cdot \text{warmup\_steps}^{-1.5})$ (sketched after this list)
  • Warmup steps: 4000
  • Label smoothing: $\epsilon = 0.1$
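
The schedule increases the learning rate linearly for the first warmup_steps steps and then decays it proportionally to the inverse square root of the step number. A minimal sketch (the function name is illustrative, and max(step, 1) just avoids division by zero):

def transformer_lr(step, d_model=512, warmup_steps=4000):
    """lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)"""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for s in (100, 1000, 4000, 40000):
    print(s, round(transformer_lr(s), 6))
# The rate peaks at roughly 7e-4 at step 4000, then decays as 1/sqrt(step)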

Impact and Legacy

The Transformer architecture became the foundation for:

  • BERT (Bidirectional Encoder Representations from Transformers)
  • GPT series (Generative Pre-trained Transformers)
  • T5 (Text-to-Text Transfer Transformer)
  • Modern LLMs (ChatGPT, Claude, LLaMA, etc.)

The paper’s core insight—that attention mechanisms alone are sufficient for sequence modeling—fundamentally changed the landscape of NLP and AI.