AI Core Concepts (Part 9): Transformers

Transformers are the backbone of modern deep learning models in NLP, vision, and multimodal AI. They power models like BERT, GPT, T5, Vision Transformers, and more.


1. What Is a Transformer?

A Transformer is a neural network architecture introduced in the paper "Attention Is All You Need" (Vaswani et al., 2017). It replaces recurrence with self-attention, enabling parallel computation across the sequence and direct modeling of long-range context.

Key Components:

- Self-attention and multi-head attention
- Positional encodings to inject token order (see the sketch below)
- Position-wise feed-forward networks
- Residual connections and layer normalization
- Stacked encoder and/or decoder layers
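
Attention by itself is permutation-invariant, so Transformers add positional information to the token embeddings before the first layer. Below is a minimal sketch of the sinusoidal encoding from the original paper; the function name is ours, and it assumes an even d_model.

import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    # PE(pos, 2i) = sin(pos / 10000^(2i/d_model)); PE(pos, 2i+1) uses cos
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # (seq_len, 1)
    two_i = torch.arange(0, d_model, 2, dtype=torch.float32)       # (d_model/2,)
    angles = pos / torch.pow(torch.tensor(10000.0), two_i / d_model)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)  # even embedding dimensions
    pe[:, 1::2] = torch.cos(angles)  # odd embedding dimensions
    return pe
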
2. Self-Attention Explained

Given a sequence of input tokens, self-attention computes, for each token, a weighted sum of every token's value vector; the weights come from the similarity between that token's query vector and each token's key vector.

Attention Formula:

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) * V

This lets each token "attend" to every other token, capturing dependencies regardless of distance. Here dₖ is the dimensionality of the key vectors; dividing by √dₖ keeps the dot products from growing so large that the softmax saturates.
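
The formula translates almost directly into code. A minimal sketch in PyTorch (the function name and the toy shapes at the end are illustrative assumptions):

import torch
import torch.nn.functional as F

def attention(Q, K, V):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # QKᵀ / √dₖ
    weights = F.softmax(scores, dim=-1)            # each row sums to 1
    return weights @ V                             # weighted sum of values

# e.g. 10 tokens with 64-dimensional queries, keys, and values
Q = K = V = torch.randn(10, 64)
out = attention(Q, K, V)  # shape (10, 64)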


3. Encoder vs Decoder

Component        Role                         Example Use
Encoder          Bidirectional understanding  BERT, RoBERTa
Decoder          Autoregressive generation    GPT, LLaMA
Encoder-Decoder  Translation, summarization   T5, BART, mT5
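
The practical difference is masking: a decoder applies a causal mask so each position can only attend to earlier positions, which is what makes autoregressive generation possible. A small sketch using PyTorch's built-in attention module (the sizes are arbitrary):

import torch
import torch.nn as nn

seq_len, embed_dim, heads = 5, 16, 4
attn = nn.MultiheadAttention(embed_dim, heads, batch_first=True)
x = torch.randn(1, seq_len, embed_dim)

# True entries mark positions a query may NOT attend to:
# everything above the diagonal, i.e. future tokens.
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
out, _ = attn(x, x, x, attn_mask=causal_mask)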

4. Example: Minimal Transformer Block (PyTorch)

import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, embed_dim, heads):
        super().__init__()
        # batch_first=True means inputs are (batch, seq_len, embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, heads, batch_first=True)
        # Position-wise feed-forward network with the usual 4x expansion
        self.ff = nn.Sequential(
            nn.Linear(embed_dim, embed_dim * 4),
            nn.ReLU(),
            nn.Linear(embed_dim * 4, embed_dim)
        )
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)

    def forward(self, x):
        # Self-attention: the sequence attends to itself (Q = K = V = x)
        attn_out, _ = self.attn(x, x, x)
        # Residual connection followed by layer normalization (post-norm)
        x = self.norm1(x + attn_out)
        ff_out = self.ff(x)
        return self.norm2(x + ff_out)
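
A quick shape check (toy sizes, assuming the batch_first layout used above):

block = TransformerBlock(embed_dim=64, heads=4)
x = torch.randn(2, 10, 64)  # (batch, seq_len, embed_dim)
print(block(x).shape)       # torch.Size([2, 10, 64])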

5. Real-World Transformer Models

Model  Type             Purpose
BERT   Encoder          Classification, QA
GPT    Decoder          Text generation
T5     Encoder-Decoder  Summarization, translation
ViT    Encoder          Vision tasks
SAM    Encoder-Decoder  Image segmentation
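
Most of these models are a pip install away. A minimal sketch using the Hugging Face transformers library, assuming it is installed (the default model for this task is a small BERT-family encoder):

from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # downloads a default encoder model
print(classifier("Transformers are remarkably versatile."))
# [{'label': 'POSITIVE', 'score': ...}]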

6. Applications of Transformers

- Machine translation and summarization
- Conversational AI and open-ended text generation
- Search, retrieval, and question answering
- Computer vision: classification, detection, segmentation (ViT, SAM)
- Speech recognition and synthesis
- Code generation and scientific domains such as protein structure prediction

7. Strengths and Limitations

✅ Strengths:

- Training parallelizes across the whole sequence (no recurrence), which makes large-scale pretraining practical
- Attention captures long-range dependencies directly, regardless of distance
- Pretrained models transfer well to many downstream tasks

⚠️ Limitations:

- Self-attention costs O(n²) time and memory in sequence length, which limits context windows (a quick illustration follows below)
- Training from scratch is data- and compute-hungry
- Attention has no built-in notion of order, so positional encodings are required
- Large models are expensive to serve and hard to interpret
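
To see the quadratic cost concretely: every attention head materializes a seq_len × seq_len score matrix, so doubling the sequence length quadruples the memory for scores. A tiny sketch:

import torch

for n in (128, 256, 512, 1024):
    scores = torch.empty(n, n)  # one score matrix per head per layer
    print(n, scores.numel())    # 16384, 65536, 262144, 1048576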


📚 Further Resources

- "Attention Is All You Need" (Vaswani et al., 2017): https://arxiv.org/abs/1706.03762
- The Illustrated Transformer (Jay Alammar): https://jalammar.github.io/illustrated-transformer/
- The Annotated Transformer (Harvard NLP): https://nlp.seas.harvard.edu/annotated-transformer/