In Word2Vec we learned that each word gets only one fixed vector, so the model cannot understand the context of the sentence.
But humans understand that a word's meaning changes with context, because we look at the entire sentence.
So researchers started asking:
❓ Can a model understand the meaning of a word based on the full sentence?
This led to the development of the Transformer model.
The Transformer is a deep learning architecture that helps computers understand relationships between words in a sentence.
Complete Transformer Architecture
The full Transformer has two parts:
Encoder
Decoder
Input Sentence
↓
Encoder
↓
Decoder
↓
Output Sentence
Encoder
Purpose: Understand the input sentence
Contains:
- Self-Attention
- Feed Forward Network
Decoder
Purpose: Generate the output sentence
Example:
Input (English):
I love machine learning
Output (French):
J’aime l’apprentissage automatique
The Decoder generates the output sentence word by word.
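This word-by-word loop can be sketched in a few lines. The `encode` and `decoder_step` functions below are hypothetical stubs, not a trained model; the point is only the autoregressive shape of decoding, where each step produces one word until an end-of-sentence marker appears.

```python
# Toy sketch of encoder-decoder translation (hypothetical stub functions,
# not a trained model): the decoder emits one output word at a time.

def encode(sentence):
    # A real encoder would return context vectors; here we just pass tokens.
    return sentence.split()

def decoder_step(encoded, generated):
    # Stand-in for a trained decoder: returns the next output word.
    # Here we fake it with a fixed target sentence for illustration.
    target = ["J'aime", "l'apprentissage", "automatique"]
    return target[len(generated)] if len(generated) < len(target) else "<eos>"

def translate(sentence):
    encoded = encode(sentence)
    generated = []
    while True:
        word = decoder_step(encoded, generated)
        if word == "<eos>":
            break
        generated.append(word)
    return " ".join(generated)

print(translate("I love machine learning"))
```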
Simple Comparison
| Part | Role |
| --- | --- |
| Encoder | Understand input text |
| Decoder | Generate output text |
But in many NLP tasks, we do not need to generate a new sentence.
We only need the model to understand the meaning of the given text.
Examples:
- sentiment analysis
- text classification
For these tasks, only the Encoder part is required.
Because of this, models like BERT use only the Transformer Encoder.
BERT focuses on understanding the context of words in a sentence, not generating new sentences.
The key idea of the Transformer Encoder is something called Attention.
Attention
Example sentence:
The animal didn’t cross the street because it was too tired.
Here the word “it” refers to the animal.
The model must learn which word is important.
The Attention mechanism helps the model focus on the most relevant words in the sentence.
Attention tells the model which words to focus on while understanding a sentence.
Core Idea of Transformer
Unlike older models like:
- RNN
- LSTM
which read sentences one word at a time, the Transformer reads the entire sentence at once.
Benefits:
- Faster training
- Better context understanding
- Captures long-distance relationships
Basic Components of Transformer
Overall Flow
Input Sentence
↓
Word Embeddings
↓
Positional Encoding
↓
Self-Attention
↓
Feed Forward Neural Network
↓
Output Representation
1. Word Embedding
What it does
Computers cannot understand words directly; they only understand numbers.
So the first step is to convert each word into a vector (numbers).
Example sentence:
The cat drinks milk
After embedding:
| Word | Vector (example) |
| --- | --- |
| The | [0.21, 0.45, 0.67] |
| cat | [0.11, 0.90, 0.32] |
| drinks | [0.56, 0.12, 0.78] |
| milk | [0.34, 0.67, 0.89] |
These vectors capture semantic meaning of words.
Word Embedding converts words into numerical vectors so the model can process them.
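An embedding layer is essentially a lookup table from word to vector. The sketch below uses the example vectors from the table above (illustrative numbers, not trained embeddings):

```python
import numpy as np

# Minimal embedding lookup: each word in the vocabulary maps to a
# fixed-size vector. These example vectors are illustrative, not trained.
embedding_table = {
    "The":    np.array([0.21, 0.45, 0.67]),
    "cat":    np.array([0.11, 0.90, 0.32]),
    "drinks": np.array([0.56, 0.12, 0.78]),
    "milk":   np.array([0.34, 0.67, 0.89]),
}

def embed(sentence):
    # Stack the per-word vectors into a (num_words, dim) matrix.
    return np.stack([embedding_table[w] for w in sentence.split()])

X = embed("The cat drinks milk")
print(X.shape)  # (4, 3): 4 words, each a 3-dimensional vector
```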
2. Positional Encoding
Why it is needed
Transformers process all words at the same time, not sequentially like RNN/LSTM.
Because of that, the model does not automatically know the order of words.
Example:
Sentence 1
Dog bites man
Sentence 2
Man bites dog
Both contain the same words but have different meanings.
So we add position information.
Example:
| Word | Position |
| --- | --- |
| The | 1 |
| cat | 2 |
| drinks | 3 |
| milk | 4 |
This position information is added to embeddings.
Positional Encoding tells the model the order of words in the sentence.
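One common way to compute this position information is the sinusoidal encoding from the original Transformer paper: even dimensions use sine, odd dimensions use cosine, at different frequencies. A minimal sketch:

```python
import numpy as np

def positional_encoding(num_positions, d_model):
    """Sinusoidal positional encoding from the original Transformer paper."""
    pos = np.arange(num_positions)[:, None]   # (num_positions, 1)
    i = np.arange(d_model)[None, :]           # (1, d_model)
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])     # even dimensions: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])     # odd dimensions: cosine
    return pe

pe = positional_encoding(4, 8)  # 4 words, 8-dimensional embeddings
# The encoding is simply added to the word embeddings:
# X = embeddings + pe
```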
3. Self-Attention (Most Important Part)
This is the core idea of Transformers.
What it does
Self-attention helps the model understand which words in the sentence are important for each other.
Example sentence:
The boy ate the cake because he was hungry.
The word “he” refers to the boy.
Self-attention allows the model to connect these related words, even if they are far apart.
Example attention relationship:
| Word | Important words |
| --- | --- |
| he | boy |
| cake | ate |
| hungry | boy |
Self-Attention helps the model focus on important words in the sentence while understanding meaning.
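In code, self-attention is a weighted mixing of word vectors: each word scores every other word, the scores are turned into weights with softmax, and each word's new vector is a weighted sum. The sketch below uses random (untrained) weight matrices just to show the mechanics:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over word vectors X (n_words, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # word-to-word relevance scores
    # Softmax: each row becomes a probability distribution over the words.
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ V, weights              # context-mixed vectors + weights

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                  # 4 words, 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, weights = self_attention(X, Wq, Wk, Wv)
# Each row of `weights` sums to 1: how much each word attends to the others.
```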
4. Feed Forward Neural Network
After attention finds the relationships, the result goes through a small neural network.
What it does
- Processes the attention output
- Learns deeper patterns
- Refines the representation of each word
Think of it like a feature processing layer.
The Feed Forward Network processes the information learned from attention and improves the representation.
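This "small neural network" is applied to each word's vector independently: expand to a larger dimension, apply a non-linearity (ReLU), then project back. A minimal sketch with random (untrained) weights:

```python
import numpy as np

def feed_forward(X, W1, b1, W2, b2):
    """Position-wise feed-forward network: applied to each word vector
    independently. Expand, apply ReLU, then project back down."""
    hidden = np.maximum(0, X @ W1 + b1)  # ReLU non-linearity
    return hidden @ W2 + b2

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                        # 4 words from attention
W1, b1 = rng.normal(size=(8, 32)), np.zeros(32)    # expand to 32 dims
W2, b2 = rng.normal(size=(32, 8)), np.zeros(8)     # project back to 8
out = feed_forward(X, W1, b1, W2, b2)
print(out.shape)  # (4, 8): same shape, refined representation
```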
5. Output Representation
After these steps, the model produces contextual embeddings.
This means each word vector now depends on the entire sentence.
Example:
Sentence 1
He went to the bank to deposit money
Sentence 2
He sat near the bank of the river
The word bank will have different vectors in both sentences. The output representation is a context-aware vector for each word.
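We can demonstrate this effect even with random (untrained) weights: feed the same "bank" vector through self-attention next to two different contexts, and the output vectors differ, because attention mixes in the surrounding words.

```python
import numpy as np

# Illustration (untrained, random vectors): with self-attention, the output
# for the SAME input word differs when the surrounding words differ.
rng = np.random.default_rng(0)
d = 8
bank = rng.normal(size=d)             # same static embedding both times
ctx_money = rng.normal(size=(3, d))   # stand-in for "deposit money" context
ctx_river = rng.normal(size=(3, d))   # stand-in for "of the river" context

def attend(X):
    scores = X @ X.T / np.sqrt(d)
    w = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return w @ X

out1 = attend(np.vstack([ctx_money, bank]))[-1]  # "bank" near money words
out2 = attend(np.vstack([ctx_river, bank]))[-1]  # "bank" near river words
print(np.allclose(out1, out2))  # False: the context changed the vector
```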
