In Word2Vec, we learned that each word gets only one fixed vector, so the model cannot adjust a word's meaning based on the context of the sentence.
But humans understand that the same word can mean different things, because we look at the entire sentence.
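To see the limitation concretely, here is a minimal sketch with made-up vectors (not real Word2Vec weights): a static embedding table returns the same vector for "bank" no matter which sentence it appears in.

```python
# Toy static embedding table (invented numbers, not real Word2Vec weights).
static_embeddings = {
    "bank": [0.42, 0.13, 0.77],
    "money": [0.35, 0.90, 0.21],
    "river": [0.12, 0.44, 0.68],
}

def embed(word):
    # A static model ignores the sentence: one fixed vector per word.
    return static_embeddings[word]

# "bank" in a finance sentence vs. a river sentence -> identical vector.
v1 = embed("bank")  # ... deposit money at the bank
v2 = embed("bank")  # ... sat near the bank of the river
print(v1 == v2)  # True: the model cannot tell the two meanings apart
```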
So researchers started asking:
❓ Can a model understand the meaning of a word based on the full sentence?
This led to the development of the Transformer model.
The Transformer is a deep learning architecture that helps computers understand relationships between words in a sentence.
Complete Transformer Architecture
The full Transformer has two parts:
Encoder
Decoder
Input Sentence
↓
Encoder
↓
Decoder
↓
Output Sentence
Encoder
Purpose: Understand the input sentence
Contains:
- Self-Attention
- Feed Forward Network
Decoder
Purpose: Generate the output sentence
Example:
Input (English):
I love machine learning
Output (French):
J’aime l’apprentissage automatique
The decoder generates the output word by word.
Simple Comparison
| Part | Role |
| --- | --- |
| Encoder | Understand input text |
| Decoder | Generate output text |
But in many NLP tasks, we do not need to generate a new sentence.
We only need the model to understand the meaning of the given text.
Examples:
- sentiment analysis
- text classification
For these tasks, only the Encoder part is required.
Because of this, models like BERT use only the Transformer Encoder.
BERT focuses on understanding the context of words in a sentence, not generating new sentences.
BERT = Bidirectional Encoder Representations from Transformers
Bidirectional → reads context from both directions
Encoder → uses Transformer encoder layers
Transformers → built on Transformer architecture
BERT is a language model that understands the meaning of text using the Transformer encoder.
BERT understands the meaning of a word by looking at the entire sentence context.
Example:
Sentence 1
He went to the bank to deposit money.
Sentence 2
He sat near the bank of the river.
BERT looks at all surrounding words and understands that bank has different meanings in different sentences.
BERT reads a sentence in both directions.
Example:
Sentence:
The boy ate the cake because he was hungry.
To understand he, BERT looks at:
- words before → boy ate the cake
- words after → was hungry
So it understands he refers to boy.
The key idea of the Transformer encoder is something called Attention.
Attention
Example sentence:
The animal didn’t cross the street because it was too tired.
Here, the word “it” refers to the animal.
The model must learn which word is important.
The Attention mechanism helps the model focus on the most relevant words in the sentence.
Attention tells the model which words to focus on while understanding a sentence.
Core Idea of Transformer encoder
Unlike older models like:
- RNN
- LSTM
which read sentences one word at a time, the Transformer reads the entire sentence at once.
Benefits:
- Faster training
- Better context understanding
- Captures long-distance relationships
Basic Components of Transformer encoder
Overall Flow
Input Sentence
↓
Word Embeddings
↓
Positional Encoding
↓
Self-Attention
↓
Feed Forward Neural Network
↓
Output Representation
1. Word Embedding
What it does
Computers cannot understand words directly; they only understand numbers.
So the first step is to convert each word into a vector of numbers.
Example sentence:
The cat drinks milk
After embedding:
| Word | Vector (example) |
| --- | --- |
| The | [0.21, 0.45, 0.67] |
| cat | [0.11, 0.90, 0.32] |
| drinks | [0.56, 0.12, 0.78] |
| milk | [0.34, 0.67, 0.89] |
These vectors capture semantic meaning of words.
Word Embedding converts words into numerical vectors so the model can process them.
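A minimal sketch of an embedding lookup, using the toy vocabulary and vectors from the table above (invented numbers, not learned weights):

```python
import numpy as np

# Toy embedding lookup: each word ID indexes a row of an embedding matrix.
# Real models learn these values during training.
vocab = {"The": 0, "cat": 1, "drinks": 2, "milk": 3}
embedding_matrix = np.array([
    [0.21, 0.45, 0.67],   # The
    [0.11, 0.90, 0.32],   # cat
    [0.56, 0.12, 0.78],   # drinks
    [0.34, 0.67, 0.89],   # milk
])

sentence = ["The", "cat", "drinks", "milk"]
ids = [vocab[w] for w in sentence]
embedded = embedding_matrix[ids]   # shape: (4 words, 3 dimensions)
print(embedded.shape)  # (4, 3)
```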
2. Positional Encoding
Why it is needed
Transformers process all words at the same time, not sequentially like RNN/LSTM.
Because of that, the model does not automatically know the order of words.
Example:
Sentence 1
Dog bites man
Sentence 2
Man bites dog
Both contain the same words but have completely different meanings.
So we add position information.
Example:
| Word | Position |
| --- | --- |
| The | 1 |
| cat | 2 |
| drinks | 3 |
| milk | 4 |
This position information is added to embeddings.
Positional Encoding tells the model the order of words in the sentence.
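One common way to compute this position information is the sinusoidal scheme from the original Transformer paper. A minimal NumPy sketch (the embeddings here are random placeholders):

```python
import numpy as np

def positional_encoding(num_positions, dim):
    # Sinusoidal scheme from the original Transformer paper:
    # even dimensions use sine, odd dimensions use cosine.
    pos = np.arange(num_positions)[:, None]          # (positions, 1)
    i = np.arange(dim)[None, :]                      # (1, dims)
    angles = pos / np.power(10000, (2 * (i // 2)) / dim)
    pe = np.zeros((num_positions, dim))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe

# Add position info to the (toy) word embeddings of a 4-word sentence.
embeddings = np.random.rand(4, 8)                    # 4 words, dimension 8
with_positions = embeddings + positional_encoding(4, 8)
print(with_positions.shape)  # (4, 8)
```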
3. Self-Attention (Most Important Part)
This is the core idea of Transformers.
What it does
Self-attention helps the model understand which words in the sentence are important for each other.
Example sentence:
The boy ate the cake because he was hungry.
Suppose the focus word is “he”. Which word does it refer to?
Possible answers:
- boy ✅
- cake ❌
- hungry ❌
Assign Importance
| Word | Importance for “he” | Why? |
| --- | --- | --- |
| boy | ⭐⭐⭐ (high) | “he” refers to a person |
| cake | ⭐ (low) | cake is not a person |
| hungry | ⭐⭐ | somewhat related (state) |
The word “he” refers to boy.
Self-attention allows the model to connect these related words, even if they are far apart.
Example attention relationship:
| Word | Important words |
| --- | --- |
| he | boy |
| cake | ate |
| hungry | boy |
Self-Attention helps the model focus on important words in the sentence while understanding meaning.
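A minimal NumPy sketch of self-attention. For simplicity it uses the word vectors directly as queries, keys, and values; real models learn separate projection matrices for each:

```python
import numpy as np

def self_attention(X):
    # Simplified self-attention with Q = K = V = X (real models learn
    # separate projection matrices for queries, keys, and values).
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                    # similarity of every word pair
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: each row sums to 1
    return weights @ X, weights                      # context vectors + attention map

# Toy 4-word sentence with 3-dimensional embeddings (invented numbers).
X = np.array([
    [0.21, 0.45, 0.67],   # The
    [0.11, 0.90, 0.32],   # cat
    [0.56, 0.12, 0.78],   # drinks
    [0.34, 0.67, 0.89],   # milk
])
out, weights = self_attention(X)
print(out.shape)  # (4, 3): one context-aware vector per word
```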
After Self-Attention
Self-attention has already:
✔ Connected words
✔ Shared information between them
So now each word vector becomes context-aware.
Example (just for idea):
cat → [0.3, 0.8, 0.2] (knows it’s subject)
drinks → [0.7, 0.1, 0.6] (knows action)
milk → [0.5, 0.6, 0.9] (knows object)
👉 Important:
These vectors now contain context information
Now what is missing?
👉 These vectors are:
- a mix of many signals
- not fully “processed” yet
They are raw, mixed signals, not yet refined.
What does “mixed information” mean?
After self-attention:
👉 Each vector contains:
- its own meaning
- information from other words
Example:
cat vector =
(cat meaning)
+ (info from drinks)
+ (info from milk)
👉 So it becomes a combination of many signals
💡 Analogy (very clear)
Think of:
🟢 Self-Attention
👉 Like a group discussion
Everyone shares ideas → you collect information
🟣 But after discussion
👉 Your thoughts are:
- mixed
- not organized
- not finalized
👉 That is what we mean by:
“not fully processed”
4. Feed Forward Neural Network
After attention finds the relationships, the result goes through a small neural network.
What FFN does
Now FFN:
👉 Cleans and refines that mixed vector
Example:
Before FFN:
cat → [0.3, 0.8, 0.2] (mixed info)
After FFN:
cat → [0.9, 0.1, 0.7] (clearer representation)
👉 Self-Attention = mix information between words
👉 FFN = process each word deeply
| Component | Meaning |
| --- | --- |
| Self-Attention | Words talk to each other 🗣️ |
| FFN | Each word thinks individually 🧠 |
🟢 1. “Words talk to each other” (Self-Attention)
Take sentence:
👉 “The cat drinks milk”
What happens?
👉 Each word looks at other words
Example:
- cat looks at → drinks, milk
- drinks looks at → cat, milk
- milk looks at → drinks
👉 They exchange information
💡 Meaning
cat learns → it is doing action
milk learns → it is being consumed
👉 So:
✔ Words share context
✔ Information is mixed
🟣 2. “Each word thinks individually” (FFN)
Now after mixing…
👉 Each word is processed separately
What does that mean?
👉 No communication now ❌
👉 Each word is handled alone ✔
cat → processed alone
drinks → processed alone
milk → processed alone
💡 Why?
Because now each word has:
- enough context
- combined information
👉 Now it needs to:
✔ refine
✔ strengthen important features
🎯 Key Difference (VERY IMPORTANT)
| Feature | Self-Attention | FFN |
| --- | --- | --- |
| Interaction | YES (between words) | NO |
| Processing | Shared | Individual |
| Role | Gather info | Refine info |
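The “NO interaction” property of the FFN can be checked directly: because it is applied position-wise, running it on the whole sentence gives the same result as running it on each word alone. A sketch with random (untrained) weights:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy position-wise feed-forward network: two linear layers with ReLU,
# applied to every word vector independently (weights are random here;
# real models learn them during training).
W1, b1 = rng.normal(size=(3, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 3)), np.zeros(3)

def ffn(X):
    return np.maximum(X @ W1 + b1, 0) @ W2 + b2   # linear -> ReLU -> linear

X = rng.normal(size=(4, 3))                       # 4 words after self-attention
out_together = ffn(X)                             # whole sentence at once
out_one_by_one = np.stack([ffn(x) for x in X])    # each word alone

# Same result: the FFN never mixes information between words.
print(np.allclose(out_together, out_one_by_one))  # True
```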
What it does
- Processes the attention output
- Learns deeper patterns
- Refines the representation of each word
Think of it like a feature processing layer.
The Feed Forward Network processes the information learned from attention and improves the representation.
So, the Transformer encoder is made of multiple neural network components, mainly Self-Attention and Feed Forward layers, along with embedding and positional encoding layers.
5. Output Representation
After these steps, the model produces contextual embeddings.
This means each word vector now depends on the entire sentence.
Example:
Sentence 1
He went to the bank to deposit money
Sentence 2
He sat near the bank of the river
The word bank will have different vectors in both sentences.
The output representation is a context-aware vector for each word.
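A toy demonstration of this effect: feed the same static “bank” vector through simplified self-attention (Q = K = V, invented numbers) with two different sets of surrounding words, and the resulting vectors differ.

```python
import numpy as np

def self_attention(X):
    # Simplified self-attention with Q = K = V = X.
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ X

# Same static vector for "bank", different surrounding words (toy numbers).
bank = np.array([0.5, 0.5, 0.5])
finance_ctx = np.array([[0.9, 0.1, 0.0], bank, [0.8, 0.2, 0.1]])  # deposit, bank, money
river_ctx = np.array([[0.1, 0.9, 0.8], bank, [0.0, 0.8, 0.9]])    # sat, bank, river

bank_in_finance = self_attention(finance_ctx)[1]   # row 1 = "bank"
bank_in_river = self_attention(river_ctx)[1]
print(np.allclose(bank_in_finance, bank_in_river))  # False: context changed the vector
```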
We just learned about the Transformer architecture, which can understand relationships between words using self-attention and produce contextual word representations.
Researchers then used this architecture to build powerful language models.
One of the most important models is BERT.
Where BERT Is Used
- Chatbots
- Google search
- Question answering systems
- Sentiment analysis
- Text classification
