BERT Transformer Model

In Word2Vec, each word gets a single fixed vector, so the model cannot capture how a word's meaning changes with the sentence it appears in.

Humans, by contrast, resolve a word's meaning by looking at the entire sentence.

So researchers started asking:

Can a model understand the meaning of a word based on the full sentence?

This led to the development of the Transformer model.

The Transformer is a deep learning architecture that helps computers understand relationships between words in a sentence.

Complete Transformer Architecture

The full Transformer has two parts:

Encoder

Decoder

Input Sentence
      ↓
Encoder
      ↓
Decoder
      ↓
Output Sentence

Encoder

Purpose: Understand the input sentence

Contains:

  • Self-Attention
  • Feed Forward Network

Decoder

Purpose: Generate the output sentence

Example:

Input (English):

I love machine learning

Output (French):

J’aime l’apprentissage automatique

Decoder generates the output word by word.


Simple Comparison

Part    | Role
--------|----------------------
Encoder | Understand input text
Decoder | Generate output text

But in many NLP tasks, we do not need to generate a new sentence.
We only need the model to understand the meaning of the given text.

Examples:

  • sentiment analysis
  • text classification

For these tasks, only the Encoder part is required.

Because of this, models like BERT use only the Transformer Encoder.
BERT focuses on understanding the context of words in a sentence, not generating new sentences.

BERT = Bidirectional Encoder Representations from Transformers

Bidirectional → reads context from both directions

Encoder → uses Transformer encoder layers

Transformers → built on the Transformer architecture

BERT is a language model that understands the meaning of text using the Transformer encoder.

BERT understands the meaning of a word by looking at the entire sentence context.

Example:

Sentence 1
He went to the bank to deposit money.

Sentence 2
He sat near the bank of the river.

BERT looks at all surrounding words and understands that bank has different meanings in different sentences.

BERT reads a sentence in both directions.

Example:

Sentence:

The boy ate the cake because he was hungry.

To understand he, BERT looks at:

  • words before → boy ate the cake
  • words after → was hungry

So it understands he refers to boy.

The key idea behind the Transformer encoder is a mechanism called Attention.

Attention

Example sentence:

The animal didn’t cross the street because it was too tired.

Here the word “it” refers to animal.

The model must learn which word is important.

The Attention mechanism helps the model focus on the most relevant words in the sentence.

Attention tells the model which words to focus on while understanding a sentence.

Core Idea of the Transformer Encoder

Unlike older models like:

  • RNN
  • LSTM

which read sentences one word at a time, the Transformer reads the entire sentence at once.

Benefits:

  • Faster training
  • Better context understanding
  • Captures long-distance relationships

Basic Components of the Transformer Encoder

Overall Flow

Input Sentence
      ↓
Word Embeddings
      ↓
Positional Encoding
      ↓
Self-Attention
      ↓
Feed Forward Neural Network
      ↓
Output Representation

1. Word Embedding

What it does

Computers cannot understand words directly; they only understand numbers.
So the first step is to convert each word into a vector (numbers).

Example sentence:

The cat drinks milk

After embedding:

Word   | Vector (example)
-------|--------------------
The    | [0.21, 0.45, 0.67]
cat    | [0.11, 0.90, 0.32]
drinks | [0.56, 0.12, 0.78]
milk   | [0.34, 0.67, 0.89]

These vectors capture semantic meaning of words.

Word Embedding converts words into numerical vectors so the model can process them.
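The lookup above can be sketched in a few lines of Python. The vectors are the illustrative numbers from the table, not learned values:

```python
import numpy as np

# Toy embedding table: each word maps to a 3-dimensional vector.
# The numbers are illustrative, not learned embeddings.
embeddings = {
    "The":    np.array([0.21, 0.45, 0.67]),
    "cat":    np.array([0.11, 0.90, 0.32]),
    "drinks": np.array([0.56, 0.12, 0.78]),
    "milk":   np.array([0.34, 0.67, 0.89]),
}

def embed(sentence):
    """Convert a sentence into a (num_words, dim) matrix of word vectors."""
    return np.stack([embeddings[w] for w in sentence.split()])

X = embed("The cat drinks milk")
print(X.shape)  # (4, 3): four words, three numbers per word
```

In a real model this table is a learned embedding matrix, and the vectors have hundreds of dimensions (768 in BERT-base).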

2. Positional Encoding

Why it is needed

Transformers process all words at the same time, not sequentially like RNN/LSTM.

Because of that, the model does not automatically know the order of words.

Example:

Sentence 1

Dog bites man

Sentence 2

Man bites dog

Both contain the same words but have different meanings.

So we add position information.

Example:

Word   | Position
-------|---------
The    | 1
cat    | 2
drinks | 3
milk   | 4

This position information is added to embeddings.

Positional Encoding tells the model the order of words in the sentence.
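One classic way to add order information is the sinusoidal positional encoding from the original Transformer paper; a minimal NumPy sketch with toy sizes follows. (Note that BERT itself learns its position embeddings rather than using sinusoids, but the idea of adding position information to the word vectors is the same.)

```python
import numpy as np

def positional_encoding(num_positions, dim):
    """Sinusoidal positional encoding from the original Transformer paper.

    Every position gets a unique pattern of sine/cosine values, so the
    model can tell word order apart. `dim` is assumed even here."""
    pe = np.zeros((num_positions, dim))
    pos = np.arange(num_positions)[:, None]   # positions 0 .. n-1
    i = np.arange(0, dim, 2)                  # even embedding dimensions
    angle = pos / (10000 ** (i / dim))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

# "The cat drinks milk": 4 positions, toy embedding dimension 4.
pe = positional_encoding(4, 4)
# The position information is simply added to the word embeddings:
#   model_input = word_embeddings + pe
```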

3. Self-Attention (Most Important Part)

This is the core idea of Transformers.

What it does

Self-attention helps the model understand which words in the sentence are important for each other.

Example sentence:

The boy ate the cake because he was hungry.

Suppose the focus word is “he”.

Possible answers:

  • boy ✅
  • cake ❌
  • hungry ❌

Assign Importance

Word   | Importance for “he” | Why?
-------|---------------------|---------------------------
boy    | ⭐⭐⭐ (high)         | “he” refers to a person
cake   | ⭐ (low)             | cake is not a person
hungry | ⭐⭐                  | somewhat related (a state)

The word “he” refers to boy.

Self-attention allows the model to connect these related words, even if they are far apart.

Example attention relationship:

Word   | Important words
-------|----------------
he     | boy
cake   | ate
hungry | boy

Self-Attention helps the model focus on important words in the sentence while understanding meaning.
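The mechanism behind this is scaled dot-product self-attention, which can be sketched as follows. The projection matrices here are random stand-ins for learned weights, so the attention weights are illustrative only:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over one sentence.

    X is (num_words, dim); Wq, Wk, Wv are the query/key/value projection
    matrices that a real model learns (random stand-ins below)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # word-to-word relevance scores
    weights = softmax(scores)                # each row sums to 1
    return weights @ V, weights              # context-aware vectors + weights

rng = np.random.default_rng(0)
dim = 4
X = rng.normal(size=(4, dim))        # 4 word vectors, e.g. "The cat drinks milk"
Wq, Wk, Wv = (rng.normal(size=(dim, dim)) for _ in range(3))
out, weights = self_attention(X, Wq, Wk, Wv)
# Row i of `weights` shows how much word i attends to every other word.
```

In a trained model, the row for “he” would put most of its weight on “boy”, which is exactly the importance table above.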

After Self-Attention

Self-attention has already:

✔ Connected words
✔ Shared information between them

So now each word vector becomes context-aware.

Example (just for idea):

cat → [0.3, 0.8, 0.2]    (knows it is the subject)
drinks → [0.7, 0.1, 0.6] (knows it is the action)
milk → [0.5, 0.6, 0.9]   (knows it is the object)

👉 Important:
These vectors now contain context information

Now what is missing?

👉 These vectors are:

  • Mixed information
  • Not fully “processed” yet

They are raw mixed signals, not yet refined.

What does “mixed information” mean?

After self-attention:

👉 Each vector contains:

  • its own meaning
  • information from other words

Example:

cat vector =
   (cat meaning)
 + (info from drinks)
 + (info from milk)

👉 So it becomes a combination of many signals


💡 Analogy

Think of:

🟢 Self-Attention

👉 Like a group discussion

Everyone shares ideas → you collect information


🟣 But after discussion

👉 Your thoughts are:

  • mixed
  • not organized
  • not finalized

👉 That is what we mean by:
“not fully processed”

4. Feed Forward Neural Network

After attention finds the relationships, the result goes through a small neural network.

What FFN does

Now FFN:

👉 Cleans and refines that mixed vector

Example:

Before FFN:
cat → [0.3, 0.8, 0.2]  (mixed info)

After FFN:
cat → [0.9, 0.1, 0.7]  (clearer representation)

👉 Self-Attention = mix information between words
👉 FFN = process each word deeply

Component      | Meaning
---------------|-------------------------------
Self-Attention | Words talk to each other 🗣️
FFN            | Each word thinks individually 🧠

🟢 1. “Words talk to each other” (Self-Attention)

Take sentence:

👉 “The cat drinks milk”


What happens?

👉 Each word looks at other words

Example:

  • cat looks at → drinks, milk
  • drinks looks at → cat, milk
  • milk looks at → drinks

👉 They exchange information


💡 Meaning

cat learns → it is doing action
milk learns → it is being consumed

👉 So:
✔ Words share context
✔ Information is mixed


🟣 2. “Each word thinks individually” (FFN)

Now after mixing…

👉 Each word is processed separately


What does that mean?

👉 No communication now ❌
👉 Each word is handled alone ✔

cat    → processed alone
drinks → processed alone
milk   → processed alone


💡 Why?

Because now each word has:

  • enough context
  • combined information

👉 Now it needs to:
✔ refine
✔ strengthen important features


🎯 Key Difference (VERY IMPORTANT)

Feature     | Self-Attention      | FFN
------------|---------------------|------------
Interaction | Yes (between words) | No
Processing  | Shared              | Individual
Role        | Gather info         | Refine info

What it does

  • Processes the attention output
  • Learns deeper patterns
  • Refines the representation of each word

Think of it like a feature processing layer.

The Feed Forward Network processes the information learned from attention and improves the representation.
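A minimal sketch of such a position-wise feed-forward layer, with toy dimensions and random stand-ins for the learned weights:

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise feed-forward network: two linear layers with a ReLU
    in between, applied to each word vector independently."""
    hidden = np.maximum(0, x @ W1 + b1)  # expand, keep only useful signals (ReLU)
    return hidden @ W2 + b2              # project back to the model dimension

rng = np.random.default_rng(1)
dim, hidden_dim = 4, 16                  # toy sizes; BERT-base uses 768 and 3072
W1, b1 = rng.normal(size=(dim, hidden_dim)), np.zeros(hidden_dim)
W2, b2 = rng.normal(size=(hidden_dim, dim)), np.zeros(dim)

X = rng.normal(size=(4, dim))            # 4 context-aware vectors from self-attention
refined = feed_forward(X, W1, b1, W2, b2)
# Each word is processed alone: row i of `refined` depends only on row i of X.
```

Notice there is no word-to-word interaction inside this function, which is exactly the “each word thinks individually” idea from the table above.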

So the Transformer encoder is made of multiple neural network components, mainly Self-Attention and Feed Forward layers, along with an embedding layer and positional encoding.

5. Output Representation

After these steps, the model produces contextual embeddings.

This means each word vector now depends on the entire sentence.

Example:

Sentence 1

He went to the bank to deposit money

Sentence 2

He sat near the bank of the river

The word bank will have a different vector in each sentence.

The output representation is a context-aware vector for each word.
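A deliberately crude illustration of this idea: blending a word's static vector with a neighbouring word (a stand-in for the full encoder's attention mixing) already makes “bank” come out different in the two sentences. All vectors and the 50/50 blend are made up for illustration:

```python
import numpy as np

# Toy static vectors; the numbers are invented, not real embeddings.
static = {
    "bank":  np.array([0.5, 0.5]),
    "money": np.array([0.9, 0.1]),   # finance-flavoured neighbour
    "river": np.array([0.1, 0.9]),   # nature-flavoured neighbour
}

def contextualize(word, context_word):
    """Blend a word's vector with its context (crude stand-in for the encoder)."""
    return 0.5 * static[word] + 0.5 * static[context_word]

bank_finance = contextualize("bank", "money")  # "bank to deposit money"
bank_nature  = contextualize("bank", "river")  # "bank of the river"
print(np.allclose(bank_finance, bank_nature))  # False: same word, different vectors
```

The real encoder does this mixing with learned attention weights over the whole sentence, but the outcome is the same: the vector for “bank” depends on its context.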

We just learned about the Transformer architecture, which can understand relationships between words using self-attention and produce contextual word representations.
Researchers then used this architecture to build powerful language models.
One of the most important models is BERT.

Where BERT Is Used

  • Chatbots
  • Google search
  • Question answering systems
  • Sentiment analysis
  • Text classification