Transformer

In Word2Vec we learned that each word gets only one fixed vector, so the model cannot use the context of the sentence. For example, the word “bank” gets the same vector in “river bank” and “bank account”.

But humans understand that the meanings are different because we look at the entire sentence.

So researchers started asking:

Can a model understand the meaning of a word based on the full sentence?

This led to the development of the Transformer model.

The Transformer is a deep learning architecture that helps computers understand relationships between words in a sentence.

Complete Transformer Architecture

The full Transformer has two parts:

Encoder

Decoder

Input Sentence
      ↓
Encoder
      ↓
Decoder
      ↓
Output Sentence

Encoder

Purpose: Understand the input sentence

Contains:

  • Self-Attention
  • Feed Forward Network

Decoder

Purpose: Generate the output sentence

Example:

Input (English):

I love machine learning

Output (French):

J’aime l’apprentissage automatique

Decoder generates the output word by word.


Simple Comparison

Part      Role
Encoder   Understand input text
Decoder   Generate output text

But in many NLP tasks, we do not need to generate a new sentence.
We only need the model to understand the meaning of the given text.

Examples:

  • sentiment analysis
  • text classification

For these tasks, only the Encoder part is required.

Because of this, models like BERT use only the Transformer Encoder.
BERT focuses on understanding the context of words in a sentence, not generating new sentences.

The key idea of the Transformer Encoder is something called Attention.

Attention

Example sentence:

The animal didn’t cross the street because it was too tired.

Here the word “it” refers to the animal.

The model must learn which word “it” refers to.

The Attention mechanism helps the model focus on the most relevant words in the sentence.

Attention tells the model which words to focus on while understanding a sentence.

Core Idea of Transformer

Unlike older models like:

  • RNN
  • LSTM

which read sentences one word at a time, the Transformer reads the entire sentence at once.

Benefits:

  • Faster training
  • Better context understanding
  • Captures long-distance relationships

Basic Components of Transformer

Overall Flow

Input Sentence
      ↓
Word Embeddings
      ↓
Positional Encoding
      ↓
Self-Attention
      ↓
Feed Forward Neural Network
      ↓
Output Representation

1. Word Embedding

What it does

Computers cannot understand words directly; they only understand numbers.
So the first step is to convert each word into a vector (numbers).

Example sentence:

The cat drinks milk

After embedding:

Word      Vector (example)
The       [0.21, 0.45, 0.67]
cat       [0.11, 0.90, 0.32]
drinks    [0.56, 0.12, 0.78]
milk      [0.34, 0.67, 0.89]

These vectors capture the semantic meaning of words.

Word Embedding converts words into numerical vectors so the model can process them.
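The lookup above can be sketched in a few lines of NumPy. The vocabulary and the vector values here are made up for illustration; in a real model the embedding table is learned during training.

```python
import numpy as np

# Toy vocabulary and 3-dimensional embedding table (illustrative values;
# real models use vocabularies of tens of thousands of tokens and 512+ dims).
vocab = {"the": 0, "cat": 1, "drinks": 2, "milk": 3}
embedding_table = np.array([
    [0.21, 0.45, 0.67],  # the
    [0.11, 0.90, 0.32],  # cat
    [0.56, 0.12, 0.78],  # drinks
    [0.34, 0.67, 0.89],  # milk
])

def embed(sentence):
    """Look up one vector per word."""
    ids = [vocab[w] for w in sentence.lower().split()]
    return embedding_table[ids]

vectors = embed("The cat drinks milk")
print(vectors.shape)  # (4, 3): one 3-d vector per word
```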

2. Positional Encoding

Why it is needed

Transformers process all words at the same time, not sequentially like RNN/LSTM.

Because of that, the model does not automatically know the order of words.

Example:

Sentence 1

Dog bites man

Sentence 2

Man bites dog

Both contain the same words but have different meanings.

So we add position information.

Example:

Word      Position
The       1
cat       2
drinks    3
milk      4

This position information is added to the word embeddings.

Positional Encoding tells the model the order of words in the sentence.
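One common way to add position information is the sinusoidal scheme from the original Transformer paper, where each position gets a unique pattern of sine and cosine values. A minimal NumPy sketch:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding.

    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    """
    positions = np.arange(seq_len)[:, None]       # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]      # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                  # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                  # odd dimensions: cosine
    return pe

# The position vectors are simply added to the word embeddings:
#   x = embeddings + positional_encoding(seq_len, d_model)
pe = positional_encoding(seq_len=4, d_model=8)
print(pe.shape)  # (4, 8)
```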

3. Self-Attention (Most Important Part)

This is the core idea of Transformers.

What it does

Self-attention helps the model understand which words in the sentence are important for each other.

Example sentence:

The boy ate the cake because he was hungry.

The word “he” refers to the boy.

Self-attention allows the model to connect these related words, even if they are far apart.

Example attention relationship:

Word      Important words
he        boy
cake      ate
hungry    boy

Self-Attention helps the model focus on important words in the sentence while understanding meaning.
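This idea can be sketched as scaled dot-product attention: each word builds a query, compares it against every word's key, and uses the resulting softmax weights to mix the value vectors. The sizes below are toy values and the weights are random (untrained), purely for illustration.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over one sentence.

    X: (seq_len, d_model) word vectors; W_q, W_k, W_v: learned projections.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])       # word-to-word similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax rows
    return weights @ V                            # weighted mix of values

rng = np.random.default_rng(0)
d = 4                                  # toy dimension; real models use 512+
X = rng.normal(size=(5, d))            # 5 words in a sentence
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)
print(out.shape)  # (5, 4): one context-mixed vector per word
```

The softmax makes each row of weights sum to 1, so every output vector is a weighted average of all the words in the sentence, with the largest weights on the most relevant ones.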

4. Feed Forward Neural Network

After attention finds the relationships, the result goes through a small neural network.

What it does

  • Processes the attention output
  • Learns deeper patterns
  • Refines the representation of each word

Think of it like a feature processing layer.

The Feed Forward Network processes the information learned from attention and improves the representation.
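A sketch of this layer, assuming the common form of two linear layers with a ReLU in between (the inner dimension is usually a few times larger than the model dimension); weights here are random stand-ins for learned parameters:

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise feed-forward network, applied to each word independently."""
    hidden = np.maximum(0, x @ W1 + b1)  # expand and apply ReLU
    return hidden @ W2 + b2              # project back down to d_model

rng = np.random.default_rng(0)
d_model, d_ff = 4, 16   # toy sizes; the original paper used 512 -> 2048
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

x = rng.normal(size=(5, d_model))   # 5 word vectors coming out of attention
y = feed_forward(x, W1, b1, W2, b2)
print(y.shape)  # (5, 4): same shape, refined representation
```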

5. Output Representation

After these steps, the model produces contextual embeddings.

This means each word vector now depends on the entire sentence.

Example:

Sentence 1

He went to the bank to deposit money

Sentence 2

He sat near the bank of the river

The word bank will have different vectors in both sentences. The output representation is a context-aware vector for each word.
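This context dependence can be demonstrated even with random, untrained weights: the static embedding for “bank” is identical in both sentences, but after self-attention mixes in the neighbouring words, the two output vectors differ. Everything below (vocabulary, dimensions, weights) is a toy assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
# One fixed (Word2Vec-style) vector per word before the encoder runs.
emb = {w: rng.normal(size=d) for w in
       ["he", "went", "to", "the", "bank", "deposit", "money",
        "sat", "near", "of", "river"]}
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

def encode(words):
    """One self-attention pass: each output vector depends on the whole sentence."""
    X = np.stack([emb[w] for w in words])
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    s = Q @ K.T / np.sqrt(d)
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V

s1 = ["he", "went", "to", "the", "bank", "to", "deposit", "money"]
s2 = ["he", "sat", "near", "the", "bank", "of", "the", "river"]
bank1 = encode(s1)[4]   # output vector for "bank" in sentence 1
bank2 = encode(s2)[4]   # output vector for "bank" in sentence 2
print(np.allclose(bank1, bank2))  # False: the context changed the vector
```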