Skip-Gram

CBOW combines all context words together and predicts the center word.

But sometimes we want the opposite.

We may want to know:

Given a word, which words usually appear around it?

That leads to Skip-Gram.

Skip-Gram Idea

Skip-Gram does the reverse task.

Instead of predicting target from context, it predicts context from target.

So the direction changes.

CBOW      : Context → Target
Skip-Gram : Target  → Context

Definition

Skip-Gram predicts the surrounding context words using the target word.

Structure:

Input Word (One-hot)
        ↓
Embedding Layer
        ↓
Output Layer (Softmax)
        ↓
Predicted Context Word

So the model:

  1. Takes target word as input
  2. Converts it to embedding
  3. Predicts context words

                (Input Layer)

                      IS
               [0 0 1 0 0 0 0]

                       │
                       │  W¹ (7×5)
                       ▼
             ┌───────────────────┐
             │ Hidden Layer (5D) │
             └───────────────────┘
                       │
                       │  W² (5×7)
                       ▼
             ┌───────────────────┐
             │ Output Layer (7D) │
             └───────────────────┘

        NLP   NAME   IS   REL   TO   DATA   SCI
         ↑     ↑           ↑     ↑
       (one correct context word at a time)
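
To make the diagram concrete, here is a minimal NumPy sketch of this forward pass, assuming the toy 7-word vocabulary and 5-dimensional hidden layer; the random weights are illustrative placeholders, not trained values.

    import numpy as np

    # Toy vocabulary and sizes matching the diagram above.
    vocab = ["NLP", "NAME", "IS", "RELATED", "TO", "DATA", "SCIENCE"]
    V, D = len(vocab), 5

    rng = np.random.default_rng(0)
    W1 = rng.normal(scale=0.1, size=(V, D))   # input  -> hidden, the 7x5 matrix W¹
    W2 = rng.normal(scale=0.1, size=(D, V))   # hidden -> output, the 5x7 matrix W²

    def forward(target_idx):
        h = W1[target_idx]                    # one-hot × W1 just selects one row
        scores = h @ W2                       # raw scores over the 7 vocab words
        e = np.exp(scores - scores.max())     # numerically stable softmax
        return e / e.sum()

    probs = forward(vocab.index("IS"))
    for word, p in zip(vocab, probs):
        print(f"{word:8s} {p:.3f}")

Note that multiplying a one-hot vector by W¹ just picks out one row, which is why that row serves as the word's embedding.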

Example Sentence

NLP NAME IS RELATED TO DATA SCIENCE

Suppose the target word = IS (window size = 2)

Context words:

NLP, NAME, RELATED, TO


Training Pairs in Skip-Gram

Skip-Gram creates separate training pairs.

Instead of predicting the center word, it predicts each surrounding word.

So the pairs become:

(IS → NLP)
(IS → NAME)
(IS → RELATED)
(IS → TO)

Each pair is treated as a separate training example.
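
A small sketch of how these pairs can be generated, assuming a window size of 2; the snippet is illustrative, not taken from any particular library.

    # Generate Skip-Gram training pairs (target -> context) from the toy
    # sentence, assuming a context window of 2 on each side.
    sentence = "NLP NAME IS RELATED TO DATA SCIENCE".split()
    window = 2

    pairs = []
    for i, target in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:                       # skip the target itself
                pairs.append((target, sentence[j]))

    # For the target IS this prints:
    # [('IS', 'NLP'), ('IS', 'NAME'), ('IS', 'RELATED'), ('IS', 'TO')]
    print([p for p in pairs if p[0] == "IS"])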


The model takes IS and assigns a probability to every word in the vocabulary.

Word       Probability
NLP        0.15
NAME       0.18
IS         0.05
RELATED    0.22
TO         0.20

Which one should be correct for (IS → NLP)?

NLP

If NLP is not the highest → the model made a mistake

We use backpropagation to fix it

So the idea here is: we train to

  • Increase the probability of NLP
  • Decrease the others
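
Here is a hedged sketch of what one such backpropagation update could look like for the pair (IS → NLP), assuming the toy 7×5 network with full softmax and cross-entropy; the learning rate and initialization are arbitrary.

    import numpy as np

    vocab = ["NLP", "NAME", "IS", "RELATED", "TO", "DATA", "SCIENCE"]
    V, D, lr = len(vocab), 5, 0.5

    rng = np.random.default_rng(0)
    W1 = rng.normal(scale=0.1, size=(V, D))   # input  -> hidden
    W2 = rng.normal(scale=0.1, size=(D, V))   # hidden -> output

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def train_step(target_idx, context_idx):
        h = W1[target_idx].copy()          # embedding of the target word
        y = softmax(h @ W2)                # predicted distribution over vocab
        err = y.copy()
        err[context_idx] -= 1.0            # cross-entropy gradient: y - one-hot
        grad_h = W2 @ err                  # gradient flowing back to the embedding
        W2[:] -= lr * np.outer(h, err)     # update hidden -> output (in place)
        W1[target_idx] -= lr * grad_h      # update the target word's embedding
        return y[context_idx]              # P(context | target) before the update

    t, c = vocab.index("IS"), vocab.index("NLP")
    for step in range(3):
        print(f"P(NLP | IS) before update {step + 1}: {train_step(t, c):.3f}")

Because the gradient is y minus the one-hot target, the update raises the score of NLP and lowers every other word's score at the same time.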

After 1st Update (Iteration 1)

Word       Probability
NLP        0.25
NAME       0.16
IS         0.05
RELATED    0.20
TO         0.18 ⬇️

Even if NLP becomes the highest, we don’t stop immediately

Why?

Because we want:

Not just the highest probability
But a VERY confident prediction

Compare:

Case 1 (weak learning)

NLP = 0.25, RELATED = 0.20

Difference is small → model is unsure

Case 2 (good learning)

NLP = 0.65, others much lower

Clear winner → model is confident

What training tries to achieve

  • Correct word → probability close to 1
  • Others → close to 0
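
One way to see why: training minimizes the cross-entropy loss −log P(correct word | target), which keeps shrinking as the probability approaches 1. A tiny sketch with the numbers from the two cases above:

    import math

    # Cross-entropy loss -log(p) for the correct word NLP, using the
    # probabilities from Case 1 and Case 2 above.
    weak      = 0.25   # Case 1: NLP barely ahead of RELATED (0.20)
    confident = 0.65   # Case 2: NLP is a clear winner

    print(f"loss, weak:      {-math.log(weak):.3f}")       # ≈ 1.386
    print(f"loss, confident: {-math.log(confident):.3f}")  # ≈ 0.431

The confident prediction has roughly a third of the loss, so gradient descent keeps pushing P(NLP | IS) toward 1 even after NLP is already the top word.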

After 2nd Update

Word       Probability
NLP        0.40
NAME       0.12
IS         0.04
RELATED    0.18
TO         0.16

After 3rd Update

Word       Probability
NLP        0.65
NAME       0.08
IS         0.03
RELATED    0.12
TO         0.12

What Actually Changed Internally

Weights in the network got adjusted

  • Input → embedding improved
  • Embedding → output mapping improved

So the model now connects IS strongly with NLP

Then Move to Next Pair

Now train:

(IS → NAME)

Same process:

  • Increase NAME
  • Decrease others

After many such updates:

IS will have high probability for:

  • NLP
  • NAME
  • RELATED
  • TO

Repeat for All Pairs

(IS → NAME)

(IS → RELATED)

(IS → TO)

For the pair:

(IS → NLP)

Correct answer = NLP

The model adjusts weights so that:

P(NLP | IS) increases

Next training pair:

(IS → NAME)

Now correct answer = NAME

Model adjusts weights again so that:

P(NAME | IS) increases


Important Idea

Skip-Gram does multiple predictions using the same target word.

IS → NLP
IS → NAME
IS → RELATED
IS → TO

Each one is a separate training example.
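
In practice this loop is usually handled by a library. A minimal sketch using Gensim (assuming version 4.x, where sg=1 selects Skip-Gram); the one-sentence corpus and tiny hyperparameters are toy assumptions:

    from gensim.models import Word2Vec

    corpus = [["NLP", "NAME", "IS", "RELATED", "TO", "DATA", "SCIENCE"]]
    model = Word2Vec(sentences=corpus, vector_size=5, window=2,
                     min_count=1, sg=1)          # sg=1 -> Skip-Gram

    print(model.wv["IS"])                        # the 5-D embedding of IS
    print(model.wv.most_similar("IS", topn=3))   # nearest words by cosine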