Skip-Gram

CBOW combines all context words together and predicts the center word.

But sometimes we want the opposite.

We may want to know:

Given a word, which words usually appear around it?

That leads to Skip-Gram.

Skip-Gram Idea

Skip-Gram does the reverse task.

Instead of predicting target from context, it predicts context from target.

So the direction changes.

CBOW      : Context → Target
Skip-Gram : Target  → Context

Definition

Skip-Gram predicts the surrounding context words using the target word.

Structure:

Input Word (One-hot)
        ↓
Embedding Layer
        ↓
Output Layer (Softmax)
        ↓
Predicted Context Word

So the model:

  1. Takes target word as input
  2. Converts it to embedding
  3. Predicts context words

                (Input Layer)

                      IS
               [0 0 1 0 0 0 0]

                       │
                       │  W¹ (7×5)
                       ▼
             ┌───────────────────┐
             │ Hidden Layer (5D) │
             └───────────────────┘
                       │
                       │  W² (5×7)
                       ▼
             ┌───────────────────┐
             │ Output Layer (7D) │
             └───────────────────┘

        NLP   NAME   IS   REL   TO   DATA   SCI
         ↑     ↑           ↑     ↑
       (one correct context word at a time)
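
To make the diagram concrete, here is a minimal NumPy sketch of this forward pass, assuming the toy 7-word vocabulary and 5-dimensional hidden layer; the random weights are illustrative placeholders, not trained values.

    import numpy as np

    # Toy vocabulary and sizes matching the diagram above.
    vocab = ["NLP", "NAME", "IS", "RELATED", "TO", "DATA", "SCIENCE"]
    V, D = len(vocab), 5

    rng = np.random.default_rng(0)
    W1 = rng.normal(scale=0.1, size=(V, D))   # input  -> hidden, the 7x5 matrix W¹
    W2 = rng.normal(scale=0.1, size=(D, V))   # hidden -> output, the 5x7 matrix W²

    def forward(target_idx):
        h = W1[target_idx]                    # one-hot × W1 just selects one row
        scores = h @ W2                       # raw scores over the 7 vocab words
        e = np.exp(scores - scores.max())     # numerically stable softmax
        return e / e.sum()

    probs = forward(vocab.index("IS"))
    for word, p in zip(vocab, probs):
        print(f"{word:8s} {p:.3f}")

Note that multiplying a one-hot vector by W¹ just picks out one row, which is why that row serves as the word's embedding.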

Example Sentence

NLP NAME IS RELATED TO DATA SCIENCE

Suppose the target word = IS (window size = 2)

Context words:

NLP, NAME, RELATED, TO


Training Pairs in Skip-Gram

Skip-Gram creates separate training pairs.

Instead of predicting the center word, it predicts each surrounding word.

So the pairs become:

(IS → NLP)
(IS → NAME)
(IS → RELATED)
(IS → TO)

Each pair is treated as a separate training example.
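
A small sketch of how these pairs can be generated, assuming a window size of 2; the snippet is illustrative, not taken from any particular library.

    # Generate Skip-Gram training pairs (target -> context) from the toy
    # sentence, assuming a context window of 2 on each side.
    sentence = "NLP NAME IS RELATED TO DATA SCIENCE".split()
    window = 2

    pairs = []
    for i, target in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:                       # skip the target itself
                pairs.append((target, sentence[j]))

    # For the target IS this prints:
    # [('IS', 'NLP'), ('IS', 'NAME'), ('IS', 'RELATED'), ('IS', 'TO')]
    print([p for p in pairs if p[0] == "IS"])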


The model takes IS and assigns a probability to every word in the vocabulary.

Word       Probability
NLP        0.15
NAME       0.18
IS         0.05
RELATED    0.22
TO         0.20

Which one should be correct for (IS → NLP)?

NLP

If NLP is not the highest → the model made a mistake

We use backpropagation to fix it

So the idea here is: we train to

  • Increase the probability of NLP
  • Decrease the others
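
Here is a hedged sketch of what one such backpropagation update could look like for the pair (IS → NLP), assuming the toy 7×5 network with full softmax and cross-entropy; the learning rate and initialization are arbitrary.

    import numpy as np

    vocab = ["NLP", "NAME", "IS", "RELATED", "TO", "DATA", "SCIENCE"]
    V, D, lr = len(vocab), 5, 0.5

    rng = np.random.default_rng(0)
    W1 = rng.normal(scale=0.1, size=(V, D))   # input  -> hidden
    W2 = rng.normal(scale=0.1, size=(D, V))   # hidden -> output

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def train_step(target_idx, context_idx):
        h = W1[target_idx].copy()          # embedding of the target word
        y = softmax(h @ W2)                # predicted distribution over vocab
        err = y.copy()
        err[context_idx] -= 1.0            # cross-entropy gradient: y - one-hot
        grad_h = W2 @ err                  # gradient flowing back to the embedding
        W2[:] -= lr * np.outer(h, err)     # update hidden -> output (in place)
        W1[target_idx] -= lr * grad_h      # update the target word's embedding
        return y[context_idx]              # P(context | target) before the update

    t, c = vocab.index("IS"), vocab.index("NLP")
    for step in range(3):
        print(f"P(NLP | IS) before update {step + 1}: {train_step(t, c):.3f}")

Because the gradient is y minus the one-hot target, the update raises the score of NLP and lowers every other word's score at the same time.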

After 1st Update (Iteration 1)

Word       Probability
NLP        0.25
NAME       0.16
IS         0.05
RELATED    0.20
TO         0.18 ⬇️

Even if NLP becomes the highest, we don’t stop immediately

Why?

Because we want:

Not just the highest probability
But a VERY confident prediction

Compare:

Case 1 (weak learning)

NLP = 0.25, RELATED = 0.20

Difference is small → model is unsure

Case 2 (good learning)

NLP = 0.65, others much lower

Clear winner → model is confident

What training tries to achieve

  • Correct word → probability close to 1
  • Others → close to 0
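
One way to see why: training minimizes the cross-entropy loss −log P(correct word | target), which keeps shrinking as the probability approaches 1. A tiny sketch with the numbers from the two cases above:

    import math

    # Cross-entropy loss -log(p) for the correct word NLP, using the
    # probabilities from Case 1 and Case 2 above.
    weak      = 0.25   # Case 1: NLP barely ahead of RELATED (0.20)
    confident = 0.65   # Case 2: NLP is a clear winner

    print(f"loss, weak:      {-math.log(weak):.3f}")       # ≈ 1.386
    print(f"loss, confident: {-math.log(confident):.3f}")  # ≈ 0.431

The confident prediction has roughly a third of the loss, so gradient descent keeps pushing P(NLP | IS) toward 1 even after NLP is already the top word.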

After 2nd Update

Word       Probability
NLP        0.40
NAME       0.12
IS         0.04
RELATED    0.18
TO         0.16

After 3rd Update

Word       Probability
NLP        0.65
NAME       0.08
IS         0.03
RELATED    0.12
TO         0.12

What Actually Changed Internally

Weights in the network got adjusted

  • Input → embedding improved
  • Embedding → output mapping improved

So the model now connects IS strongly with NLP

Then Move to Next Pair

Now train:

(IS → NAME)

Same process:

  • Increase NAME
  • Decrease others

After many such updates:

IS will have high probability for:

  • NLP
  • NAME
  • RELATED
  • TO

Repeat for All Pairs

(IS → NAME)

(IS → RELATED)

(IS → TO)

For the pair:

(IS → NLP)

Correct answer = NLP

The model adjusts weights so that:

P(NLP | IS) increases

Next training pair:

(IS → NAME)

Now correct answer = NAME

Model adjusts weights again so that:

P(NAME | IS) increases


Important Idea

Skip-Gram does multiple predictions using the same target word.

IS → NLP
IS → NAME
IS → RELATED
IS → TO

Each one is a separate training example.
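
In practice this loop is usually handled by a library. A minimal sketch using Gensim (assuming version 4.x, where sg=1 selects Skip-Gram); the one-sentence corpus and tiny hyperparameters are toy assumptions:

    from gensim.models import Word2Vec

    corpus = [["NLP", "NAME", "IS", "RELATED", "TO", "DATA", "SCIENCE"]]
    model = Word2Vec(sentences=corpus, vector_size=5, window=2,
                     min_count=1, sg=1)          # sg=1 -> Skip-Gram

    print(model.wv["IS"])                        # the 5-D embedding of IS
    print(model.wv.most_similar("IS", topn=3))   # nearest words by cosine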