CBOW combines all context words together and predicts the center word.
But sometimes we want the opposite.
We may want to know:
Given a word, which words usually appear around it?
That leads to Skip-Gram.
Skip-Gram Idea
Skip-Gram does the reverse task.
Instead of predicting target from context, it predicts context from target.
So the direction changes.
CBOW : Context → Target
Skip-Gram : Target → Context
Definition
Skip-Gram predicts the surrounding context words using the target word.
Structure:
Input Word (One-hot)
↓
Embedding Layer
↓
Output Layer (Softmax)
↓
Predicted Context Word
So the model:
- Takes target word as input
- Converts it to embedding
- Predicts context words
(Input Layer)
IS
[0 0 1 0 0 0 0]
        │
        │ W¹ (7×5)
        ▼
+-------------------+
| Hidden Layer (5D) |
+-------------------+
        │
        │ W² (5×7)
        ▼
+-------------------+
| Output Layer (7D) |
+-------------------+
NLP   NAME   IS   RELATED   TO   DATA   SCIENCE
 ↑     ↑            ↑        ↑
(one correct context word at a time)
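To make the structure concrete, here is a minimal NumPy sketch of this forward pass, assuming the 7-word vocabulary and 5-dimensional embedding from the diagram. The word order, random weights, and the forward helper are illustrative assumptions, not part of any library:

```python
import numpy as np

# Vocabulary from the example sentence (this ordering is assumed for illustration)
vocab = ["NLP", "NAME", "IS", "RELATED", "TO", "DATA", "SCIENCE"]
word2idx = {w: i for i, w in enumerate(vocab)}

V, D = len(vocab), 5                       # vocabulary size 7, embedding size 5
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(V, D))    # input -> embedding weights  (7x5)
W2 = rng.normal(scale=0.1, size=(D, V))    # embedding -> output weights (5x7)

def forward(target_word):
    """One-hot target -> embedding -> softmax over the vocabulary."""
    x = np.zeros(V)
    x[word2idx[target_word]] = 1.0         # e.g. IS -> [0 0 1 0 0 0 0]
    h = x @ W1                             # hidden layer: the 5D embedding of the target
    scores = h @ W2                        # one raw score per vocabulary word (7D)
    e = np.exp(scores - scores.max())
    return e / e.sum()                     # softmax: P(context word | target word)

print(dict(zip(vocab, forward("IS").round(3))))
```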
Example Sentence
NLP NAME IS RELATED TO DATA SCIENCE
Suppose target word = IS
Context words:
NLP, NAME, RELATED, TO
2️⃣ Training Pairs in Skip-Gram
Skip-Gram creates separate training pairs.
Instead of predicting the center word, it predicts each surrounding word.
So the pairs become:
(IS → NLP)
(IS → NAME)
(IS → RELATED)
(IS → TO)
Each pair is treated as a separate training example.
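A small sketch of how these pairs can be generated, assuming a context window of 2 (which matches the four context words listed above); the skipgram_pairs helper is just an illustrative name:

```python
sentence = "NLP NAME IS RELATED TO DATA SCIENCE".split()

def skipgram_pairs(tokens, window=2):
    """Return every (target, context) pair within the given window."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

# Pairs whose target is IS: (IS, NLP), (IS, NAME), (IS, RELATED), (IS, TO)
print([pair for pair in skipgram_pairs(sentence) if pair[0] == "IS"])
```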
The model takes IS and outputs a probability for every word in the vocabulary, for example:

| Word | Probability |
| --- | --- |
| NLP | 0.15 |
| NAME | 0.18 |
| IS | 0.05 |
| RELATED | 0.22 |
| TO | 0.20 |
Which word should be correct for the pair (IS → NLP)?
NLP
If NLP does not get the highest probability → the model made a mistake.
We use backpropagation to fix it.
So the core idea is that we train the model to:
- Increase the probability of NLP
- Decrease the probabilities of the other words
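Continuing the NumPy sketch from earlier (it reuses vocab, word2idx, W1, W2 and forward), one possible gradient step for the pair (IS → NLP) could look like the following, assuming the standard softmax cross-entropy loss:

```python
def train_pair(target, context, lr=0.1):
    """One SGD step on a single (target -> context) pair."""
    t, c = word2idx[target], word2idx[context]
    h = W1[t].copy()                   # embedding of the target word
    probs = forward(target)            # current P(word | target)

    error = probs.copy()
    error[c] -= 1.0                    # cross-entropy gradient: probs - one_hot(context)

    grad_W2 = np.outer(h, error)       # gradient for embedding -> output weights (5x7)
    grad_h = W2 @ error                # gradient flowing back into the embedding (5D)

    W2[:] -= lr * grad_W2              # all output weights move a little
    W1[t] -= lr * grad_h               # only the target word's row of W1 changes

train_pair("IS", "NLP")
print(dict(zip(vocab, forward("IS").round(3))))   # P(NLP | IS) should rise slightly
```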
After 1st Update (Iteration 1)
| Word | Probability |
| --- | --- |
| NLP | 0.25 |
| NAME | 0.16 |
| IS | 0.05 |
| RELATED | 0.20 |
| TO | 0.18 |
Even if NLP becomes the highest, we don't stop training immediately.
Why?
Because we want:
Not just the highest probability,
But a VERY confident prediction.
Compare:
Case 1 (weak learning)
NLP = 0.25, RELATED = 0.20
Difference is small → model is unsure
Case 2 (good learning)
NLP = 0.65, others much lower
Clear winner → model is confident
What training tries to achieve
- Correct word → probability close to 1
- Others → close to 0
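Cross-entropy makes this concrete: the loss for a training pair is minus the log of the probability given to the correct word, so the weak case above is penalised roughly three times more than the confident one. The numbers below are just the two cases compared earlier:

```python
import math

# Loss for the correct word NLP in the two cases above
weak = -math.log(0.25)        # ≈ 1.39  (model is unsure)
confident = -math.log(0.65)   # ≈ 0.43  (model is confident)
print(round(weak, 2), round(confident, 2))
```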
After 2nd Update
| Word | Probability |
| --- | --- |
| NLP | 0.40 |
| NAME | 0.12 |
| IS | 0.04 |
| RELATED | 0.18 |
| TO | 0.16 |
After 3rd Update
| Word | Probability |
| --- | --- |
| NLP | 0.65 |
| NAME | 0.08 |
| IS | 0.03 |
| RELATED | 0.12 |
| TO | 0.12 |
What Actually Changed Internally
Weights in the network got adjusted:
- The input → embedding weights (W¹) improved
- The embedding → output weights (W²) improved
So the model now connects IS strongly with NLP.
Then Move to Next Pair
Now train:
(IS → NAME)
Same process:
- Increase NAME
- Decrease others
After many such updates:
IS will give high probability to all of its context words:
- NLP
- NAME
- RELATED
- TO
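Putting the earlier sketches together (reusing skipgram_pairs, sentence, train_pair and forward), the whole procedure is just a loop over every pair; the epoch count and learning rate below are arbitrary choices for this toy example:

```python
pairs = skipgram_pairs(sentence, window=2)

for epoch in range(200):                  # sweep over the whole pair list many times
    for target, context in pairs:
        train_pair(target, context, lr=0.05)

# After training, IS should spread its probability over its real neighbours
print(dict(zip(vocab, forward("IS").round(2))))
```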
Repeat for All Pairs
(IS → NAME)
(IS → RELATED)
(IS → TO)
3️⃣ Adjusting the Weights, Pair by Pair
For the pair:
(IS → NLP)
Correct answer = NLP
The model adjusts weights so that:
P(NLP | IS) increases
4️⃣ Next Training Pair
(IS → NAME)
Now correct answer = NAME
Model adjusts weights again so that:
P(NAME | IS) increases
5️⃣ Important Idea
Skip-Gram does multiple predictions using the same target word.
IS → NLP
IS → NAME
IS → RELATED
IS → TO
Each one is a separate training example.
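For comparison, the same idea with an off-the-shelf implementation. This is a minimal sketch assuming gensim 4.x is installed; sg=1 selects Skip-Gram, and the tiny one-sentence corpus and parameter values are purely illustrative:

```python
from gensim.models import Word2Vec

corpus = [["NLP", "NAME", "IS", "RELATED", "TO", "DATA", "SCIENCE"]]

# sg=1 selects Skip-Gram; vector_size=5 and window=2 mirror the toy example above
model = Word2Vec(corpus, vector_size=5, window=2, sg=1, min_count=1, epochs=100)

print(model.wv["IS"])                        # the 5-dimensional vector learned for IS
print(model.wv.most_similar("IS", topn=3))   # closest words in the learned space
```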
