TF-IDF gives higher weight to words that are distinctive for a sentence and lower weight to words that are common across all sentences.
Bag of Words only counts words.
TF-IDF counts words and also judges how informative each one is.
TF-IDF has TWO parts
TF — Term Frequency
(How often a word appears in a sentence)
TF tells us how important a word is inside one sentence.
Formula:
TF = (Number of times the word appears in the sentence) / (Total number of words in the sentence)
IDF — Inverse Document Frequency
(How rare a word is across all sentences)
IDF tells us how special a word is across the whole dataset.
If a word appears in every sentence, it’s not special.
If it appears in only one sentence, it’s very special.
Formula:
IDF = log( Total sentences / Sentences containing the word )
Example:
Suppose we have 3 sentences after the preprocessing step:
S1 → good boy
S2 → good girl
S3 → boy girl good
Step 1: Build Vocabulary
Unique words:
{ good, boy, girl }
These are our features.
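A minimal sketch of this step in Python, assuming the sentences are already preprocessed and can be split on whitespace. The variable names `sentences` and `vocabulary` are illustrative only.

```python
sentences = ["good boy", "good girl", "boy girl good"]

# Collect unique words in first-seen order so the feature order stays stable.
vocabulary = []
for sentence in sentences:
    for word in sentence.split():
        if word not in vocabulary:
            vocabulary.append(word)

print(vocabulary)  # ['good', 'boy', 'girl']
```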
Step 2: Term Frequency (TF)
TF = (Number of times the word appears in the sentence) / (Total number of words in the sentence)
S1 = “good boy” (2 words)
- good → 1/2
- boy → 1/2
- girl → 0
S2 = “good girl” (2 words)
- good → 1/2
- boy → 0
- girl → 1/2
S3 = “boy girl good” (3 words)
- good → 1/3
- boy → 1/3
- girl → 1/3
TF Table
| Word | S1 | S2 | S3 |
| --- | --- | --- | --- |
| good | 1/2 | 1/2 | 1/3 |
| boy | 1/2 | 0 | 1/3 |
| girl | 0 | 1/2 | 1/3 |
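The TF values above can be reproduced with a short sketch. It assumes whitespace tokenization and uses `Fraction` so the output matches the fractions in the notes; the helper name `term_frequency` is hypothetical, and it does not handle empty sentences.

```python
from collections import Counter
from fractions import Fraction

sentences = ["good boy", "good girl", "boy girl good"]
vocabulary = ["good", "boy", "girl"]

def term_frequency(sentence, vocabulary):
    """Return TF = (count of word in sentence) / (words in sentence) per vocabulary word."""
    words = sentence.split()
    counts = Counter(words)  # words missing from the sentence count as 0
    return {word: Fraction(counts[word], len(words)) for word in vocabulary}

for i, sentence in enumerate(sentences, start=1):
    tf = term_frequency(sentence, vocabulary)
    print(f"S{i}", {word: str(value) for word, value in tf.items()})
```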
Step 3: Inverse Document Frequency (IDF)
Now we check:
In how many sentences does each word appear?
Total sentences = 3
S1 → good boy
S2 → good girl
S3 → boy girl good
Count sentence presence:
- good → appears in S1, S2, S3 → 3 sentences
- boy → appears in S1, S3 → 2 sentences
- girl → appears in S2, S3 → 2 sentences
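Counting sentence presence can be sketched as follows. The name `document_frequency` is an assumption made for illustration; each sentence is converted to a set so that a word repeated inside one sentence is still counted once.

```python
sentences = ["good boy", "good girl", "boy girl good"]
vocabulary = ["good", "boy", "girl"]

# For each word, count the sentences that contain it at least once.
document_frequency = {
    word: sum(1 for sentence in sentences if word in set(sentence.split()))
    for word in vocabulary
}

print(document_frequency)  # {'good': 3, 'boy': 2, 'girl': 2}
```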
IDF formula:
IDF = log( Total sentences / Sentences containing the word )
IDF Values
| Word | Calculation | IDF |
| --- | --- | --- |
| good | log(3/3) | 0 |
| boy | log(3/2) | ≈ 0.405 (natural log) |
| girl | log(3/2) | ≈ 0.405 (natural log) |
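A small sketch of the IDF calculation, starting from the sentence counts above. The notes do not say which logarithm base is used; the natural log (`math.log`) is assumed here. A different base rescales every value by the same factor, so the ranking of words does not change.

```python
import math

total_sentences = 3
document_frequency = {"good": 3, "boy": 2, "girl": 2}

# IDF = log(total sentences / sentences containing the word), natural log assumed.
idf = {
    word: math.log(total_sentences / count)
    for word, count in document_frequency.items()
}

for word, value in idf.items():
    print(f"{word}: {value:.3f}")
# good: 0.000
# boy: 0.405
# girl: 0.405
```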
Important Observation
“good” appears everywhere → IDF = 0 (not special)
“boy” and “girl” appear in fewer sentences → higher IDF (more distinctive)
So TF-IDF automatically says:
“good” is common → reduce its weight
“boy” and “girl” are rarer → increase their weight
Step 4: TF × IDF = TF-IDF
Now we multiply:
TF-IDF = TF × IDF
Final TF-IDF Table
| Sentence | good | boy | girl |
| --- | --- | --- | --- |
| S1 | 0 | (1/2) × log(3/2) | 0 |
| S2 | 0 | 0 | (1/2) × log(3/2) |
| S3 | 0 | (1/3) × log(3/2) | (1/3) × log(3/2) |
Explanation:
For S1 (“good boy”):
- good → TF × IDF = (1/2) × 0 = 0
- boy → (1/2) × log(3/2) → positive value
- girl → 0
So S1 becomes:
[0 , important , 0]
Meaning:
👉 “boy” is the key word.
For S2 (“good girl”):
[0 , 0 , important]
👉 “girl” is the key word.
For S3 (“boy girl good”):
[0 , small , small]
Because TF is smaller (1/3 instead of 1/2), the values are smaller than in S1 and S2.
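The whole example can be put together in one sketch, which replaces the labels "important" and "small" with actual numbers. It assumes whitespace tokenization and the natural log; the function name `tf_idf` is hypothetical. Library implementations may apply IDF smoothing or vector normalization, so their numbers can differ from this plain formula.

```python
import math
from collections import Counter

def tf_idf(sentences):
    """Return (vocabulary, rows) where rows[i][j] is TF x IDF of word j in sentence i."""
    tokenized = [sentence.split() for sentence in sentences]

    # Step 1: vocabulary in first-seen order.
    vocabulary = []
    for words in tokenized:
        for word in words:
            if word not in vocabulary:
                vocabulary.append(word)

    # Step 3: IDF = log(total sentences / sentences containing the word).
    total = len(tokenized)
    idf = {
        word: math.log(total / sum(1 for words in tokenized if word in words))
        for word in vocabulary
    }

    # Steps 2 and 4: TF per sentence, multiplied by IDF.
    rows = []
    for words in tokenized:
        counts = Counter(words)
        rows.append([counts[word] / len(words) * idf[word] for word in vocabulary])
    return vocabulary, rows

vocabulary, rows = tf_idf(["good boy", "good girl", "boy girl good"])
print(vocabulary)  # ['good', 'boy', 'girl']
for i, row in enumerate(rows, start=1):
    print(f"S{i}", [round(value, 3) for value in row])
# S1 [0.0, 0.203, 0.0]
# S2 [0.0, 0.0, 0.203]
# S3 [0.0, 0.135, 0.135]
```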
Main Idea of Final TF-IDF
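TF-IDF down-weights words that are common across sentences and highlights rarer, more distinctive words. With this IDF formula, a word that appears in every sentence gets a weight of exactly 0.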
“good” disappears
“boy” and “girl” become important
