What is Bag of Words?
Bag of Words is a text representation technique that converts sentences into numerical vectors based on word frequency.
Consider the following example.
Original Sentences (Dataset)
S1: He is a good boy
S2: She is a good girl
S3: Boy and girl are good
Step 1: Text Processing (Cleaning)
We:
- convert to lowercase
- remove stopwords like he, she, is, a, and, are
After preprocessing:
S1 → good boy
S2 → good girl
S3 → boy girl good
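The cleaning step above can be sketched in a few lines of Python. This is a minimal illustration, assuming simple whitespace tokenization and using only the stopwords listed in this example; the names `STOPWORDS` and `preprocess` are hypothetical helpers, not part of any library.

```python
# Stopword list taken from the example above (an assumption for this sketch;
# real projects typically use a much larger list).
STOPWORDS = {"he", "she", "is", "a", "and", "are"}

sentences = [
    "He is a good boy",
    "She is a good girl",
    "Boy and girl are good",
]

def preprocess(sentence):
    """Lowercase the sentence, split on whitespace, and drop stopwords."""
    tokens = sentence.lower().split()
    return [token for token in tokens if token not in STOPWORDS]

cleaned = [preprocess(s) for s in sentences]

for name, tokens in zip(["S1", "S2", "S3"], cleaned):
    print(name, "->", " ".join(tokens))
# S1 -> good boy
# S2 -> good girl
# S3 -> boy girl good
```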
Step 2: Create Vocabulary (Unique Words)
Collect all unique words:
[good, boy, girl]
This is our vocabulary.
Vocabulary size = 3
Vocabulary = all different words present in the dataset.
These become the columns (features).
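As a rough sketch, the vocabulary can be collected by walking through the cleaned sentences and keeping each word the first time it is seen. Keeping first-seen order is an assumption made here so that the result matches the order used in this example; other orderings (for instance alphabetical) are equally valid as long as they stay fixed.

```python
# Cleaned sentences from Step 1 (stopwords already removed).
cleaned = [
    ["good", "boy"],
    ["good", "girl"],
    ["boy", "girl", "good"],
]

# Collect unique words, keeping the order in which they first appear.
vocabulary = []
for tokens in cleaned:
    for token in tokens:
        if token not in vocabulary:
            vocabulary.append(token)

print(vocabulary)       # ['good', 'boy', 'girl']
print(len(vocabulary))  # 3
```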
Step 3: Count Overall Word Frequency
Now we count how many times each word appears in the entire dataset:
| Word | Frequency |
| good | 3 |
| boy | 2 |
| girl | 2 |
“good” appears in S1, S2, S3 → 3 times
“boy” appears in S1, S3 → 2 times
“girl” appears in S2, S3 → 2 times
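The dataset-wide counts in the table can be reproduced with `collections.Counter` from the Python standard library. This is only a sketch over the three cleaned sentences; the variable name `overall` is chosen for illustration.

```python
from collections import Counter

# Cleaned sentences from Step 1 (stopwords already removed).
cleaned = [
    ["good", "boy"],
    ["good", "girl"],
    ["boy", "girl", "good"],
]

# Count every token across the whole dataset.
overall = Counter(token for tokens in cleaned for token in tokens)

for word in ["good", "boy", "girl"]:
    print(word, overall[word])
# good 3
# boy 2
# girl 2
```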
Step 4: Create BoW Vectors (Normal BoW)
Now we count how many times each word appears in each sentence.
Bag of Words = Count the words, ignore the order.
Sentence → Word counts → Vector
Vocabulary order:
good | boy | girl
S1: “good boy”
good = 1
boy = 1
girl = 0
Vector:
[1 1 0]
S2: “good girl”
good = 1
boy = 0
girl = 1
Vector:
[1 0 1]
S3: “boy girl good”
good = 1
boy = 1
girl = 1
Vector:
[1 1 1]
Final Bag of Words Matrix
| Sentence | good | boy | girl |
| S1 | 1 | 1 | 0 |
| S2 | 1 | 0 | 1 |
| S3 | 1 | 1 | 1 |
Each row = one sentence
Each column = one word
Values = word count
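The matrix above can be built with a short function that counts each vocabulary word in a sentence. This is a minimal sketch, assuming the vocabulary order good, boy, girl used in this example; `bow_vector` is a hypothetical helper name.

```python
vocabulary = ["good", "boy", "girl"]

# Cleaned sentences from Step 1 (stopwords already removed).
cleaned = [
    ["good", "boy"],
    ["good", "girl"],
    ["boy", "girl", "good"],
]

def bow_vector(tokens, vocabulary):
    """Return one count per vocabulary word, in vocabulary order."""
    return [tokens.count(word) for word in vocabulary]

# One row per sentence, one column per vocabulary word.
matrix = [bow_vector(tokens, vocabulary) for tokens in cleaned]

for name, row in zip(["S1", "S2", "S3"], matrix):
    print(name, row)
# S1 [1, 1, 0]
# S2 [1, 0, 1]
# S3 [1, 1, 1]
```

Word order inside a sentence has no effect: "boy girl good" and "good girl boy" produce the same vector.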
Binary Bag of Words (Binary BoW)
Instead of counting frequency, we use:
- 1 → word present
- 0 → word absent
So we don’t care how many times the word appears, only whether it appears.
Example (Binary BoW):
S1 → “good boy”
[1 1 0]
S2 → “good girl”
[1 0 1]
S3 → “boy girl good”
[1 1 1]
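Even if “good” appeared many times in a sentence, its value would still be just 1.
In this dataset no word repeats within a sentence, so the binary vectors are the same as the count vectors.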
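A binary version can be sketched by testing membership instead of counting. As before, the vocabulary order good, boy, girl is assumed, and `binary_bow_vector` is a hypothetical helper name.

```python
vocabulary = ["good", "boy", "girl"]

# Cleaned sentences from Step 1 (stopwords already removed).
cleaned = [
    ["good", "boy"],
    ["good", "girl"],
    ["boy", "girl", "good"],
]

def binary_bow_vector(tokens, vocabulary):
    """Return 1 if the vocabulary word occurs in the sentence, else 0."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]

binary_matrix = [binary_bow_vector(tokens, vocabulary) for tokens in cleaned]

for name, row in zip(["S1", "S2", "S3"], binary_matrix):
    print(name, row)
# S1 [1, 1, 0]
# S2 [1, 0, 1]
# S3 [1, 1, 1]
```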
Difference Between Normal BoW and Binary BoW
| Type | What it stores |
| Normal BoW | Word counts (frequency) |
| Binary BoW | Only presence (1/0) |
- Binary BoW → Is the word there?
- Normal BoW → How many times is the word there?
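The difference only becomes visible when a word repeats within a sentence. The sketch below uses a made-up sentence, "good good boy", which is not part of the dataset above and is included purely for illustration.

```python
vocabulary = ["good", "boy", "girl"]

# Hypothetical sentence (not in the original dataset) with a repeated word.
tokens = ["good", "good", "boy"]

# Normal BoW: how many times does each vocabulary word occur?
counts = [tokens.count(word) for word in vocabulary]

# Binary BoW: cap every count at 1, so only presence is kept.
binary = [min(count, 1) for count in counts]

print("Normal BoW:", counts)  # Normal BoW: [2, 1, 0]
print("Binary BoW:", binary)  # Binary BoW: [1, 1, 0]
```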
