How Bag of Words (BoW) Works in NLP (Step-by-Step Example)

What is Bag of Words?

Bag of Words is a text representation technique that converts sentences into numerical vectors based on word frequency.

Consider the following example.

Original Sentences (Dataset)

S1: He is a good boy
S2: She is a good girl
S3: Boy and girl are good

Step 1: Text Processing (Cleaning)

We:

  • convert to lowercase
  • remove stopwords like he, she, is, a, and, are

After preprocessing:

S1 → good boy
S2 → good girl
S3 → boy girl good
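The cleaning step above can be sketched in plain Python. Note that the stopword set here is just the hand-picked list from this example, not a full stopword list such as NLTK's:

```python
# Hand-picked stopwords from the example above (not a complete list).
stopwords = {"he", "she", "is", "a", "and", "are"}

def preprocess(sentence):
    # Lowercase, split on whitespace, and drop stopwords.
    return [w for w in sentence.lower().split() if w not in stopwords]

sentences = ["He is a good boy", "She is a good girl", "Boy and girl are good"]
cleaned = [preprocess(s) for s in sentences]
print(cleaned)  # [['good', 'boy'], ['good', 'girl'], ['boy', 'girl', 'good']]
```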

Step 2: Create Vocabulary (Unique Words)

Collect all unique words:

[good, boy, girl]

This is our vocabulary.

Vocabulary size = 3

These become our columns (features).

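Building the vocabulary is a matter of collecting unique words; this sketch keeps them in order of first appearance:

```python
# Cleaned sentences from Step 1.
cleaned = [["good", "boy"], ["good", "girl"], ["boy", "girl", "good"]]

# Collect unique words, preserving order of first appearance.
vocabulary = []
for tokens in cleaned:
    for w in tokens:
        if w not in vocabulary:
            vocabulary.append(w)

print(vocabulary)       # ['good', 'boy', 'girl']
print(len(vocabulary))  # 3
```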

Step 3: Count Overall Word Frequency

Now we count how many times each word appears in the entire dataset:

Word | Frequency
good | 3
boy  | 2
girl | 2

“good” appears in S1, S2, S3 → 3 times

“boy” appears in S1, S3 → 2 times

“girl” appears in S2, S3 → 2 times
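The overall frequencies can be computed with `collections.Counter` from the Python standard library:

```python
from collections import Counter

cleaned = [["good", "boy"], ["good", "girl"], ["boy", "girl", "good"]]

# Flatten all sentences and count every word across the dataset.
overall = Counter(w for tokens in cleaned for w in tokens)
print(overall)  # Counter({'good': 3, 'boy': 2, 'girl': 2})
```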

Step 4: Create BoW Vectors (Normal BoW)

Now we count how many times each word appears in each sentence.

Bag of Words = Count the words, ignore the order.

Sentence → Word counts → Vector

Vocabulary order:

good | boy | girl

S1: “good boy”

good = 1

boy  = 1

girl = 0

Vector:

[1 1 0]


S2: “good girl”

good = 1

boy  = 0

girl = 1

Vector:

[1 0 1]


S3: “boy girl good”

good = 1

boy  = 1

girl = 1

Vector:

[1 1 1]

Final Bag of Words Matrix

Sentence | good | boy | girl
S1       |  1   |  1  |  0
S2       |  1   |  0  |  1
S3       |  1   |  1  |  1

Each row = one sentence
Each column = one word
Values = word count
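The whole matrix can be produced in a couple of lines: for each sentence, count each vocabulary word in vocabulary order.

```python
cleaned = [["good", "boy"], ["good", "girl"], ["boy", "girl", "good"]]
vocabulary = ["good", "boy", "girl"]

# One count vector per sentence; columns follow vocabulary order.
vectors = [[tokens.count(w) for w in vocabulary] for tokens in cleaned]
print(vectors)  # [[1, 1, 0], [1, 0, 1], [1, 1, 1]]
```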

Binary Bag of Words (Binary BoW)

Instead of counting frequency, we use:

  • 1 → word present
  • 0 → word absent

So we don’t care how many times the word appears — only whether it appears.

Example (Binary BoW):

S1 → “good boy”

[1 1 0]

S2 → “good girl”

[1 0 1]

S3 → “boy girl good”

[1 1 1]

Even if “good” appears many times, it is still just 1.
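A binary version of the same sketch replaces the count with a presence check:

```python
cleaned = [["good", "boy"], ["good", "girl"], ["boy", "girl", "good"]]
vocabulary = ["good", "boy", "girl"]

# 1 if the word occurs at all, 0 otherwise -- counts are ignored.
binary_vectors = [[1 if w in tokens else 0 for w in vocabulary]
                  for tokens in cleaned]
print(binary_vectors)  # [[1, 1, 0], [1, 0, 1], [1, 1, 1]]
```

For this dataset the binary vectors happen to equal the count vectors, because no word repeats within a sentence.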

Difference Between Normal BoW and Binary BoW

Type       | What it stores
Normal BoW | Word counts (frequency)
Binary BoW | Only presence (1/0)
  • Binary BoW → Is the word there?
  • Normal BoW → How many times is the word there?
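The difference only becomes visible when a word repeats within a sentence. Using the hypothetical sentence "good good boy" (not part of the original dataset):

```python
tokens = ["good", "good", "boy"]        # hypothetical sentence with a repeat
vocabulary = ["good", "boy", "girl"]

counts = [tokens.count(w) for w in vocabulary]          # normal BoW
binary = [1 if w in tokens else 0 for w in vocabulary]  # binary BoW
print(counts)  # [2, 1, 0]
print(binary)  # [1, 1, 0]
```

Normal BoW records that "good" appeared twice; binary BoW only records that it appeared.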