Let’s take one small example.
Example Text:
“I love AI. AI is amazing.”
We will use this to understand everything.
- Corpus
Meaning:
Corpus = Collection of all text data
In simple words:
A corpus is the full set of paragraphs / documents we give to the computer.
Example:
If you have 100 movie reviews, all together they form a corpus.
From our example:
“I love AI. AI is amazing.”
This full text = Corpus
- Documents (or Sentences)
Meaning:
A document is each individual sentence or review inside the corpus.
In simple words:
Document = One data item.
From our example:
- Document 1: I love AI
- Document 2: AI is amazing
Each sentence is one document.
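To make this concrete, here is a minimal Python sketch of breaking the example corpus into documents. It assumes every sentence ends with a period, which is true for our example but not for real text (abbreviations, question marks, and so on), so treat it as an illustration only. The variable names `corpus` and `documents` are just for this example.

```python
# The whole text is the corpus.
corpus = "I love AI. AI is amazing."

# Naive split on periods: fine for this toy example,
# but real text needs a proper sentence splitter.
documents = [part.strip() for part in corpus.split(".") if part.strip()]

for number, document in enumerate(documents, start=1):
    print(f"Document {number}: {document}")
```

This prints the same two documents listed above.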
- Vocabulary
Meaning:
All unique words present in the corpus.
From example:
Text:
I love AI. AI is amazing.
Words:
I, love, AI, is, amazing
Even though AI appears twice, we count it only once.
So vocabulary =
{I, love, AI, is, amazing}
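A small Python sketch of building this vocabulary is shown here. It assumes words are simply runs of letters or digits and that matching is case-sensitive (so "AI" and "ai" would count as different words); both are simplifications chosen for this example.

```python
import re

corpus = "I love AI. AI is amazing."

# Pull out every word, ignoring punctuation.
words = re.findall(r"\w+", corpus)

# dict.fromkeys keeps only the first occurrence of each word
# and preserves the order in which words first appear.
vocabulary = list(dict.fromkeys(words))

print(words)       # all words, with "AI" repeated
print(vocabulary)  # unique words only
```

A plain `set(words)` would also give the unique words, but without a guaranteed order.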
- Words (Tokens)
Meaning:
Individual words after splitting sentences.
From:
“I love AI”
Words are:
👉 I
👉 love
👉 AI
These smallest units are called tokens.
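As a rough sketch, splitting on spaces is the simplest way to get these tokens in Python. It assumes words are separated by whitespace and carry no punctuation, which holds for this one document; real tokenizers handle many more cases.

```python
document = "I love AI"

# str.split() with no arguments splits on any run of whitespace.
tokens = document.split()

for token in tokens:
    print(token)
```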
Simple Summary
| Term | Meaning |
| --- | --- |
| Corpus | Entire text collection |
| Document | One sentence/review |
| Vocabulary | Unique words |
| Words | Individual tokens |
In short
Corpus contains Documents.
Documents contain Words.
Vocabulary is the set of unique Words.
