Basic Terminologies in NLP

Let’s start with one small example.

Example Text:

“I love AI. AI is amazing.”

We will use this text to illustrate every term below.

  1. Corpus

Meaning:

Corpus = Collection of all text data

In simple words:

A corpus is the full set of paragraphs / documents we give to the computer.

Example:

If you have 100 movie reviews, all together they form a corpus.

From our example:

“I love AI. AI is amazing.”

This full text = Corpus

  2. Documents (or Sentences)

Meaning:

Each individual sentence or review inside the corpus.

In simple words:

Document = One data item.

From our example:

  • Document 1: I love AI
  • Document 2: AI is amazing

Each sentence is one document.
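
Here is a small Python sketch of how the corpus can be split into documents. It assumes that every sentence ends with a period, which is true for our example but not for real text. The function name split_into_documents is made up for this illustration.

```python
# The example corpus, stored as one Python string.
corpus = "I love AI. AI is amazing."

def split_into_documents(text):
    """Naive sentence splitter: cut on periods and drop empty pieces.

    Assumption: a period always marks the end of a sentence.
    """
    return [piece.strip() for piece in text.split(".") if piece.strip()]

documents = split_into_documents(corpus)
print(documents)  # ['I love AI', 'AI is amazing']
```

Real text has abbreviations, question marks and so on, so a proper sentence splitter from an NLP library is usually a better choice there.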

  3. Vocabulary

Meaning:

All unique words present in the corpus.

From our example:

Text:

I love AI. AI is amazing.

Words:

I, love, AI, AI, is, amazing

Even though AI appears twice, we count it only once.

So vocabulary =

{I, love, AI, is, amazing}
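
The same idea can be sketched in Python with a set, which keeps each word only once. This sketch assumes the documents are already split and that words are separated by spaces. The function name build_vocabulary is made up for this illustration.

```python
# The two documents from our example.
documents = ["I love AI", "AI is amazing"]

def build_vocabulary(docs):
    """Collect every unique word across all documents."""
    vocabulary = set()
    for doc in docs:
        # split() cuts on whitespace; the set ignores repeated words
        vocabulary.update(doc.split())
    return vocabulary

vocabulary = build_vocabulary(documents)
print(sorted(vocabulary))  # ['AI', 'I', 'amazing', 'is', 'love']
print(len(vocabulary))     # 5
```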

  4. Words (Tokens)

Meaning:

Individual words after splitting sentences.

From:

“I love AI”

Words are:

👉 I
👉 love
👉 AI

These smallest units are called tokens.
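
A minimal Python sketch of this splitting step is shown here. It assumes that tokens are separated by spaces, which is the simplest possible rule. The function name tokenize is made up for this illustration.

```python
def tokenize(document):
    """Split one document into word tokens on whitespace."""
    return document.split()

tokens = tokenize("I love AI")
print(tokens)  # ['I', 'love', 'AI']
```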

Simple Summary

Term | Meaning
--- | ---
Corpus | Entire text collection
Document | One sentence/review
Vocabulary | Unique words
Words | Individual tokens

In short

Corpus contains Documents.
Documents contain Words.
Vocabulary is the set of unique Words.
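
To see how the four terms fit together, here is a short end-to-end Python sketch on our example text. It uses the same simple assumptions as before (sentences end with a period, words are separated by spaces), and all variable names are chosen only for this illustration.

```python
corpus = "I love AI. AI is amazing."

# Corpus -> Documents: cut on periods, drop empty pieces
documents = [d.strip() for d in corpus.split(".") if d.strip()]

# Documents -> Words (tokens): cut each document on whitespace
tokens_per_document = [d.split() for d in documents]

# Flatten into one list of all tokens, then keep only the unique ones
all_tokens = [t for doc in tokens_per_document for t in doc]
vocabulary = set(all_tokens)

print(len(documents))   # 2 documents
print(len(all_tokens))  # 6 tokens (AI is counted twice)
print(len(vocabulary))  # 5 unique words
```

The corpus has 6 tokens but only 5 vocabulary entries, because AI appears twice.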