Text Processing in NLP

Text Processing (also called Text Preprocessing) is the first and most important step in Natural Language Processing (NLP).

Text Processing is the process of cleaning and preparing raw text so that computers can understand and analyze it.

Human language contains noise such as symbols, spelling mistakes, emojis, and extra words.
Machines cannot understand this directly — so we clean it step by step.

Why Text Processing is Needed

Human text contains:

  • Capital letters
  • Symbols (!, ?, #)
  • Extra words (is, am, the…)
  • Different word forms (playing, played)

Consider this sentence:

“I’m VERY happy today!!! 😄”

Humans understand emotion instantly.
Computers see only characters.

Complete Text Processing Pipeline

Text Cleaning

What happens here?

We remove:

  • Punctuation
  • Numbers
  • Special characters
  • HTML tags

Everything is converted to lowercase.


Example:

Original:

I’m VERY happy today!!! 😄

After cleaning:

im very happy today

This cleaned corpus is now ready for tokenization.

In short, Text Cleaning removes noise from raw text.
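As a minimal sketch, the cleaning step above can be done with Python's built-in `re` module (the exact regex patterns here are illustrative assumptions, not a standard recipe):

```python
import re

def clean_text(text: str) -> str:
    """Lowercase the text and strip punctuation, numbers, emojis, and HTML tags."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)      # drop HTML tags
    text = re.sub(r"[^a-z\s]", "", text)      # keep only letters and whitespace
    return re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace

print(clean_text("I'm VERY happy today!!! 😄"))  # im very happy today
```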

Tokenization

Tokenization = splitting sentences into individual words

Example:

im very happy today

Becomes:

[“im”, “very”, “happy”, “today”]

Each word is called a token.

Tokenization breaks text into words.
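A minimal tokenizer can be sketched with `re.findall` (real projects often use a library tokenizer such as NLTK's `word_tokenize`; this only shows the idea):

```python
import re

def tokenize(text: str) -> list[str]:
    # grab runs of word characters; a library tokenizer handles punctuation more carefully
    return re.findall(r"\w+", text.lower())

print(tokenize("im very happy today"))  # ['im', 'very', 'happy', 'today']
```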

Stop Words Removal

Some words appear very frequently but add little meaning:

  • is
  • am
  • the
  • and
  • very

These are called stop words.

Example:

[“im”, “very”, “happy”, “today”]

After removing stopwords:

[“happy”, “today”]

Stop word removal keeps only meaningful words.
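Stop-word removal is a simple set lookup. The set below is a tiny hand-picked assumption; real lists (for example NLTK's English stop-word corpus) contain well over a hundred entries:

```python
# Tiny illustrative stop-word set; real lists are much longer.
STOP_WORDS = {"is", "am", "the", "and", "very", "im"}

def remove_stop_words(tokens: list[str]) -> list[str]:
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stop_words(["im", "very", "happy", "today"]))  # ['happy', 'today']
```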

Stemming and Lemmatization

Now we reduce words to their base form.

In NLP, the same word can appear in many forms:

  • play, playing, played
  • history, historical
  • final, finally, finalized

If we treat all of these as different words, the model counts related forms as separate, unrelated tokens.

So we convert them to a common base form.

That process is done using:

✅ Stemming
✅ Lemmatization

Both aim to reduce words, but they work differently.

Stemming

Stemming is the process of cutting off word endings to get a root form (stem).

It does NOT care if the result is a real English word.

It simply chops letters.

Original Word   Stem
history         histori
historical      histori
finally         final
playing         play
loved           love

The output may not be a real English word. Stemming is fast and simple, but its output may be incorrect English.
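The suffix-chopping idea can be sketched in a few lines. This toy stemmer is purely illustrative; real systems use a proper algorithm such as the Porter stemmer (e.g. NLTK's `PorterStemmer`):

```python
# Suffixes checked longest-first; the 3-letter minimum stem is an arbitrary toy rule.
SUFFIXES = ("ical", "ing", "ly", "ed", "s")

def simple_stem(word: str) -> str:
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]  # chop the suffix, real word or not
    return word

print(simple_stem("playing"))  # play
print(simple_stem("finally"))  # final
```

Note that the same rule turns "loved" into "lov", which is exactly the kind of non-word output stemming can produce.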

Lemmatization

Lemmatization converts words into their meaningful base form (called lemma).

It always gives a proper dictionary word.

It understands grammar and meaning.

Original Word   Lemma
history         history
historical      history
finally         final
finalized       finalize
better          good
running         run

Both stemming and lemmatization convert words to a root form, but lemmatization is more accurate: all of its results are real English words.

So, Lemmatization is accurate and meaningful, but slower than stemming.
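Real lemmatizers (such as NLTK's `WordNetLemmatizer`) consult a full dictionary plus part-of-speech information. As a minimal sketch of the lookup idea, with an illustrative toy table:

```python
# Toy lemma dictionary; a real lemmatizer consults a full lexicon like WordNet.
LEMMAS = {"better": "good", "running": "run", "historical": "history"}

def lemmatize(word: str) -> str:
    return LEMMAS.get(word, word)  # fall back to the word itself if unknown

print(lemmatize("better"))   # good
print(lemmatize("running"))  # run
```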

Main Difference

Feature        Stemming                 Lemmatization
Method         Cuts word endings        Uses a dictionary
Result         May not be a real word   Always a real word
Accuracy       Low                      High
Speed          Fast                     Slower
Intelligence   Mechanical               Meaning-based
Example        history → histori        history → history

Simple Example

Sentence:

Finally the historical movie was better.

After stemming:

final histori movi better

After lemmatization:

final history movie good

Handling Contractions

People write:

  • don’t
  • can’t
  • I’m

These are contractions.

We expand them:

  • don’t → do not
  • I’m → I am

Example:

im happy

Becomes:

i am happy

Contraction handling makes text clear and complete.
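Contraction expansion is usually a lookup as well. The map below is a tiny illustrative subset (third-party packages ship much fuller lists):

```python
# Small illustrative contraction map.
CONTRACTIONS = {"i'm": "i am", "im": "i am", "don't": "do not", "can't": "cannot"}

def expand_contractions(text: str) -> str:
    return " ".join(CONTRACTIONS.get(word, word) for word in text.lower().split())

print(expand_contractions("im happy"))  # i am happy
```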

Handling Emojis and Emoticons

Text may contain emojis:

😄 ❤️ 😢

We convert them into words.

Example:

happy 😄

Becomes:

happy :grinning_face:

Emojis are converted into readable text.
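A sketch of the idea with a tiny hand-made map (the third-party `emoji` package's `demojize()` function does this for the full emoji set):

```python
# Tiny illustrative emoji map.
EMOJI_MAP = {"😄": ":grinning_face:", "😢": ":crying_face:", "❤️": ":red_heart:"}

def demojize(text: str) -> str:
    for symbol, name in EMOJI_MAP.items():
        text = text.replace(symbol, name)
    return text

print(demojize("happy 😄"))  # happy :grinning_face:
```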

Spell Checking

People make spelling mistakes:

  • gud → good
  • happi → happy

Spell checking fixes this.

Example:

[“happi”, “tody”]

Becomes:

[“happy”, “today”]

Spell checking improves text quality.
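One simple approach is fuzzy matching against a known vocabulary; Python's standard-library `difflib` is enough for a sketch. The vocabulary here is an illustrative assumption; real spell checkers use large word lists and edit-distance models:

```python
import difflib

VOCAB = ["happy", "today", "good", "movie"]  # tiny illustrative vocabulary

def correct(word: str) -> str:
    matches = difflib.get_close_matches(word, VOCAB, n=1, cutoff=0.6)
    return matches[0] if matches else word  # keep the word if nothing is close

print([correct(w) for w in ["happi", "tody"]])  # ['happy', 'today']
```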

Final Output After All Steps

Original:

I’m VERY happy today!!! 😄

After full processing:

[“happy”, “today”]

Now this clean data is ready for:

  • Word Embeddings
  • Machine Learning
  • Sentiment Analysis
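The steps above can be combined into a single end-to-end sketch. The stop-word set and contraction map are the same illustrative assumptions used earlier, not standard lists:

```python
import re

STOP_WORDS = {"i", "am", "is", "the", "and", "very"}  # tiny illustrative set
CONTRACTIONS = {"im": "i am"}                         # tiny illustrative map

def preprocess(text: str) -> list[str]:
    text = re.sub(r"[^a-z\s']", " ", text.lower())  # clean: lowercase, drop noise
    tokens = []
    for word in text.split():                       # tokenize on whitespace
        word = word.replace("'", "")
        tokens.extend(CONTRACTIONS.get(word, word).split())  # expand contractions
    return [t for t in tokens if t not in STOP_WORDS]        # drop stop words

print(preprocess("I'm VERY happy today!!! 😄"))  # ['happy', 'today']
```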