Introduction to NLP

Natural Language Processing (NLP) is a field of artificial intelligence that focuses on the interaction between computers and humans through natural language. It involves developing algorithms and models that enable machines to understand, interpret, generate, and respond to human language. For example, every time you ask Siri a question, autocomplete suggests your next word, or ChatGPT drafts an email for you, NLP is working behind the scenes.

In this blog post, we’ll cover the basics of the following topics:

  • Segmentation
  • Tokenization
  • Stemming
  • Lemmatization
  • Part-of-Speech (POS) tagging
  • Named Entity Recognition (NER)

Segmentation

Segmentation is the process of dividing text into meaningful units such as sentences or paragraphs. It’s typically one of the first steps in an NLP pipeline, breaking down large blocks of text into smaller, manageable pieces that can be analyzed individually.

How Does It Work?

Sentence segmentation typically looks for punctuation marks like periods (.), question marks (?), and exclamation points (!). However, it’s not as simple as just splitting on every period:

Example:

"Dr. Smith works at U.S. Tech Inc. He earns $50,000 annually. Does he enjoy his job?"

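A naive approach that splits on every period would incorrectly break after “Dr.” and inside “U.S.”, even though only the period after “Inc.” actually ends the first sentence. Modern segmentation algorithms are designed to recognize: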

  • Abbreviations (Dr., Inc., U.S.)
  • Decimals in numbers (3.14)
  • Ellipses (…)
  • The difference between a period ending a sentence vs. other uses
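
To make the idea concrete, here is a minimal rule-based sketch in Python. It is not how production segmenters work (those typically rely on much larger abbreviation lists or trained models). The names `ABBREVIATIONS`, `BOUNDARY`, and `split_sentences` are hypothetical, and the abbreviation list is a small hand-picked assumption.

```python
import re

# Assumption: a tiny hand-picked abbreviation list, for illustration only.
# "inc." is left out on purpose because in the example it ends a sentence;
# real segmenters resolve that ambiguity with larger lists or trained models.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "u.s.", "e.g.", "i.e."}

# Candidate boundary: ., ? or ! followed by whitespace and an uppercase letter.
# A decimal such as 3.14 never matches because no whitespace follows its period.
BOUNDARY = re.compile(r"[.?!]\s+(?=[A-Z])")


def split_sentences(text):
    """Split text into sentences, skipping periods that belong to known abbreviations."""
    sentences = []
    start = 0
    for match in BOUNDARY.finditer(text):
        candidate = text[start:match.start() + 1]  # include the punctuation mark
        last_word = candidate.split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # not a real boundary, keep scanning
        sentences.append(candidate.strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


text = "Dr. Smith works at U.S. Tech Inc. He earns $50,000 annually. Does he enjoy his job?"
for sentence in split_sentences(text):
    print(sentence)
```

On the example sentence this sketch yields three sentences, keeping “Dr.” and “U.S.” intact. The weak spot is visible in the comments: whether “Inc.” ends a sentence depends on context, which a fixed list cannot decide.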

Tokenization

Tokenization is the process of breaking down text into smaller units called “tokens.” These tokens are typically words, but can also be subwords, characters, or phrases depending on the approach. Think of it as chopping a sentence into its individual building blocks.

Why is Tokenization Important?

Computers can’t understand text as humans do. Tokenization converts text into discrete units that machines can process and analyze. It’s a fundamental step that enables:

  • Counting word frequencies
  • Building vocabularies for machine learning models
  • Analyzing sentence structure
  • Feeding data into NLP models

Example

Don't forget: meet me at 3:30 p.m. tomorrow!

Simple split (on whitespace):

["Don't", "forget:", "meet", "me", "at", "3:30", "p.m.", "tomorrow!"]

Proper tokenization (Penn Treebank style):

["Do", "n't", "forget", ":", "meet", "me", "at", "3:30", "p.m.", "tomorrow", "!"]

NOTE

Don’t confuse segmentation with tokenization. Segmentation breaks text into sentences or paragraphs (larger units), while tokenization breaks text into words or subwords (smaller units).
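
The two tokenizations in the example above can be reproduced with a short Python sketch. The regular expression below is an assumption written for this one sentence style, not the rule set of any particular library. `TOKEN_PATTERN`, `simple_split`, and `tokenize` are hypothetical names.

```python
import re

# Alternatives are tried left to right at each position in the text.
TOKEN_PATTERN = re.compile(
    r"""
    n't                   # negative contraction suffix (the n't in Don't)
  | \d+(?::\d+)+          # clock times such as 3:30
  | (?:[A-Za-z]\.){2,}    # dotted abbreviations such as p.m. or U.S.
  | [A-Za-z]+?(?=n't)     # the part of a word directly before n't
  | [A-Za-z]+             # ordinary words
  | \d+                   # plain numbers
  | [^\w\s]               # any single punctuation mark
    """,
    re.VERBOSE,
)


def simple_split(text):
    """Naive approach: split on whitespace only, punctuation stays attached."""
    return text.split()


def tokenize(text):
    """Rule-based approach: separate punctuation and contractions into their own tokens."""
    return TOKEN_PATTERN.findall(text)


text = "Don't forget: meet me at 3:30 p.m. tomorrow!"
print(simple_split(text))
print(tokenize(text))
```

The order of the alternatives matters: the time and abbreviation patterns come before the generic word and punctuation patterns so that “3:30” and “p.m.” are not broken apart.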

Stemming

Stemming is the process of reducing words to a root or base form by stripping affixes, most commonly suffixes. The goal is to group together different variations of a word so they can be treated as the same term. The resulting “stem” may not always be a valid dictionary word.

Why is Stemming Important?

In natural language, words appear in many different forms:

  • “run”, “running”, “runs”, “ran”
  • “connect”, “connected”, “connecting”, “connection”, “connections”

For many NLP tasks, such as search or text classification, these variations should be treated as the same concept. Stemming helps by:

  • Reducing vocabulary size
  • Improving search recall (finding more relevant results)
  • Normalizing text for analysis
  • Reducing computational complexity

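Example (Porter Stemmer)

running -> run
runner -> runner
runs -> run
fairly -> fairli
fairness -> fair
connection -> connect
connected -> connect
connecting -> connect
studies -> studi
studying -> studi

NOTE

“studies” becomes “studi” (not a real word!), which is normal in stemming. The Porter stemmer also leaves “runner” unchanged and turns “fairly” into “fairli”, so related words are not always merged into one stem.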
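
To show the suffix-stripping idea in miniature, here is a toy stemmer in Python. It is deliberately not the Porter algorithm, which has many more rules and conditions. The rule list, the length threshold, and the name `toy_stem` are all assumptions made for illustration.

```python
# Assumption: a handful of suffix rules, checked in order (longer suffixes first).
SUFFIX_RULES = [
    ("ions", ""),
    ("ion", ""),
    ("ies", "i"),
    ("ing", ""),
    ("ed", ""),
    ("s", ""),
]

# Letters that are left alone when they appear doubled at the end of a stem.
KEEP_DOUBLED = set("aeioulsz")


def toy_stem(word):
    """Strip one suffix, then apply two small clean-up rules."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        stem = word[: len(word) - len(suffix)]
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(stem) >= 3:
            word = stem + replacement
            break
    # Undo consonant doubling: "runn" -> "run".
    if len(word) > 3 and word[-1] == word[-2] and word[-1] not in KEEP_DOUBLED:
        word = word[:-1]
    # Map a final "y" to "i" so that "study" and "studi" end up identical.
    if len(word) > 2 and word.endswith("y"):
        word = word[:-1] + "i"
    return word


for w in ["connected", "connecting", "connections", "running", "runs", "ran", "studies", "studying"]:
    print(w, "->", toy_stem(w))
```

Even this small rule set groups the “connect” family together, while “ran” is left untouched because no suffix rule can relate it to “run”. Irregular forms like that are where lemmatization helps.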

Lemmatization

Lemmatization is the process of reducing words to their base or dictionary form, called a “lemma.” Unlike stemming, lemmatization uses vocabulary and morphological analysis to return actual, valid words. It considers the context and part of speech to determine the correct base form.

Why is Lemmatization Important?

Lemmatization provides more accurate text normalization than stemming by:

  • Returning real dictionary words (lemmas)
  • Considering grammatical context
  • Preserving semantic meaning
  • Producing more interpretable results

Stemming vs. lemmatization on sample words:

Word            Stemming        Lemmatization
-----------------------------------------------
studies         studi           study
studying        studi           study
better          better          good
worse           wors            bad
running         run             run
ran             ran             run
is              is              be
are             are             be
am              am              be
caring          care            care
cares           care            care

Key Differences:

Aspect      Stemming                      Lemmatization
---------------------------------------------------------------------
Output      May not be a real word        Always a real word
Method      Rule-based (chops affixes)    Dictionary + grammar analysis
Speed       Faster                        Slower
Accuracy    Lower                         Higher
Context     Ignores context               Considers part of speech
Example     “studies” → “studi”           “studies” → “study”
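
The “dictionary + grammar analysis” idea can be sketched in a few lines of Python. This is a toy under stated assumptions: the mini-lexicon, the part-of-speech labels (`"VERB"`, `"NOUN"`, `"ADJ"`), and the name `toy_lemmatize` are hypothetical, and real lemmatizers use full dictionaries such as WordNet.

```python
# Assumption: a tiny table of irregular forms, keyed by (word, part of speech).
IRREGULAR = {
    ("better", "ADJ"): "good",
    ("worse", "ADJ"): "bad",
    ("ran", "VERB"): "run",
    ("is", "VERB"): "be",
    ("are", "VERB"): "be",
    ("am", "VERB"): "be",
}

# Assumption: the only dictionary forms this sketch knows about.
KNOWN_LEMMAS = {"study", "run", "care", "meet", "good", "bad", "be"}


def toy_lemmatize(word, pos):
    """Return a dictionary form for word, using its part of speech as context."""
    word = word.lower()
    if (word, pos) in IRREGULAR:
        return IRREGULAR[(word, pos)]
    if pos == "VERB":
        candidates = []
        if word.endswith("ies"):
            candidates.append(word[:-3] + "y")       # studies -> study
        if word.endswith("ing"):
            base = word[:-3]
            candidates += [base, base + "e"]         # meeting -> meet, caring -> care
            if len(base) >= 2 and base[-1] == base[-2]:
                candidates.append(base[:-1])         # running -> run
        if word.endswith("s"):
            candidates.append(word[:-1])             # cares -> care
        # Accept a candidate only if it is a known dictionary word.
        for candidate in candidates:
            if candidate in KNOWN_LEMMAS:
                return candidate
    return word


print(toy_lemmatize("studies", "VERB"))
print(toy_lemmatize("better", "ADJ"))
print(toy_lemmatize("meeting", "VERB"))
print(toy_lemmatize("meeting", "NOUN"))
```

The last two calls show why part of speech matters: “meeting” as a verb reduces to “meet”, while “meeting” as a noun is already its own lemma.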

Part-of-Speech (POS) Tagging

Part-of-Speech (POS) tagging is the process of assigning grammatical categories (like noun, verb, adjective, etc.) to each word in a sentence. It identifies the role each word plays based on its definition and context within the sentence.

Why is POS Tagging Important?

POS tags provide crucial grammatical information that helps machines understand:

  • The syntactic structure of sentences
  • Word relationships and dependencies
  • Disambiguation of word meanings (e.g., “book” as noun vs. verb)
  • Context for downstream NLP tasks

Examples

Tag     Category      Examples
------------------------------------
NN      Noun          cat, book
VB      Verb          run, eat
JJ      Adjective     happy, blue
RB      Adverb        quickly, very
PRP     Pronoun       I, you, he
DT      Determiner    the, a, this
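
As a rough illustration of how context can disambiguate a word like “book”, here is a toy lexicon-based tagger in Python. Real taggers are statistical or neural and use a much richer tag set. The lexicon, the single context rule, and the name `toy_pos_tag` are assumptions, and only the six tags from the table are used.

```python
# Assumption: a tiny word-to-tag lexicon built from the examples in the table.
LEXICON = {
    "the": "DT", "a": "DT", "this": "DT",
    "i": "PRP", "you": "PRP", "he": "PRP",
    "cat": "NN", "flight": "NN",
    "run": "VB", "eat": "VB", "read": "VB",
    "happy": "JJ", "blue": "JJ",
    "quickly": "RB", "very": "RB",
}

# Words that can be a noun or a verb depending on context.
AMBIGUOUS = {"book"}


def toy_pos_tag(tokens):
    """Tag each token, using the previous tag to resolve ambiguous words."""
    tagged = []
    for token in tokens:
        word = token.lower()
        tag = LEXICON.get(word, "NN")  # unknown words default to noun
        if word in AMBIGUOUS:
            previous_tag = tagged[-1][1] if tagged else None
            # After a determiner or adjective, treat the word as a noun; otherwise as a verb.
            tag = "NN" if previous_tag in ("DT", "JJ") else "VB"
        tagged.append((token, tag))
    return tagged


print(toy_pos_tag(["I", "read", "the", "book"]))
print(toy_pos_tag(["You", "book", "a", "flight"]))
```

In the first sentence “book” follows “the” and is tagged NN; in the second it follows a pronoun and is tagged VB.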

Named Entity Recognition (NER)

Named Entity Recognition (NER) is the process of identifying and classifying named entities in text into predefined categories such as person names, organizations, locations, dates, quantities, and more.

Why is NER Important?

  • Extracts key information from unstructured text
  • Enables information retrieval and knowledge extraction
  • Powers search engines, chatbots, and recommendation systems
  • Essential for document summarization and question answering

Examples

Entity Type     Label      Examples
------------------------------------------------------------
Person          PERSON     “John Smith”, “Marie Curie”
Organization    ORG        “Google”, “United Nations”
Location        GPE/LOC    “Paris”, “Mount Everest”
Date            DATE       “January 1, 2025”, “yesterday”
Time            TIME       “3:30 PM”, “morning”
Money           MONEY      “$100”, “50 euros”
Percentage      PERCENT    “25%”, “half”
Product         PRODUCT    “iPhone”, “Tesla Model 3”
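
To give a feel for what an NER system outputs, here is a small rule-based sketch in Python that combines a name list (a gazetteer) with regular expressions. Modern NER relies on trained models rather than rules like these. The gazetteer entries, the patterns, and the name `toy_ner` are assumptions for illustration only.

```python
import re

# Assumption: a tiny gazetteer of known names and their labels.
GAZETTEER = {
    "Marie Curie": "PERSON",
    "Google": "ORG",
    "Paris": "GPE",
}

MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

# Assumption: simple patterns that cover only the formats used in the table.
PATTERNS = [
    ("MONEY", re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?")),
    ("PERCENT", re.compile(r"\d+(?:\.\d+)?%")),
    ("TIME", re.compile(r"\b\d{1,2}:\d{2}\s?(?:AM|PM)\b")),
    ("DATE", re.compile(r"\b(?:" + "|".join(MONTHS) + r") \d{1,2}, \d{4}\b")),
]


def toy_ner(text):
    """Return (entity text, label) pairs in the order they appear in the text."""
    found = []
    for name, label in GAZETTEER.items():
        for match in re.finditer(re.escape(name), text):
            found.append((match.start(), match.group(), label))
    for label, pattern in PATTERNS:
        for match in pattern.finditer(text):
            found.append((match.start(), match.group(), label))
    found.sort()  # order by position in the text
    return [(entity, label) for _, entity, label in found]


text = "Marie Curie joined Google in Paris on January 1, 2025 at 3:30 PM and paid $100 for a 25% stake."
for entity, label in toy_ner(text):
    print(label, "->", entity)
```

A rule-based approach like this breaks down quickly on unseen names or relative expressions such as “yesterday”, which is one reason trained models are the norm.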
