Natural Language Processing (NLP) is a field of artificial intelligence that focuses on the interaction between computers and humans through natural language. It involves developing algorithms and models that enable machines to understand, interpret, generate, and respond to human language. Every time you ask Siri a question, autocomplete suggests your next word, or ChatGPT drafts an email for you, NLP is working behind the scenes.
In this blog, we’ll cover the basics of the following topics:
- Segmentation
- Tokenization
- Stemming
- Lemmatization
- Part-of-Speech (POS) tagging
- Named Entity Recognition (NER)
Segmentation
Segmentation is the process of dividing text into meaningful units such as sentences or paragraphs. It’s the first step in the NLP pipeline, breaking down large blocks of text into smaller, manageable pieces that can be analyzed individually.
How Does It Work?
Sentence segmentation typically looks for punctuation marks like periods (.), question marks (?), and exclamation points (!). However, it’s not as simple as just splitting on every period:
Example:
"Dr. Smith works at U.S. Tech Inc. He earns $50,000 annually. Does he enjoy his job?"

A naive approach would incorrectly split at “Dr.” and “U.S.” Modern segmentation algorithms are smart enough to recognize:
- Abbreviations (Dr., Inc., U.S.)
- Decimals in numbers (3.14)
- Ellipses (…)
- The difference between a period ending a sentence vs. other uses
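The abbreviation-aware logic above can be sketched in a few lines of Python. This is a minimal illustration with a hand-picked abbreviation list; real segmenters (such as NLTK’s Punkt model) learn abbreviations from data and handle harder cases, like an abbreviation such as “Inc.” that really does end a sentence.

```python
import re

# Hand-picked abbreviation list (illustrative, not exhaustive).
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "inc.", "e.g.", "i.e."}

def segment(text):
    """Split text into sentences, skipping periods that end known abbreviations."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]", text):
        end = match.end()
        chunk = text[start:end]
        words = chunk.split()
        last_word = words[-1].lower() if words else ""
        if match.group() == "." and last_word in ABBREVIATIONS:
            continue  # abbreviation, not a sentence boundary
        sentences.append(chunk.strip())
        start = end
    rest = text[start:].strip()
    if rest:
        sentences.append(rest)
    return sentences

print(segment("Dr. Smith earns $50,000 annually. Does he enjoy his job?"))
# → ['Dr. Smith earns $50,000 annually.', 'Does he enjoy his job?']
```

Note that a naive split on every period would have produced a spurious break after “Dr.”; the abbreviation check is what prevents it.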
Tokenization
Tokenization is the process of breaking down text into smaller units called “tokens.” These tokens are typically words, but can also be subwords, characters, or phrases depending on the approach. Think of it as chopping a sentence into its individual building blocks.
Why is Tokenization Important?
Computers can’t understand text as humans do. Tokenization converts text into discrete units that machines can process and analyze. It’s a fundamental step that enables:
- Counting word frequencies
- Building vocabularies for machine learning models
- Analyzing sentence structure
- Feeding data into NLP models
Example
"Don't forget: meet me at 3:30 p.m. tomorrow!"

Simple split:
["Don't", "forget:", "meet", "me", "at", "3:30", "p.m.", "tomorrow!"]

Proper tokenization:
["Do", "n't", "forget", ":", "meet", "me", "at", "3:30", "p.m.", "tomorrow", "!"]

NOTE: Don’t confuse segmentation with tokenization. Segmentation breaks text into sentences or paragraphs (larger units), while tokenization breaks text into words or subwords (smaller units).
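A toy regex tokenizer can reproduce the proper tokenization above for this sentence. The pattern below is illustrative only; practical tokenizers (NLTK’s word_tokenize, spaCy) apply many more rules for contractions, abbreviations, and punctuation.

```python
import re

def tokenize(text):
    """Toy tokenizer: keeps times (3:30) and abbreviations (p.m.) whole,
    splits contractions like "Don't" into "Do" + "n't", and separates
    punctuation into its own tokens."""
    pattern = (
        r"\d+:\d+"          # clock times: 3:30
        r"|(?:[A-Za-z]\.)+"  # dotted abbreviations: p.m., U.S.
        r"|\w+(?=n't)"       # word stem before a contraction: Do(n't)
        r"|n't"              # the contraction itself
        r"|\w+"              # ordinary words and numbers
        r"|[^\w\s]"          # any other punctuation, one char at a time
    )
    return re.findall(pattern, text)

print(tokenize("Don't forget: meet me at 3:30 p.m. tomorrow!"))
# → Do, n't, forget, :, meet, me, at, 3:30, p.m., tomorrow, !
```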
Stemming
Stemming is the process of reducing words to their root or base form by removing suffixes and prefixes. The goal is to group together different variations of a word so they can be treated as the same term. The resulting “stem” may not always be a valid dictionary word.
Why is Stemming Important?
In natural language, words appear in many different forms:
- “run”, “running”, “runs”, “ran”
- “connect”, “connected”, “connecting”, “connection”, “connections”
For many NLP tasks like search engines or text classification, these variations should be treated as the same concept. Stemming helps by:
- Reducing vocabulary size
- Improving search recall (finding more relevant results)
- Normalizing text for analysis
- Reducing computational complexity
Example (Porter Stemmer)
running -> run
runner -> run
runs -> run
fairly -> fair
fairness -> fair
connection -> connect
connected -> connect
connecting -> connect
studies -> studi
studying -> studi

NOTE: “studies” becomes “studi” (not a real word!), and that is normal in stemming.
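A crude suffix-stripping stemmer shows the idea. The rule list below is a hand-written toy; the real Porter stemmer applies several ordered phases with conditions on the shape of the remaining stem.

```python
# Toy ordered suffix rules: (suffix to strip, replacement).
RULES = [
    ("ies", "i"), ("tion", "t"), ("ness", ""), ("ing", ""),
    ("ed", ""), ("er", ""), ("ly", ""), ("es", ""), ("s", ""),
]

def stem(word):
    """Strip the first matching suffix, then clean up the leftover stem."""
    for suffix, replacement in RULES:
        # Length guard so short words like "is" aren't mangled.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            word = word[: -len(suffix)] + replacement
            break
    # Collapse a doubled final consonant left by -ing/-ed/-er stripping:
    # running -> runn -> run.
    if len(word) > 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
        word = word[:-1]
    # Porter-style final y -> i: study -> studi.
    if word.endswith("y") and len(word) > 2:
        word = word[:-1] + "i"
    return word

for w in ["running", "runner", "studies", "studying", "fairness", "connection"]:
    print(w, "->", stem(w))
# running -> run, runner -> run, studies -> studi,
# studying -> studi, fairness -> fair, connection -> connect
```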
Lemmatization
Lemmatization is the process of reducing words to their base or dictionary form, called a “lemma.” Unlike stemming, lemmatization uses vocabulary and morphological analysis to return actual, valid words. It considers the context and part of speech to determine the correct base form.
Why is Lemmatization Important?
Lemmatization provides more accurate text normalization than stemming by:
- Returning real dictionary words (lemmas)
- Considering grammatical context
- Preserving semantic meaning
- Producing more interpretable results
| Word | Stemming | Lemmatization |
|---|---|---|
| studies | studi | study |
| studying | studi | study |
| better | better | good |
| worse | wors | bad |
| running | run | run |
| ran | ran | run |
| is | is | be |
| are | are | be |
| am | am | be |
| caring | care | care |
| cares | care | care |

Key Differences:
| Aspect | Stemming | Lemmatization |
|---|---|---|
| Output | May not be a real word | Always a real word |
| Method | Rule-based (chops affixes) | Dictionary + grammar analysis |
| Speed | Faster | Slower |
| Accuracy | Lower | Higher |
| Context | Ignores context | Considers part of speech |
| Example | “studies” → “studi” | “studies” → “study” |
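A dictionary-based sketch shows why lemmatization needs the part of speech. The LEMMAS table and lemmatize function here are a hypothetical toy; real lemmatizers such as NLTK’s WordNetLemmatizer consult WordNet plus morphological rules instead of a fixed list.

```python
# Toy lemma dictionary keyed by (word, part of speech).
LEMMAS = {
    ("better", "ADJ"): "good",
    ("worse", "ADJ"): "bad",
    ("ran", "VERB"): "run",
    ("is", "VERB"): "be",
    ("are", "VERB"): "be",
    ("am", "VERB"): "be",
    ("studies", "VERB"): "study",
    ("studies", "NOUN"): "study",
    ("saw", "VERB"): "see",   # same surface form,
    ("saw", "NOUN"): "saw",   # different lemma per POS
}

def lemmatize(word, pos):
    """Look up the lemma for (word, pos); fall back to the word itself."""
    return LEMMAS.get((word.lower(), pos), word)

print(lemmatize("saw", "VERB"))  # → see  (past tense of "see")
print(lemmatize("saw", "NOUN"))  # → saw  (the tool)
```

The “saw” pair is the key point: a stemmer gives one answer regardless of context, while a lemmatizer can return different lemmas depending on the word’s grammatical role.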
Part-of-Speech (POS) Tagging
Part-of-Speech (POS) tagging is the process of assigning grammatical categories (like noun, verb, adjective, etc.) to each word in a sentence. It identifies the role each word plays based on its definition and context within the sentence.
Why is POS Tagging Important?
POS tags provide crucial grammatical information that helps machines understand:
- The syntactic structure of sentences
- Word relationships and dependencies
- Disambiguation of word meanings (e.g., “book” as noun vs. verb)
- Context for downstream NLP tasks
Examples
| Tag | Category | Examples |
|---|---|---|
| NN | Noun | cat, book |
| VB | Verb | run, eat |
| JJ | Adjective | happy, blue |
| RB | Adverb | quickly, very |
| PRP | Pronoun | I, you, he |
| DT | Determiner | the, a, this |
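A toy tagger illustrates the “book as noun vs. verb” disambiguation with a single context rule. The lexicon and the rule are hypothetical; real taggers (NLTK’s averaged perceptron, neural taggers in spaCy) learn such patterns from annotated corpora.

```python
# Toy word -> tag lexicon (illustrative only).
LEXICON = {
    "i": "PRP", "you": "PRP", "the": "DT", "a": "DT",
    "cat": "NN", "flight": "NN", "happy": "JJ", "quickly": "RB",
    "sleeps": "VB", "read": "VB",
}
AMBIGUOUS = {"book"}  # noun or verb depending on context

def tag(tokens):
    """Tag each token via lookup, with one context rule for ambiguous words."""
    tags = []
    for i, token in enumerate(tokens):
        word = token.lower()
        if word in AMBIGUOUS:
            # After a determiner it behaves like a noun; otherwise assume a verb.
            tags.append("NN" if i > 0 and tags[-1] == "DT" else "VB")
        else:
            tags.append(LEXICON.get(word, "NN"))  # default to noun
    return list(zip(tokens, tags))

print(tag("I book the flight".split()))  # book → VB (verb)
print(tag("The book".split()))           # book → NN (noun)
```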
Named Entity Recognition (NER)
Named Entity Recognition (NER) is the process of identifying and classifying named entities in text into predefined categories such as person names, organizations, locations, dates, quantities, and more.
Why is NER Important?
- Extracts key information from unstructured text
- Enables information retrieval and knowledge extraction
- Powers search engines, chatbots, and recommendation systems
- Essential for document summarization and question answering
Examples
| Entity Type | Label | Examples |
|---|---|---|
| Person | PERSON | “John Smith”, “Marie Curie” |
| Organization | ORG | “Google”, “United Nations” |
| Location | GPE/LOC | “Paris”, “Mount Everest” |
| Date | DATE | “January 1, 2025”, “yesterday” |
| Time | TIME | “3:30 PM”, “morning” |
| Money | MONEY | “$100”, “50 euros” |
| Percentage | PERCENT | “25%”, “half” |
| Product | PRODUCT | “iPhone”, “Tesla Model 3” |
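A minimal sketch using a hypothetical gazetteer (a fixed name list) plus regexes for money and times conveys the flavor of rule-based NER. Modern systems (e.g. spaCy’s entity recognizer) use learned sequence models rather than fixed lists, which is what lets them label names they have never seen.

```python
import re

# Hypothetical gazetteer mapping known names to entity labels.
GAZETTEER = {
    "John Smith": "PERSON", "Marie Curie": "PERSON",
    "Google": "ORG", "United Nations": "ORG", "Paris": "GPE",
}
# Regex patterns for entity types with predictable shapes.
PATTERNS = [
    (r"\$\d[\d,]*", "MONEY"),          # $100, $50,000
    (r"\d+%", "PERCENT"),              # 25%
    (r"\d{1,2}:\d{2}\s?[AP]M", "TIME"),  # 3:30 PM
]

def find_entities(text):
    """Return (span, label) pairs found via gazetteer lookup and regexes."""
    entities = []
    for name, label in GAZETTEER.items():
        for match in re.finditer(re.escape(name), text):
            entities.append((match.group(), label))
    for pattern, label in PATTERNS:
        for match in re.finditer(pattern, text):
            entities.append((match.group(), label))
    return entities

print(find_entities("Marie Curie met John Smith in Paris; Google paid $100 at 3:30 PM."))
# finds: Marie Curie/PERSON, John Smith/PERSON, Paris/GPE,
#        Google/ORG, $100/MONEY, 3:30 PM/TIME
```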
