Natural Language Processing (NLP) is a field of artificial intelligence that focuses on the interaction between computers and humans through natural language. It involves developing algorithms and models that enable machines to understand, interpret, generate, and respond to human language. Every time you ask Siri a question, autocomplete suggests your next word, or ChatGPT drafts an email for you, NLP is working behind the scenes.
In this blog, we’ll cover the basics of the following topics:
- Segmentation
- Tokenization
- Stemming
- Lemmatization
- Part-of-Speech (POS) tagging
- Named Entity Recognition (NER)
Segmentation
Segmentation is the process of dividing text into meaningful units such as sentences or paragraphs. It’s the first step in the NLP pipeline, breaking down large blocks of text into smaller, manageable pieces that can be analyzed individually.
How Does It Work?
Sentence segmentation typically looks for punctuation marks like periods (.), question marks (?), and exclamation points (!). However, it’s not as simple as just splitting on every period:
Example:
"Dr. Smith works at U.S. Tech Inc. He earns $50,000 annually. Does he enjoy his job?"

A naive approach would incorrectly split at “Dr.” and “U.S.” Modern segmentation algorithms are smart enough to recognize:
- Abbreviations (Dr., Inc., U.S.)
- Decimals in numbers (3.14)
- Ellipses (…)
- The difference between a period ending a sentence vs. other uses
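The abbreviation-aware logic above can be sketched in a few lines of Python. This is a minimal illustration with a hand-picked abbreviation list; real segmenters (such as NLTK’s Punkt model) learn abbreviations from data and handle harder cases, like an abbreviation such as “Inc.” that really does end a sentence.

```python
import re

# Hand-picked abbreviation list (illustrative, not exhaustive).
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "inc.", "e.g.", "i.e."}

def segment(text):
    """Split text into sentences, skipping periods that end known abbreviations."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]", text):
        end = match.end()
        chunk = text[start:end]
        words = chunk.split()
        last_word = words[-1].lower() if words else ""
        if match.group() == "." and last_word in ABBREVIATIONS:
            continue  # abbreviation, not a sentence boundary
        sentences.append(chunk.strip())
        start = end
    rest = text[start:].strip()
    if rest:
        sentences.append(rest)
    return sentences

print(segment("Dr. Smith earns $50,000 annually. Does he enjoy his job?"))
# → ['Dr. Smith earns $50,000 annually.', 'Does he enjoy his job?']
```

Note that a naive split on every period would have produced a spurious break after “Dr.”; the abbreviation check is what prevents it.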
Tokenization
Tokenization is the process of breaking down text into smaller units called “tokens.” These tokens are typically words, but can also be subwords, characters, or phrases depending on the approach. Think of it as chopping a sentence into its individual building blocks.
Why is Tokenization Important?
Computers can’t understand text as humans do. Tokenization converts text into discrete units that machines can process and analyze. It’s a fundamental step that enables:
- Counting word frequencies
- Building vocabularies for machine learning models
- Analyzing sentence structure
- Feeding data into NLP models
Example
"Don't forget: meet me at 3:30 p.m. tomorrow!"

Simple split:
["Don't", "forget:", "meet", "me", "at", "3:30", "p.m.", "tomorrow!"]

Proper tokenization:
["Do", "n't", "forget", ":", "meet", "me", "at", "3:30", "p.m.", "tomorrow", "!"]

NOTE: Don’t confuse segmentation with tokenization. Segmentation breaks text into sentences or paragraphs (larger units), while tokenization breaks text into words or subwords (smaller units).
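A toy regex tokenizer can reproduce the proper tokenization above for this sentence. The pattern below is illustrative only; practical tokenizers (NLTK’s word_tokenize, spaCy) apply many more rules for contractions, abbreviations, and punctuation.

```python
import re

def tokenize(text):
    """Toy tokenizer: keeps times (3:30) and abbreviations (p.m.) whole,
    splits contractions like "Don't" into "Do" + "n't", and separates
    punctuation into its own tokens."""
    pattern = (
        r"\d+:\d+"          # clock times: 3:30
        r"|(?:[A-Za-z]\.)+"  # dotted abbreviations: p.m., U.S.
        r"|\w+(?=n't)"       # word stem before a contraction: Do(n't)
        r"|n't"              # the contraction itself
        r"|\w+"              # ordinary words and numbers
        r"|[^\w\s]"          # any other punctuation, one char at a time
    )
    return re.findall(pattern, text)

print(tokenize("Don't forget: meet me at 3:30 p.m. tomorrow!"))
# → Do, n't, forget, :, meet, me, at, 3:30, p.m., tomorrow, !
```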
Stemming
Stemming is the process of reducing words to their root or base form by removing suffixes and prefixes. The goal is to group together different variations of a word so they can be treated as the same term. The resulting “stem” may not always be a valid dictionary word.
Why is Stemming Important?
In natural language, words appear in many different forms:
- “run”, “running”, “runs”, “ran”
- “connect”, “connected”, “connecting”, “connection”, “connections”
For many NLP tasks like search engines or text classification, these variations should be treated as the same concept. Stemming helps by:
- Reducing vocabulary size
- Improving search recall (finding more relevant results)
- Normalizing text for analysis
- Reducing computational complexity
Example (Porter Stemmer)
running -> run
runner -> run
runs -> run
fairly -> fair
fairness -> fair
connection -> connect
connected -> connect
connecting -> connect
studies -> studi
studying -> studi

NOTE: “studies” becomes “studi” (not a real word!), and that is normal in stemming.
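A crude suffix-stripping stemmer shows the idea. The rule list below is a hand-written toy; the real Porter stemmer applies several ordered phases with conditions on the shape of the remaining stem.

```python
# Toy ordered suffix rules: (suffix to strip, replacement).
RULES = [
    ("ies", "i"), ("tion", "t"), ("ness", ""), ("ing", ""),
    ("ed", ""), ("er", ""), ("ly", ""), ("es", ""), ("s", ""),
]

def stem(word):
    """Strip the first matching suffix, then clean up the leftover stem."""
    for suffix, replacement in RULES:
        # Length guard so short words like "is" aren't mangled.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            word = word[: -len(suffix)] + replacement
            break
    # Collapse a doubled final consonant left by -ing/-ed/-er stripping:
    # running -> runn -> run.
    if len(word) > 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
        word = word[:-1]
    # Porter-style final y -> i: study -> studi.
    if word.endswith("y") and len(word) > 2:
        word = word[:-1] + "i"
    return word

for w in ["running", "runner", "studies", "studying", "fairness", "connection"]:
    print(w, "->", stem(w))
# running -> run, runner -> run, studies -> studi,
# studying -> studi, fairness -> fair, connection -> connect
```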
Lemmatization
Lemmatization is the process of reducing words to their base or dictionary form, called a “lemma.” Unlike stemming, lemmatization uses vocabulary and morphological analysis to return actual, valid words. It considers the context and part of speech to determine the correct base form.
Why is Lemmatization Important?
Lemmatization provides more accurate text normalization than stemming by:
- Returning real dictionary words (lemmas)
- Considering grammatical context
- Preserving semantic meaning
- Producing more interpretable results
| Word | Stemming | Lemmatization |
|---|---|---|
| studies | studi | study |
| studying | studi | study |
| better | better | good |
| worse | wors | bad |
| running | run | run |
| ran | ran | run |
| is | is | be |
| are | are | be |
| am | am | be |
| caring | care | care |
| cares | care | care |

Key Differences:
| Aspect | Stemming | Lemmatization |
|---|---|---|
| Output | May not be a real word | Always a real word |
| Method | Rule-based (chops affixes) | Dictionary + grammar analysis |
| Speed | Faster | Slower |
| Accuracy | Lower | Higher |
| Context | Ignores context | Considers part of speech |
| Example | “studies” → “studi” | “studies” → “study” |
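A dictionary-based sketch shows why lemmatization needs the part of speech. The LEMMAS table and lemmatize function here are a hypothetical toy; real lemmatizers such as NLTK’s WordNetLemmatizer consult WordNet plus morphological rules instead of a fixed list.

```python
# Toy lemma dictionary keyed by (word, part of speech).
LEMMAS = {
    ("better", "ADJ"): "good",
    ("worse", "ADJ"): "bad",
    ("ran", "VERB"): "run",
    ("is", "VERB"): "be",
    ("are", "VERB"): "be",
    ("am", "VERB"): "be",
    ("studies", "VERB"): "study",
    ("studies", "NOUN"): "study",
    ("saw", "VERB"): "see",   # same surface form,
    ("saw", "NOUN"): "saw",   # different lemma per POS
}

def lemmatize(word, pos):
    """Look up the lemma for (word, pos); fall back to the word itself."""
    return LEMMAS.get((word.lower(), pos), word)

print(lemmatize("saw", "VERB"))  # → see  (past tense of "see")
print(lemmatize("saw", "NOUN"))  # → saw  (the tool)
```

The “saw” pair is the key point: a stemmer gives one answer regardless of context, while a lemmatizer can return different lemmas depending on the word’s grammatical role.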
Part-of-Speech (POS) Tagging
Part-of-Speech (POS) tagging is the process of assigning grammatical categories (like noun, verb, adjective, etc.) to each word in a sentence. It identifies the role each word plays based on its definition and context within the sentence.
Why is POS Tagging Important?
POS tags provide crucial grammatical information that helps machines understand:
- The syntactic structure of sentences
- Word relationships and dependencies
- Disambiguation of word meanings (e.g., “book” as noun vs. verb)
- Context for downstream NLP tasks
Examples
| Tag | Category | Examples |
|---|---|---|
| NN | Noun | cat, book |
| VB | Verb | run, eat |
| JJ | Adjective | happy, blue |
| RB | Adverb | quickly, very |
| PRP | Pronoun | I, you, he |
| DT | Determiner | the, a, this |
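A toy tagger illustrates the “book as noun vs. verb” disambiguation with a single context rule. The lexicon and the rule are hypothetical; real taggers (NLTK’s averaged perceptron, neural taggers in spaCy) learn such patterns from annotated corpora.

```python
# Toy word -> tag lexicon (illustrative only).
LEXICON = {
    "i": "PRP", "you": "PRP", "the": "DT", "a": "DT",
    "cat": "NN", "flight": "NN", "happy": "JJ", "quickly": "RB",
    "sleeps": "VB", "read": "VB",
}
AMBIGUOUS = {"book"}  # noun or verb depending on context

def tag(tokens):
    """Tag each token via lookup, with one context rule for ambiguous words."""
    tags = []
    for i, token in enumerate(tokens):
        word = token.lower()
        if word in AMBIGUOUS:
            # After a determiner it behaves like a noun; otherwise assume a verb.
            tags.append("NN" if i > 0 and tags[-1] == "DT" else "VB")
        else:
            tags.append(LEXICON.get(word, "NN"))  # default to noun
    return list(zip(tokens, tags))

print(tag("I book the flight".split()))  # book → VB (verb)
print(tag("The book".split()))           # book → NN (noun)
```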
Named Entity Recognition (NER)
Named Entity Recognition (NER) is the process of identifying and classifying named entities in text into predefined categories such as person names, organizations, locations, dates, quantities, and more.
Why is NER Important?
- Extracts key information from unstructured text
- Enables information retrieval and knowledge extraction
- Powers search engines, chatbots, and recommendation systems
- Essential for document summarization and question answering
Examples
| Entity Type | Label | Examples |
|---|---|---|
| Person | PERSON | “John Smith”, “Marie Curie” |
| Organization | ORG | “Google”, “United Nations” |
| Location | GPE/LOC | “Paris”, “Mount Everest” |
| Date | DATE | “January 1, 2025”, “yesterday” |
| Time | TIME | “3:30 PM”, “morning” |
| Money | MONEY | “$100”, “50 euros” |
| Percentage | PERCENT | “25%”, “half” |
| Product | PRODUCT | “iPhone”, “Tesla Model 3” |
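A minimal sketch using a hypothetical gazetteer (a fixed name list) plus regexes for money and times conveys the flavor of rule-based NER. Modern systems (e.g. spaCy’s entity recognizer) use learned sequence models rather than fixed lists, which is what lets them label names they have never seen.

```python
import re

# Hypothetical gazetteer mapping known names to entity labels.
GAZETTEER = {
    "John Smith": "PERSON", "Marie Curie": "PERSON",
    "Google": "ORG", "United Nations": "ORG", "Paris": "GPE",
}
# Regex patterns for entity types with predictable shapes.
PATTERNS = [
    (r"\$\d[\d,]*", "MONEY"),          # $100, $50,000
    (r"\d+%", "PERCENT"),              # 25%
    (r"\d{1,2}:\d{2}\s?[AP]M", "TIME"),  # 3:30 PM
]

def find_entities(text):
    """Return (span, label) pairs found via gazetteer lookup and regexes."""
    entities = []
    for name, label in GAZETTEER.items():
        for match in re.finditer(re.escape(name), text):
            entities.append((match.group(), label))
    for pattern, label in PATTERNS:
        for match in re.finditer(pattern, text):
            entities.append((match.group(), label))
    return entities

print(find_entities("Marie Curie met John Smith in Paris; Google paid $100 at 3:30 PM."))
# finds: Marie Curie/PERSON, John Smith/PERSON, Paris/GPE,
#        Google/ORG, $100/MONEY, 3:30 PM/TIME
```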
