NLP Techniques Every Data Scientist Should Know

Natural Language Processing (NLP) is a critical field in data science, especially with the growth of data generated from online sources like social media and product reviews. Whether or not you are aiming for a career in NLP, there are some NLP techniques every Data Scientist should know when working with textual datasets. So, if you want to learn the essential NLP techniques, this article is for you. In this article, I’ll take you through some NLP techniques every Data Scientist should know, with implementations in Python.

NLP Techniques Every Data Scientist Should Know

Here are some NLP techniques that every Data Scientist should know while working with textual datasets:

  1. Tokenization
  2. Stop words removal
  3. Stemming and Lemmatization
  4. Named Entity Recognition
  5. Term Frequency-Inverse Document Frequency
  6. Bag of Words

Let’s explore all these NLP techniques in detail with implementation using Python.

Tokenization

Tokenization is the process of breaking down text into smaller pieces, called tokens, which could be words, sentences, or other units. It’s often the first step in text preprocessing for tasks like sentiment analysis or topic modeling.

It converts text into a structured form that algorithms can manipulate.

Here’s how to implement tokenization using Python:

import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')  # download the tokenizer models (only needed once)

text = "Hi, my name is Aman Kharwal"
tokens = word_tokenize(text)
print(tokens)
['Hi', ',', 'my', 'name', 'is', 'Aman', 'Kharwal']
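
Tokenization also works at the sentence level, which is useful when you need to process text one sentence at a time. Here is a minimal sketch, assuming the punkt models downloaded above, that splits a short paragraph into sentence tokens:

from nltk.tokenize import sent_tokenize

text = "Hi, my name is Aman Kharwal. I am learning NLP."
sentences = sent_tokenize(text)
print(sentences)
['Hi, my name is Aman Kharwal.', 'I am learning NLP.']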

Stop Words Removal

Stop words are common words like “and”, “the”, “a”, which often don’t contribute much to the meaning of a sentence, particularly in tasks like sentiment analysis or topic modeling.

Removing these can reduce the dataset size and improve the processing time.

Here’s how to remove stop words from a piece of text using Python:

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')  # download the stop word list (only needed once)
stop_words = set(stopwords.words('english'))

text = "Hi, my name is Aman Kharwal."
tokens = word_tokenize(text)
filtered_tokens = [word for word in tokens if word not in stop_words]
print(filtered_tokens)
['Hi', ',', 'name', 'Aman', 'Kharwal', '.']
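
One detail worth noting: the NLTK stop word list is all lowercase, so a capitalized stop word such as "My" at the start of a sentence would not be removed by the comparison above. A minimal sketch of the usual fix, reusing the tokens and stop_words from above, is to lowercase each token before checking it:

filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
print(filtered_tokens)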

Stemming and Lemmatization

These techniques are used to reduce words to their base or root form. Stemming simply chops off word endings, so it is fast but may not always produce a real word (e.g., “running” becomes “run”, while “studies” becomes “studi”).

On the other hand, Lemmatization reduces words to their dictionary form (the lemma), taking their part of speech in the sentence into account (e.g., “better” becomes “good” when used as an adjective).

These methods are useful in search engines and recommendation systems.

Here’s how to implement Stemming and Lemmatization using Python:

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download('wordnet')  # WordNet data used by the lemmatizer
nltk.download('omw-1.4')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

text = "running runs ran"
tokens = word_tokenize(text)

stemmed_words = [stemmer.stem(word) for word in tokens]
lemmatized_words = [lemmatizer.lemmatize(word) for word in tokens]

print("Stemmed:", stemmed_words)
print("Lemmatized:", lemmatized_words)
Stemmed: ['run', 'run', 'ran']
Lemmatized: ['running', 'run', 'ran']
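
By default, WordNetLemmatizer treats every word as a noun, which is why "running" is left unchanged in the output above. Passing a part-of-speech tag gives the lemmatizer more to work with; here is a minimal sketch, reusing the lemmatizer created above, that covers the verb case and the "better" to "good" adjective example mentioned earlier:

print(lemmatizer.lemmatize("running", pos="v"))
print(lemmatizer.lemmatize("better", pos="a"))
run
good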

Named Entity Recognition (NER)

NER identifies and classifies named entities in text into pre-defined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, etc.

It is crucial for data extraction in business intelligence, summarization, and more.

Here’s how to identify named entities from a piece of text using Python:

import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_sm")  # small English model (install with: python -m spacy download en_core_web_sm)

text = "Hi, my name is Aman Kharwal, I work at Statso.io"
doc = nlp(text)

displacy.render(doc, style='ent', jupyter=True, options={'ents': ['PERSON', 'ORG'], 'colors': {'PERSON': 'lightblue', 'ORG': 'lime'}})
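
The displacy call above renders the text with the PERSON and ORG entities highlighted, which works inside a Jupyter notebook. If you are running the code as a plain script instead, a minimal alternative is to loop over doc.ents and print each entity with its label:

for ent in doc.ents:
    print(ent.text, ent.label_)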

TF-IDF (Term Frequency-Inverse Document Frequency)

TF-IDF is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. It is often used in document search and information retrieval, helping to determine which documents are most relevant to a query based on the words they contain.

This technique is crucial for feature extraction in machine learning models for text classification.

Here’s how to implement TF-IDF using Python:

from sklearn.feature_extraction.text import TfidfVectorizer

# sample documents
documents = [
    "My name is Aman Kharwal",
    "I work at Statso.io",
    "We are learning NLP Techniques today!"
]

# create a TfidfVectorizer object
vectorizer = TfidfVectorizer()

# fit and transform the documents
tfidf_matrix = vectorizer.fit_transform(documents)

# view the TF-IDF values for the first document
feature_names = vectorizer.get_feature_names_out()
first_document_vector = tfidf_matrix[0]

print("Feature names:", feature_names)
print("TF-IDF values for the first document:")
print(first_document_vector.toarray())
Feature names: ['aman' 'are' 'at' 'io' 'is' 'kharwal' 'learning' 'my' 'name' 'nlp'
'statso' 'techniques' 'today' 'we' 'work']
TF-IDF values for the first document:
[[0.4472136 0. 0. 0. 0.4472136 0.4472136 0.
0.4472136 0.4472136 0. 0. 0. 0. 0.
0. ]]
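
These numbers are easy to verify by hand: all five words of the first document ("my", "name", "is", "aman", "kharwal") appear exactly once, and each appears only in that one document, so they all receive the same weight. After the L2 normalization that TfidfVectorizer applies by default, each value becomes 1 divided by the square root of 5:

import math
print(round(1 / math.sqrt(5), 7))
0.4472136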

Bag of Words

The Bag of Words model is a simplified representation used in NLP and information retrieval. In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.

BoW is commonly used in document classification where the frequency of each word is used as a feature for training a classifier.

Here’s how to use the Bag of Words model using Python:

from sklearn.feature_extraction.text import CountVectorizer

# sample documents
documents = [
    "the cat is on the table",
    "the dog is in the house",
    "cats and dogs are pets"
]

# create a CountVectorizer object
vectorizer = CountVectorizer()

# fit and transform the documents
bow_matrix = vectorizer.fit_transform(documents)

# get the feature names
feature_names = vectorizer.get_feature_names_out()

# convert the BoW matrix into an array and print it
bow_array = bow_matrix.toarray()
print("Feature names:", feature_names)
print("Bag of Words Array:")
print(bow_array)
Feature names: ['and' 'are' 'cat' 'cats' 'dog' 'dogs' 'house' 'in' 'is' 'on' 'pets'
'table' 'the']
Bag of Words Array:
[[0 0 1 0 0 0 0 0 1 1 0 1 2]
[0 0 0 0 1 0 1 1 1 0 0 0 2]
[1 1 0 1 0 1 0 0 0 0 1 0 0]]
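
The array is easier to interpret when each column is paired with its feature name. A small sketch, assuming pandas is installed, turns the BoW matrix into a labelled table:

import pandas as pd

bow_df = pd.DataFrame(bow_array, columns=feature_names)
print(bow_df)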

Summary

So, below are some NLP techniques that every Data Scientist should know while working with textual datasets:

  1. Tokenization
  2. Stop words removal
  3. Stemming and Lemmatization
  4. Named Entity Recognition
  5. Term Frequency-Inverse Document Frequency
  6. Bag of Words

I hope you liked this article on NLP techniques every Data Scientist should know with implementation using Python. Feel free to ask valuable questions in the comments section below. You can follow me on Instagram for many more resources.

