Natural Language Processing (NLP) is a subfield of Artificial Intelligence that aims to enable computers to understand human language. Some real-world applications of NLP are chatbots, voice assistants such as Siri, and Google Translate. Most NLP problems follow the same process: prepare a vocabulary of words from a textual dataset, convert the text to numbers, and train a model on it. So, if you want to understand the process of solving a problem based on NLP, this article is for you. In this article, I will take you through the complete process of NLP using Python.
Process of NLP
To explain the process of NLP, I will take you through the sentiment classification task using Python. The steps to solve this NLP problem are:
- Finding a dataset for sentiment classification
- Preparing the dataset by tokenization, stopwords removal, and stemming
- Text vectorization
- Training a classification model for sentiment classification
Process of NLP using Python
Step 1: Finding a Dataset
The first step while working on any NLP problem is to find a textual dataset. In this problem, we need to find a dataset containing text about the sentiments of people towards a product or service. If the dataset you found is labelled, it’s perfect! If you found an unlabelled textual dataset, you can learn how to add labels to a dataset for sentiment classification from here.
I found an ideal dataset based on movie reviews for the sentiment classification task on Kaggle. You can download the dataset from here.
As we have found a dataset for sentiment classification, let’s move further by importing the necessary Python libraries and the dataset:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
import nltk
nltk.download('stopwords')
data = pd.read_csv("IMDB Dataset.csv")
print(data.head())

                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive
Step 2: Data Preparation, Tokenization, Stopwords Removal and Stemming
Our textual dataset needs preparation before being used for any problem based on NLP. Here we will:
- remove links and all the special characters from the review column
- tokenize and remove the stopwords from the review column
- stem the words in the review column
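To make these three steps concrete, here is a minimal sketch that applies them to a single sentence. The sample sentence, the helper name prepare, and the small hand-written stopword set are assumptions made for illustration (the article itself uses NLTK's full English stopword list); the stemmer is the same NLTK Snowball stemmer used in this article.

```python
import re
import string
from nltk.stem import SnowballStemmer

# Assumption: a tiny stopword set so the example runs without downloading a corpus
SAMPLE_STOPWORDS = {"the", "i", "at", "and", "were", "it", "was"}
stemmer = SnowballStemmer("english")

def prepare(sentence):
    """Hypothetical helper: clean, tokenize, remove stopwords, and stem one sentence."""
    text = sentence.lower()
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # remove links
    text = re.sub(r"<.*?>", " ", text)                  # remove HTML tags
    text = re.sub("[%s]" % re.escape(string.punctuation), "", text)  # remove special characters
    tokens = text.split()                               # tokenize on whitespace
    tokens = [t for t in tokens if t not in SAMPLE_STOPWORDS]  # remove stopwords
    return [stemmer.stem(t) for t in tokens]            # stem each remaining token

sample = "I loved the movies!<br />See https://example.com"
prepared = prepare(sample)
print(prepared)
```

Note how "movies" is reduced to the stem "movi": stems are not always dictionary words, they only need to be consistent. The article's version below applies the same steps to the whole review column.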
import re
import string
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stemmer = nltk.SnowballStemmer("english")
stopword = set(stopwords.words('english'))

def clean(text):
    text = str(text).lower()
    text = re.sub(r'\[.*?\]', '', text)  # drop text in square brackets
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # drop links
    text = re.sub(r'<.*?>+', '', text)  # drop HTML tags such as <br />
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # drop punctuation
    text = re.sub(r'\n', '', text)  # drop newlines
    text = re.sub(r'\w*\d\w*', '', text)  # drop words containing digits
    # tokenize on spaces and remove stopwords
    text = [word for word in text.split(' ') if word not in stopword]
    text = " ".join(text)
    # stem each remaining word
    text = [stemmer.stem(word) for word in text.split(' ')]
    text = " ".join(text)
    return text

data["review"] = data["review"].apply(clean)

Before moving forward, let’s have a quick look at the wordcloud of the review column:
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS
text = " ".join(i for i in data.review)
# named wc_stopwords so it does not shadow the nltk stopwords module imported earlier
wc_stopwords = set(STOPWORDS)
wordcloud = WordCloud(stopwords=wc_stopwords, background_color="white").generate(text)
plt.figure(figsize=(15, 10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
Step 3: Text Vectorization
The next step is text vectorization. It means to transform all the text tokens into numerical vectors. Here I will first perform text vectorization on the feature column (review column) and then split the data into training and test sets:
x = np.array(data["review"])
y = np.array(data["sentiment"])
cv = CountVectorizer()
X = cv.fit_transform(x)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.20,
                                                    random_state=42)
Step 4: Text Classification
The final step in the process of NLP is to classify or cluster texts. As we are working on the problem of sentiment classification, we will now train a text classification model. Here’s how to prepare a text classification model for sentiment classification:
from sklearn.linear_model import PassiveAggressiveClassifier
model = PassiveAggressiveClassifier()
model.fit(X_train, y_train)
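The test split created in Step 3 can be used to check how well the trained model generalizes. The sketch below shows the idea on a tiny invented dataset so that it runs on its own; the toy reviews, labels, and toy_ variable names are assumptions, and accuracy is only one of several metrics that could be used here. With the article's variables, the equivalent check would be model.score(X_test, y_test).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Assumption: ten invented, already-cleaned reviews with balanced labels
toy_reviews = [
    "great movi love act", "wonder stori great cast", "love everi minut",
    "best film year", "enjoy brilliant act",
    "worst movi ever", "bore stori bad act", "terribl wast time",
    "bad film hate", "aw plot bore cast",
]
toy_labels = ["positive"] * 5 + ["negative"] * 5

toy_X = CountVectorizer().fit_transform(toy_reviews)

# Hold out 20% of the reviews, keeping both classes in each split
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_labels, test_size=0.20, random_state=42, stratify=toy_labels
)

toy_model = PassiveAggressiveClassifier(random_state=42)
toy_model.fit(toy_X_train, toy_y_train)

# Fraction of held-out reviews whose sentiment is predicted correctly
toy_accuracy = accuracy_score(toy_y_test, toy_model.predict(toy_X_test))
print(f"Accuracy on held-out toy reviews: {toy_accuracy:.2f}")
```

A score from two held-out toy reviews says little by itself; the point is only that evaluation uses data the model did not see during training.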
The dataset we used to train a sentiment classification model contains movie reviews. So let’s test the model by giving a movie review as an input:
user = input("Enter a Text: ")
# stored in user_vector so the data DataFrame is not overwritten
user_vector = cv.transform([user]).toarray()
output = model.predict(user_vector)
print(output)

Enter a Text: one of the worst movies I have ever seen!
['negative']
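One point worth keeping in mind: the vectorizer was fitted on cleaned and stemmed text, so new input generally needs the same preprocessing before it is vectorized. Otherwise a word such as "movies" will not match the stemmed vocabulary entry "movi" and is silently ignored. Here is a minimal sketch of that effect, assuming a toy vocabulary and a simplified stand-in for the clean function (simple_clean and the toy texts are hypothetical):

```python
from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer

stemmer = SnowballStemmer("english")

def simple_clean(text):
    # Simplified stand-in for the article's clean(): lowercase and stem only
    return " ".join(stemmer.stem(word) for word in text.lower().split())

# Vectorizer fitted on already-stemmed toy text
toy_cv = CountVectorizer()
toy_cv.fit(["worst movi ever", "great movi"])

raw_input_text = "worst movies ever"

# Number of input words that match the fitted vocabulary
raw_hits = int(toy_cv.transform([raw_input_text]).sum())
clean_hits = int(toy_cv.transform([simple_clean(raw_input_text)]).sum())
print(raw_hits, clean_hits)  # 2 3
```

In the article's code, this would correspond to passing clean(user) instead of user to cv.transform.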
So this is how you can approach an NLP problem such as sentiment classification using the Python programming language.
Summary
While working on an NLP problem, we need to:
- find a textual dataset
- prepare the dataset by tokenization, stopwords removal, and stemming
- perform text vectorization
- perform text classification or clustering
I hope you liked this article on the complete process of NLP using Python. Feel free to ask valuable questions in the comments section below.





