Python Modules for Data Science You Should Know

Learning NumPy, pandas, Matplotlib, and other third-party Python libraries is essential for almost every Data Science job. However, Python’s standard library also includes several built-in modules that people often overlook and only learn later. In this article, I’ll take you through some built-in Python modules that every data science professional should be familiar with, including practical examples.

Below are some essential Python modules for Data Science you should know:

pickle
datetime
os
re (regular expressions)
sqlite3
csv

Let’s go through each of these with practical examples one by one.

pickle

The pickle module implements binary protocols for serializing and de-serializing a Python object structure. It is frequently used to save trained machine learning models to disk and load them as needed, which is useful for deploying models or repeating experiments without retraining from scratch.

Here’s an example of using pickle to save and load a trained machine learning model:

import pickle
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# load a sample dataset
iris = load_iris()
X, y = iris.data, iris.target

# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# train a logistic regression model
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# save the trained model to a file using pickle
model_filename = 'logistic_regression_model.pkl'
with open(model_filename, 'wb') as file:
    pickle.dump(model, file)

print(f"Model saved to {model_filename}")

Model saved to logistic_regression_model.pkl

# load the model from the file
with open(model_filename, 'rb') as file:
    loaded_model = pickle.load(file)

# use the loaded model to make predictions
y_loaded_pred = loaded_model.predict(X_test)
loaded_accuracy = accuracy_score(y_test, y_loaded_pred)
print(f"Loaded model accuracy: {loaded_accuracy:.2f}")

Loaded model accuracy: 1.00
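
pickle is not limited to models: any picklable Python object can be round-tripped the same way. Here’s a minimal sketch, assuming a hypothetical experiment_config dictionary (its keys and values are made up for illustration), that serializes to bytes in memory with pickle.dumps and restores the object with pickle.loads:

```python
import pickle

# hypothetical experiment settings, used only for illustration
experiment_config = {
    "model": "logistic_regression",
    "max_iter": 200,
    "features": ["sepal_length", "sepal_width"],
}

# serialize to bytes in memory with the newest protocol available
payload = pickle.dumps(experiment_config, protocol=pickle.HIGHEST_PROTOCOL)

# deserialize the bytes back into an equal (but separate) object
restored_config = pickle.loads(payload)

print(type(payload).__name__)
print(restored_config == experiment_config)
print(restored_config is experiment_config)
```

Keep in mind that loading a pickle can execute arbitrary code, so only unpickle data that comes from a source you trust.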

datetime

The datetime module supplies classes for manipulating dates and times. It is useful when dealing with time-related features in your datasets.

Here’s an example of using datetime with the yfinance library to download the past year of daily stock market data:

import pandas as pd
import yfinance as yf  # install it first: pip install yfinance
from datetime import date, timedelta

# define the time period for the data
end_date = date.today().strftime("%Y-%m-%d")
start_date = (date.today() - timedelta(days=365)).strftime("%Y-%m-%d")

# list of stock tickers to download
tickers = ['AAPL', 'MSFT', 'NFLX', 'GOOG', 'TSLA']

data = yf.download(tickers, start=start_date, end=end_date, progress=False)

# reset index to bring Date into the columns for the melt function
data = data.reset_index()

# melt the DataFrame to make it long format where each row is a unique combination of Date, Ticker, and attributes
data_melted = data.melt(id_vars=['Date'], var_name=['Attribute', 'Ticker'])

# pivot the melted DataFrame to have the attributes (Open, High, Low, etc.) as columns
data_pivoted = data_melted.pivot_table(index=['Date', 'Ticker'], columns='Attribute', values='value', aggfunc='first')

# reset index to turn multi-index into columns
stock_data = data_pivoted.reset_index()

print(stock_data.head())

Attribute       Date Ticker   Adj Close       Close        High         Low  \
0         2023-05-22   AAPL  173.279739  174.199997  174.710007  173.449997   
1         2023-05-22   GOOG  125.870003  125.870003  127.050003  123.449997   
2         2023-05-22   MSFT  318.687042  321.179993  322.589996  318.010010   
3         2023-05-22   NFLX  363.010010  363.010010  372.010010  362.500000   
4         2023-05-22   TSLA  188.869995  188.869995  189.320007  180.110001   

Attribute        Open       Volume  
0          173.979996   43570900.0  
1          123.510002   29760200.0  
2          318.600006   24115700.0  
3          365.359985    5406400.0  
4          180.699997  132001400.0
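
The module is just as handy for building time-related features without any external data. Here’s a minimal sketch, assuming a hypothetical list of raw timestamp strings (raw_timestamps and the feature names are my own illustration), that parses them with strptime and derives a few simple features:

```python
from datetime import datetime, timedelta

# hypothetical order timestamps, as they might appear in a raw dataset
raw_timestamps = ["2023-05-20 09:15:00", "2023-05-22 18:40:00"]

# parse the strings into datetime objects
parsed = [datetime.strptime(ts, "%Y-%m-%d %H:%M:%S") for ts in raw_timestamps]

# derive simple time-related features (weekday(): Monday is 0, Sunday is 6)
features = [
    {
        "weekday": dt.weekday(),
        "hour": dt.hour,
        "is_weekend": dt.weekday() >= 5,
    }
    for dt in parsed
]

# subtracting two datetimes gives a timedelta
elapsed = parsed[1] - parsed[0]

print(features)
print(elapsed, elapsed <= timedelta(days=3))
```

Features like these can then be added as columns to a dataset before training a model.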

os

The os module provides a portable way of using operating-system-dependent functionality, such as creating directories, listing files, manipulating paths, and reading environment variables.

It’s essential for data scientists who need to interact with the file system to load data from logs, export results, set paths dynamically, and manage directories. It’s particularly useful in projects where data is spread across multiple directories or needs to be processed in batch scripts.

Here’s an example of using os to create a directory and list the contents of the current directory:

import os

# create a new directory (no error is raised if it already exists)
new_dir = "data_directory"
os.makedirs(new_dir, exist_ok=True)

# list current directory contents
print("Current directory contents:", os.listdir('.'))

Current directory contents: ['.config', 'data_directory', 'logistic_regression_model.pkl', 'sample_data']
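
When data is spread across multiple directories, os.walk and os.path can collect the files for batch processing. Here’s a minimal sketch, assuming a hypothetical raw/2023 and raw/2024 folder layout that the code creates inside a temporary directory (the folder and file names are for illustration only):

```python
import os
import tempfile

# work inside a temporary directory so the sketch leaves nothing behind
with tempfile.TemporaryDirectory() as base_dir:
    # hypothetical layout: raw/2023 and raw/2024 each hold one CSV file
    for year in ("2023", "2024"):
        year_dir = os.path.join(base_dir, "raw", year)
        os.makedirs(year_dir, exist_ok=True)
        with open(os.path.join(year_dir, "sales.csv"), "w") as f:
            f.write("date,amount\n")

    # a non-CSV file that should be skipped
    with open(os.path.join(base_dir, "raw", "notes.txt"), "w") as f:
        f.write("ignore me\n")

    # walk the tree and collect every CSV path, relative to base_dir
    csv_files = []
    for root, _dirs, files in os.walk(base_dir):
        for name in files:
            if os.path.splitext(name)[1].lower() == ".csv":
                full_path = os.path.join(root, name)
                csv_files.append(os.path.relpath(full_path, base_dir))

csv_files.sort()
print(csv_files)
```

Building paths with os.path.join instead of hard-coded separators keeps the same script working on Windows, macOS, and Linux.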

re (regular expressions)

The re module provides regular expression matching operations for text data manipulation. It’s invaluable for data cleaning and text data processing, such as extracting dates, phone numbers, or other specific patterns from strings. It’s commonly used in preprocessing text for natural language processing tasks.

Here’s an example of extracting emails using re from a piece of text:

import re

text = "Please contact us at support@example.com or sales@example.com"
emails = re.findall(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', text)

print("Extracted emails:", emails)

Extracted emails: ['support@example.com', 'sales@example.com']
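
The same approach works for other patterns, such as dates. Here’s a minimal sketch, assuming a hypothetical piece of log text and ISO-style YYYY-MM-DD dates (the pattern is deliberately simple and does not validate month or day ranges), that uses a compiled pattern with named groups and re.sub for whitespace cleanup:

```python
import re

# hypothetical log text, used only for illustration
text = "Order placed on 2023-05-20,   shipped on 2023-05-22.  Call 555-0142."

# compile once and reuse; named groups make each part of the match accessible
date_pattern = re.compile(r"\b(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})\b")

dates = [match.group(0) for match in date_pattern.finditer(text)]
months = [int(match.group("month")) for match in date_pattern.finditer(text)]

# collapse runs of whitespace into a single space
cleaned = re.sub(r"\s+", " ", text).strip()

print(dates)
print(months)
print(cleaned)
```

Compiling the pattern once is worthwhile when the same expression is applied to many rows of text.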

sqlite3

The sqlite3 module provides an SQL interface that allows Python applications to access SQLite, a lightweight disk-based database. It’s useful for prototyping, small-scale applications, and local data storage and retrieval where setting up a full-scale database server isn’t necessary.

Here’s an example of storing and retrieving data in an SQLite database:

import sqlite3

# connect to the database file (it is created if it does not exist)
connection = sqlite3.connect('example.db')
cursor = connection.cursor()

# create a table
cursor.execute('CREATE TABLE IF NOT EXISTS sales (date TEXT, amount INTEGER)')
connection.commit()

# insert some data
cursor.execute('INSERT INTO sales (date, amount) VALUES (?, ?)', ('2023-05-20', 1500))
connection.commit()

# query the data
cursor.execute('SELECT * FROM sales')
rows = cursor.fetchall()
for row in rows:
    print(row)

connection.close()

('2023-05-20', 1500)
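
For quick experiments, you can skip the database file entirely. Here’s a minimal sketch, assuming a few hypothetical sales rows (the values are made up for illustration), that uses an in-memory database, inserts several rows at once with executemany, and aggregates them in SQL:

```python
import sqlite3

# ":memory:" creates a throwaway database that lives only in RAM
connection = sqlite3.connect(":memory:")

# hypothetical sales rows, used only for illustration
sales_rows = [
    ("2023-05-20", 1500),
    ("2023-05-20", 700),
    ("2023-05-21", 900),
]

# used as a context manager, the connection commits on success and rolls
# back if an exception is raised (it does not close the connection)
with connection:
    connection.execute("CREATE TABLE sales (date TEXT, amount INTEGER)")
    connection.executemany(
        "INSERT INTO sales (date, amount) VALUES (?, ?)", sales_rows
    )

# aggregate in SQL instead of looping in Python
daily_totals = connection.execute(
    "SELECT date, SUM(amount) FROM sales GROUP BY date ORDER BY date"
).fetchall()

connection.close()
print(daily_totals)
```

The ? placeholders let sqlite3 bind the values safely, which is preferable to building SQL strings by hand.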

csv

The csv module helps in reading and writing tabular data in CSV format. It’s integral for importing and exporting data from spreadsheets and databases into Python for further data manipulation, analysis, and visualization, which is a common task in many data science projects that involve data extraction and preprocessing.

Here’s an example of reading and writing CSV files:

import csv

# writing data to a CSV file
data = [['Name', 'Age', 'City'], ['Alice', 30, 'New York'], ['Bob', 25, 'Los Angeles']]

with open('people.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerows(data)

# reading data from a CSV file
with open('people.csv', 'r', newline='') as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)

['Name', 'Age', 'City']
['Alice', '30', 'New York']
['Bob', '25', 'Los Angeles']
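
Notice that the reader returns every value as a string, including the ages. Here’s a minimal sketch, assuming the same sample rows and an in-memory io.StringIO buffer standing in for a file (my own substitution to keep the example self-contained), that uses csv.DictWriter and csv.DictReader and converts the Age column back to an integer:

```python
import csv
import io

# an in-memory text buffer stands in for a file on disk
buffer = io.StringIO(newline="")

fieldnames = ["Name", "Age", "City"]
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerow({"Name": "Alice", "Age": 30, "City": "New York"})
writer.writerow({"Name": "Bob", "Age": 25, "City": "Los Angeles"})

# rewind and read the rows back as dictionaries keyed by the header
buffer.seek(0)
people = []
for row in csv.DictReader(buffer):
    row["Age"] = int(row["Age"])  # csv returns strings; convert explicitly
    people.append(dict(row))

print(people)
```

Accessing columns by name instead of by position makes the code easier to read and more robust if the column order changes.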

Summary

So, here are the essential Python modules for Data Science covered in this article:

pickle
datetime
os
re (regular expressions)
sqlite3
csv

I hope you liked this article on essential Python modules for Data Science you should know.