Python Modules for Data Science You Should Know

Learning NumPy, Pandas, Matplotlib, and other third-party Python libraries is essential for almost every Data Science job. However, Python’s standard library also ships with several built-in modules that people often overlook and only learn later. In this article, I’ll take you through some built-in Python modules that every data science professional should be familiar with, including practical examples.

Below are some essential Python modules for Data Science you should know:

  1. pickle
  2. datetime
  3. os
  4. re (regular expressions)
  5. sqlite3
  6. csv

Let’s go through each of these with practical examples one by one.

pickle

The pickle module implements binary protocols for serializing and de-serializing a Python object structure. It is frequently used to save trained machine learning models to disk and load them later, which is useful for deploying models or repeating experiments without retraining from scratch.

Here’s an example of using pickle to save and load a trained machine learning model:

import pickle
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# load a sample dataset
iris = load_iris()
X, y = iris.data, iris.target

# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# train a logistic regression model
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# save the trained model to a file using pickle
model_filename = 'logistic_regression_model.pkl'
with open(model_filename, 'wb') as file:
    pickle.dump(model, file)

print(f"Model saved to {model_filename}")

Output:
Model saved to logistic_regression_model.pkl

# load the model from the file
with open(model_filename, 'rb') as file:
    loaded_model = pickle.load(file)

# use the loaded model to make predictions
y_loaded_pred = loaded_model.predict(X_test)
loaded_accuracy = accuracy_score(y_test, y_loaded_pred)
print(f"Loaded model accuracy: {loaded_accuracy:.2f}")

Output:
Loaded model accuracy: 1.00
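
Pickle is not limited to model objects. As a minimal sketch, the snippet below round-trips an ordinary dictionary through pickle.dumps and pickle.loads in memory. The experiment record and its keys are hypothetical and only serve as an illustration. Keep in mind that unpickling can execute arbitrary code, so only load pickle data that you created yourself or otherwise trust.

```python
import pickle

# hypothetical experiment record; any picklable Python object works the same way
experiment = {
    "model_name": "logistic_regression",
    "params": {"max_iter": 200},
    "scores": [0.97, 1.0, 0.93],
}

# serialize to bytes in memory using the newest protocol this interpreter supports
payload = pickle.dumps(experiment, protocol=pickle.HIGHEST_PROTOCOL)

# deserialize; only do this with data you created or otherwise trust
restored = pickle.loads(payload)

print(type(payload).__name__)
print(restored == experiment)
```

The restored object is an equal copy of the original, not the same object, which is why pickle is also handy for caching intermediate results between runs.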

datetime

The datetime module supplies classes for manipulating dates and times. It is useful when your datasets contain time-related features, for example to compute date ranges or parse timestamps.

Here’s an example of using datetime with the yfinance library to download one year of historical daily stock market data:

import pandas as pd
import yfinance as yf  # third-party library, install with: pip install yfinance
from datetime import date, timedelta

# define the time period for the data
end_date = date.today().strftime("%Y-%m-%d")
start_date = (date.today() - timedelta(days=365)).strftime("%Y-%m-%d")

# list of stock tickers to download
tickers = ['AAPL', 'MSFT', 'NFLX', 'GOOG', 'TSLA']

data = yf.download(tickers, start=start_date, end=end_date, progress=False)

# reset index to bring Date into the columns for the melt function
data = data.reset_index()

# melt the DataFrame to make it long format where each row is a unique combination of Date, Ticker, and attributes
data_melted = data.melt(id_vars=['Date'], var_name=['Attribute', 'Ticker'])

# pivot the melted DataFrame to have the attributes (Open, High, Low, etc.) as columns
data_pivoted = data_melted.pivot_table(index=['Date', 'Ticker'], columns='Attribute', values='value', aggfunc='first')

# reset index to turn multi-index into columns
stock_data = data_pivoted.reset_index()

print(stock_data.head())

Output:
Attribute       Date Ticker   Adj Close       Close        High         Low        Open       Volume
0         2023-05-22   AAPL  173.279739  174.199997  174.710007  173.449997  173.979996   43570900.0
1         2023-05-22   GOOG  125.870003  125.870003  127.050003  123.449997  123.510002   29760200.0
2         2023-05-22   MSFT  318.687042  321.179993  322.589996  318.010010  318.600006   24115700.0
3         2023-05-22   NFLX  363.010010  363.010010  372.010010  362.500000  365.359985    5406400.0
4         2023-05-22   TSLA  188.869995  188.869995  189.320007  180.110001  180.699997  132001400.0
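
For the time-related features mentioned in this section, datetime can also be used on its own, without any download. The sketch below parses timestamp strings and derives a few simple calendar features. The raw_timestamps values and the feature names are hypothetical, chosen only for illustration.

```python
from datetime import datetime, timedelta

# hypothetical raw timestamps, as they might appear in a CSV column
raw_timestamps = ["2023-05-20 09:30:00", "2023-05-22 16:00:00"]

# parse the strings into datetime objects
parsed = [datetime.strptime(ts, "%Y-%m-%d %H:%M:%S") for ts in raw_timestamps]

# derive simple calendar features for each timestamp
features = [
    {
        "weekday": dt.weekday(),          # Monday is 0, Sunday is 6
        "hour": dt.hour,
        "is_weekend": dt.weekday() >= 5,  # Saturday or Sunday
    }
    for dt in parsed
]

# subtracting two datetimes gives a timedelta
elapsed = parsed[1] - parsed[0]

print(features)
print(elapsed, elapsed.total_seconds() / 3600)
```

Features like the weekday or the hour of day are often more useful to a model than the raw timestamp itself.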

os

The os module provides a portable way to use operating-system-dependent functionality, such as creating directories, listing files, manipulating paths, and reading environment variables.

It’s essential for data scientists who need to interact with the file system to load data from logs, export results, set paths dynamically, and manage directories. It’s particularly useful in projects where data is spread across multiple directories or needs to be processed in batch scripts.

Here’s an example of using os to create a directory and list the contents of the current directory:

import os

# create a new directory (no error is raised if it already exists)
new_dir = "data_directory"
os.makedirs(new_dir, exist_ok=True)

# list current directory contents
print("Current directory contents:", os.listdir('.'))

Output:
Current directory contents: ['.config', 'data_directory', 'logistic_regression_model.pkl', 'sample_data']
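
When data is spread across multiple directories, os.walk together with os.path can collect the files for batch processing. The following is a minimal sketch: the folder and file names are hypothetical, and a temporary directory is created first so the example does not depend on any existing files.

```python
import os
import tempfile

# build a small throwaway directory tree so the sketch is self-contained
root = tempfile.mkdtemp()
for subdir, filename in [("2023", "jan.csv"), ("2023", "notes.txt"), ("2024", "feb.csv")]:
    folder = os.path.join(root, subdir)
    os.makedirs(folder, exist_ok=True)
    with open(os.path.join(folder, filename), "w") as f:
        f.write("placeholder\n")

# walk the tree and collect every CSV file, wherever it sits
csv_paths = []
for dirpath, dirnames, filenames in os.walk(root):
    for name in filenames:
        if os.path.splitext(name)[1].lower() == ".csv":
            csv_paths.append(os.path.join(dirpath, name))

csv_paths.sort()
for path in csv_paths:
    print(os.path.relpath(path, root), os.path.getsize(path), "bytes")
```

Building paths with os.path.join instead of string concatenation keeps the same script working on Windows, macOS, and Linux.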

re (regular expressions)

The re module provides regular expression matching operations for text data manipulation. It’s invaluable for data cleaning and text data processing, such as extracting dates, phone numbers, or other specific patterns from strings. It’s commonly used in preprocessing text for natural language processing tasks.

Here’s an example of extracting emails using re from a piece of text:

import re

text = "Please contact us at support@example.com or sales@example.com"
emails = re.findall(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', text)

print("Extracted emails:", emails)

Output:
Extracted emails: ['support@example.com', 'sales@example.com']
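
The same approach covers the dates and phone numbers mentioned above. The sketch below is only an illustration: the sample text is hypothetical, and the phone pattern assumes a single 3-3-4 digit format separated by hyphens, while real-world formats vary a lot.

```python
import re

# hypothetical messy text, for illustration only
text = "Order  placed on 2023-05-20,   shipped 2023-05-22. Call 555-123-4567 for help."

# ISO-style dates: four digits, two digits, two digits
dates = re.findall(r"\b\d{4}-\d{2}-\d{2}\b", text)

# one simple phone format (an assumption; real formats vary)
phones = re.findall(r"\b\d{3}-\d{3}-\d{4}\b", text)

# collapse runs of whitespace into single spaces
cleaned = re.sub(r"\s+", " ", text).strip()

print("Dates:", dates)
print("Phones:", phones)
print("Cleaned:", cleaned)
```

The \b word boundaries keep the date pattern from matching inside the phone number and the other way around.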

sqlite3

The sqlite3 module provides a SQL interface that lets Python applications work with SQLite, a lightweight disk-based database that needs no separate server process. It’s useful for prototyping, small-scale applications, and local data storage and retrieval where setting up a full-scale database server isn’t necessary.

Here’s an example of storing and retrieving data in an SQLite database:

import sqlite3

# create a database and insert some data
connection = sqlite3.connect('example.db')
cursor = connection.cursor()

# create a table
cursor.execute('CREATE TABLE IF NOT EXISTS sales (date TEXT, amount INTEGER)')
connection.commit()

# insert some data
cursor.execute('INSERT INTO sales (date, amount) VALUES (?, ?)', ('2023-05-20', 1500))
connection.commit()

# query the data
cursor.execute('SELECT * FROM sales')
rows = cursor.fetchall()
for row in rows:
    print(row)

connection.close()

Output:
('2023-05-20', 1500)
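
For quick prototyping, SQLite can also run entirely in memory, and the aggregation can be pushed into SQL. Here is a minimal sketch with hypothetical sales rows, using the special ":memory:" database name so nothing is written to disk.

```python
import sqlite3

# in-memory database: nothing is written to disk
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE sales (date TEXT, amount INTEGER)")

# hypothetical rows, for illustration only
rows = [("2023-05-20", 1500), ("2023-05-20", 700), ("2023-05-21", 900)]

# used as a context manager, the connection commits on success
# and rolls back if an exception is raised inside the block
with connection:
    connection.executemany("INSERT INTO sales (date, amount) VALUES (?, ?)", rows)

# let SQL do the aggregation instead of looping in Python
daily_totals = connection.execute(
    "SELECT date, SUM(amount) FROM sales GROUP BY date ORDER BY date"
).fetchall()

print(daily_totals)
connection.close()
```

The ? placeholders pass values separately from the SQL text, which avoids quoting problems and SQL injection.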

csv

The csv module reads and writes tabular data in CSV format. It’s integral for importing and exporting data from spreadsheets and databases into Python for further data manipulation, analysis, and visualization, a common task in data science projects that involve data extraction and preprocessing.

Here’s an example of reading and writing CSV files:

import csv

# writing data to a CSV file
data = [['Name', 'Age', 'City'], ['Alice', 30, 'New York'], ['Bob', 25, 'Los Angeles']]

with open('people.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerows(data)

# reading data from a CSV file
with open('people.csv', 'r', newline='') as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)

Output:
['Name', 'Age', 'City']
['Alice', '30', 'New York']
['Bob', '25', 'Los Angeles']
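
Notice in the output above that the ages come back as strings: the csv module does not convert types for you. As a rough sketch, csv.DictWriter and csv.DictReader let you work with column names, and you can convert the fields yourself. An in-memory io.StringIO buffer stands in for a file here, which is an assumption made to keep the example self-contained.

```python
import csv
import io

# an in-memory text buffer stands in for a file on disk
buffer = io.StringIO(newline="")
fieldnames = ["Name", "Age", "City"]

writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows([
    {"Name": "Alice", "Age": 30, "City": "New York"},
    {"Name": "Bob", "Age": 25, "City": "Los Angeles"},
])

# rewind the buffer and read the rows back as dictionaries
buffer.seek(0)
people = []
for row in csv.DictReader(buffer):
    row["Age"] = int(row["Age"])  # csv returns every field as a string
    people.append(row)

average_age = sum(p["Age"] for p in people) / len(people)

print(people)
print(average_age)
```

Accessing fields by column name keeps the code readable and makes it robust to column reordering in the source file.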

Summary

To recap, these are the built-in Python modules for Data Science covered in this article:

  1. pickle
  2. datetime
  3. os
  4. re (regular expressions)
  5. sqlite3
  6. csv

I hope you liked this article on essential Python modules for Data Science you should know.

Aman Kharwal

AI/ML Engineer | Published Author. My aim is to decode data science for the real world in the most simple words.
