PySpark is the Python API for Apache Spark, an open-source distributed computing system that lets you program entire clusters with implicit data parallelism and fault tolerance. PySpark brings the power of Spark to Python developers, allowing them to leverage its distributed data processing capabilities without leaving Python. So, if you want to learn how to use PySpark to work with data, this article is for you. In this article, I’ll take you through a practical guide that will help you get started with PySpark.
Getting Started with PySpark
To take you through a complete practical guide to PySpark, I’ll use a dataset on the Delhi Metro that contains the following columns:
- Station ID: A unique identifier for each station.
- Station Name: The name of the metro station.
- Distance from Start (km): The distance of the station from the start of the line, in kilometres.
- Line: The metro line the station belongs to.
- Opening Date: The date when the station was opened.
- Station Layout: Indicates whether the station is elevated, underground, or at grade.
- Latitude: The latitude coordinate of the station.
- Longitude: The longitude coordinate of the station.
You can download this dataset from here.
Now, let’s get started with a practical guide to PySpark. Before moving forward, install it in your Python virtual environment by executing the command below:
- pip install pyspark
PySpark Practical Guide
To use PySpark, we typically start by creating a SparkSession, an entry point to programming Spark with the Dataset and DataFrame API. Here’s how to create a SparkSession:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.appName("Delhi Metro Analysis") \
.getOrCreate()Using PySpark, we can load data from various sources, including local filesystems, HDFS, and cloud storage. As we are using a CSV file here, below is how to load our data:
df = spark.read.csv("Delhi-Metro-Network.csv", header=True, inferSchema=True)
df.show()
+----------+--------------------+------------------------+------------+------------+--------------+-----------+-----------+
|Station ID| Station Name|Distance from Start (km)| Line|Opening Date|Station Layout| Latitude| Longitude|
+----------+--------------------+------------------------+------------+------------+--------------+-----------+-----------+
| 1| Jhil Mil| 10.3| Red line| 2008-04-06| Elevated| 28.67579| 77.31239|
| 2| Welcome [Conn: Red]| 46.8| Pink line| 2018-10-31| Elevated| 28.6718| 77.27756|
| 3| DLF Phase 3| 10.0| Rapid Metro| 2013-11-14| Elevated| 28.4936| 77.0935|
| 4| Okhla NSIC| 23.8|Magenta line| 2017-12-25| Elevated| 28.5544828| 77.2648487|
| 5| Dwarka Mor| 10.2| Blue line| 2005-12-30| Elevated| 28.61932| 77.03326|
| 6|Dilli Haat INA [C...| 24.9| Pink line| 2018-06-08| Underground|28.57440755|77.21024148|
| 7| Noida Sector 143| 11.5| Aqua line| 2019-01-25| Elevated|28.50266325|77.42625566|
| 8| Moolchand| 15.1| Voilet line| 2010-03-10| Elevated| 28.56417| 77.23423|
| 9| Chawri Bazar| 15.3| Yellow line| 2005-03-07| Underground| 28.64931| 77.22637|
| 10| Maya Puri| 12.8| Pink line| 2018-03-14| Elevated| 28.6371795| 77.1297333|
| 11|Central Secretari...| 19.4| Yellow line| 2005-03-07| Underground| 28.61474| 77.21191|
| 12| Noida Sector 146| 15.8| Aqua line| 2019-01-25| Elevated| 28.4082289| 76.963024|
| 13| Tikri Border| 20.2| Green line| 2018-06-24| Elevated| 28.6880254| 76.9640828|
| 14| Jangpura| 12.9| Voilet line| 2010-03-10| Underground| 28.5843| 77.23766|
| 15| Major Mohit Sharma| 5.7| Red line| 2019-08-03| Elevated| 28.6776112| 77.3581426|
| 16| Majlis Park| 0.0| Pink line| 2018-03-14| Elevated| 28.7244312| 77.181964|
| 17| Bhikaji Cama Place| 22.6| Pink line| 2018-06-08| Underground|28.56790025|77.18701648|
| 18|Mundka Industrial...| 15.0| Green line| 2018-06-24| Elevated| 28.6834491| 77.0171333|
| 19| Belvedere Towers| 8.0| Rapid Metro| 2013-11-14| Elevated| 28.4936| 77.0935|
| 20| Adarsh Nagar| 4.7| Yellow line| 2009-04-02| Elevated| 28.71642| 77.17046|
+----------+--------------------+------------------------+------------+------------+--------------+-----------+-----------+
only showing top 20 rows
Data Exploration Functions
Once the data is loaded, you can perform operations to explore and understand it better. Below are some functions we can use to explore our data using PySpark.
To understand the data types and structure:
df.printSchema()
root
|-- Station ID: integer (nullable = true)
|-- Station Name: string (nullable = true)
|-- Distance from Start (km): double (nullable = true)
|-- Line: string (nullable = true)
|-- Opening Date: date (nullable = true)
|-- Station Layout: string (nullable = true)
|-- Latitude: double (nullable = true)
|-- Longitude: double (nullable = true)
To focus on specific columns by selecting:
df.select("Station Name", "Line").show()
+--------------------+------------+
| Station Name| Line|
+--------------------+------------+
| Jhil Mil| Red line|
| Welcome [Conn: Red]| Pink line|
| DLF Phase 3| Rapid Metro|
| Okhla NSIC|Magenta line|
| Dwarka Mor| Blue line|
|Dilli Haat INA [C...| Pink line|
| Noida Sector 143| Aqua line|
| Moolchand| Voilet line|
| Chawri Bazar| Yellow line|
| Maya Puri| Pink line|
|Central Secretari...| Yellow line|
| Noida Sector 146| Aqua line|
| Tikri Border| Green line|
| Jangpura| Voilet line|
| Major Mohit Sharma| Red line|
| Majlis Park| Pink line|
| Bhikaji Cama Place| Pink line|
|Mundka Industrial...| Green line|
| Belvedere Towers| Rapid Metro|
| Adarsh Nagar| Yellow line|
+--------------------+------------+
only showing top 20 rows
To analyze specific portions of data by filtering:
df.filter(df["Line"] == "Blue line").show()
+----------+--------------------+------------------------+---------+------------+--------------+----------+----------+
|Station ID| Station Name|Distance from Start (km)| Line|Opening Date|Station Layout| Latitude| Longitude|
+----------+--------------------+------------------------+---------+------------+--------------+----------+----------+
| 5| Dwarka Mor| 10.2|Blue line| 2005-12-30| Elevated| 28.61932| 77.03326|
| 21| Noida City Center| 47.2|Blue line| 2009-12-11| Elevated| 28.57466| 77.35608|
| 23| Dwarka Sector 9| 2.7|Blue line| 2006-01-04| Elevated| 28.57487| 77.06454|
| 26| R K Ashram Marg| 28.9|Blue line| 2005-12-30| Elevated| 28.63923| 77.2084|
| 27| Uttam Nagar West| 12.4|Blue line| 2005-12-30| Elevated| 28.62481| 77.0653|
| 30| Golf Course| 45.9|Blue line| 2009-12-11| Elevated| 28.56714| 77.34598|
| 35|Janak Puri West [...| 14.7|Blue line| 2005-12-30| Elevated| 28.62943| 77.07767|
| 37|Dwarka Sector 21(...| 0.0|Blue line| 2010-10-30| Underground| 28.55226| 77.05828|
| 38| Subhash Nagar| 17.6|Blue line| 2005-12-30| Elevated| 28.64039| 77.10495|
| 45| Dwarka Sector 14| 7.6|Blue line| 2006-01-04| Elevated| 28.60223| 77.02588|
| 49| Rajendra Place| 25.7|Blue line| 2005-12-30| Elevated| 28.6425| 77.17815|
| 53| Jhandewalan| 27.9|Blue line| 2005-12-30| Elevated| 28.64427| 77.19988|
| 60| Moti Nagar| 21.8|Blue line| 2005-12-30| Elevated| 28.65784| 77.14248|
| 69|Mayur Vihar Phase...| 38.3|Blue line| 2009-12-11| Elevated| 28.60442| 77.29455|
| 72| Noida Sector 59| 51.5|Blue line| 2019-09-03| Elevated|28.4808629|77.0848883|
| 74| Dwarka Sector 8| 1.7|Blue line| 2010-10-30| Elevated| 28.56583| 77.06706|
| 75| Noida Sector 62| 52.7|Blue line| 2019-09-03| Elevated|28.4808629|77.0848883|
| 78|Noida Sector 52 [...| 49.3|Blue line| 2019-09-03| Elevated|28.4808629|77.0848883|
| 84|Rajouri Garden [C...| 19.6|Blue line| 2005-12-30| Elevated| 28.64902| 77.1227|
| 89| Noida Sector 61| 50.5|Blue line| 2019-09-03| Elevated|28.4808629|77.0848883|
+----------+--------------------+------------------------+---------+------------+--------------+----------+----------+
only showing top 20 rows
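Filter conditions can also be combined with & and |, e.g. df.filter((df["Line"] == "Blue line") & (df["Station Layout"] == "Underground")). The predicate logic is the same as an ordinary Python condition; as a plain-Python sketch of what filtering keeps (sample rows taken from the table above — Spark, of course, actually distributes this work across partitions):

```python
# Plain-Python analogue of df.filter(df["Line"] == "Blue line"):
# keep only the rows whose "Line" field matches.
rows = [
    {"Station Name": "Dwarka Mor", "Line": "Blue line"},
    {"Station Name": "Jhil Mil", "Line": "Red line"},
    {"Station Name": "Noida City Center", "Line": "Blue line"},
]

blue = [r for r in rows if r["Line"] == "Blue line"]
print([r["Station Name"] for r in blue])  # → ['Dwarka Mor', 'Noida City Center']
```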
Data Manipulation
PySpark offers numerous functions for data manipulation, including adding columns, renaming columns, and aggregating data. Below are some functions we can use for data manipulation using PySpark.
Adding a new column (for example, converting distance from km to miles):
from pyspark.sql.functions import col
df = df.withColumn("Distance from Start (miles)", col("Distance from Start (km)") * 0.621371)
Renaming a column:
df = df.withColumnRenamed("Distance from Start (km)", "DistanceKM")
Now, let’s see an example of data aggregation as well:
from pyspark.sql import functions as F
df.groupBy("Line").agg(
F.avg("DistanceKM").alias("Average Distance"),
F.max("DistanceKM").alias("Max Distance")
).show()
+-----------------+------------------+------------+
| Line| Average Distance|Max Distance|
+-----------------+------------------+------------+
| Rapid Metro| 5.70909090909091| 10.0|
| Blue line|26.144897959183673| 52.7|
|Green line branch|1.0666666666666667| 2.1|
| Green line|11.380952380952383| 24.8|
| Blue line branch| 4.0| 8.1|
| Voilet line|20.617647058823525| 43.5|
| Yellow line|21.462162162162162| 45.7|
| Gray line| 1.8| 3.9|
| Red line| 16.55862068965517| 32.7|
| Pink line|28.773684210526323| 52.6|
| Magenta line| 17.656| 33.1|
| Orange line|10.566666666666665| 20.8|
| Aqua line|13.352380952380955| 27.1|
+-----------------+------------------+------------+
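To make concrete what F.avg and F.max compute per group, here is the same grouped aggregation sketched in plain Python over a tiny illustrative sample (two distances per line, not the full dataset):

```python
from collections import defaultdict

# What groupBy("Line").agg(F.avg(...), F.max(...)) computes, conceptually:
# collect the distances per line, then reduce each group.
rows = [
    ("Red line", 10.3), ("Red line", 5.7),
    ("Blue line", 10.2), ("Blue line", 47.2),
]

groups = defaultdict(list)
for line, dist in rows:
    groups[line].append(dist)

# For each line: (average distance, maximum distance).
summary = {line: (sum(d) / len(d), max(d)) for line, d in groups.items()}
print(summary["Red line"])  # → (8.0, 10.3)
```

Spark performs the same reduction, but in parallel across partitions, combining partial results at the end.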
PySpark allows you to run SQL queries on DataFrames after registering them as temporary views. Below is an example:
df.createOrReplaceTempView("metro")
spark.sql("SELECT Line, COUNT(*) as StationCount FROM metro GROUP BY Line").show()
+-----------------+------------+
| Line|StationCount|
+-----------------+------------+
| Rapid Metro| 11|
| Blue line| 49|
|Green line branch| 3|
| Green line| 21|
| Blue line branch| 8|
| Voilet line| 34|
| Yellow line| 37|
| Gray line| 3|
| Red line| 29|
| Pink line| 38|
| Magenta line| 25|
| Orange line| 6|
| Aqua line| 21|
+-----------------+------------+
Plotting Graphs
Plotting and visualization are essential for data exploration and analysis, providing a graphical representation of your data to uncover patterns, trends, and outliers. While PySpark itself does not offer built-in plotting capabilities, you can use PySpark to preprocess and aggregate your data, then convert it to a Pandas DataFrame for visualization with libraries like Matplotlib, Seaborn, or Plotly.
Before plotting, we often need to aggregate, filter, or otherwise prepare data using PySpark. For example, if we wanted to visualize the number of stations per line in the Delhi Metro dataset, we might start with:
stations_per_line = df.groupBy("Line").count()
Once the data is aggregated or filtered as needed, we can convert the PySpark DataFrame to a pandas DataFrame. This conversion brings all of the data onto a single node (the driver), so it’s best done with aggregated data or small subsets to avoid memory issues. Here’s how to convert it into a pandas DataFrame:
stations_per_line_pd = stations_per_line.toPandas()
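Once converted, the result is an ordinary pandas DataFrame, so the usual pandas operations apply. A small stand-in frame (counts taken from the station-count query output above, three lines only for illustration) shows a useful pre-plotting step, sorting by station count:

```python
import pandas as pd

# Stand-in for stations_per_line.toPandas(); counts come from the
# SQL station-count output above (three lines only, for illustration).
stations_pd = pd.DataFrame(
    {"Line": ["Aqua line", "Blue line", "Red line"], "count": [21, 49, 29]}
)

# Sorting by count usually makes the resulting chart easier to read.
stations_sorted = stations_pd.sort_values("count", ascending=False).reset_index(drop=True)
print(stations_sorted["Line"].tolist())  # → ['Blue line', 'Red line', 'Aqua line']
```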
Now, below is an example graph using Matplotlib:
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
plt.plot(stations_per_line_pd['Line'], stations_per_line_pd['count'], marker='o')
plt.title('Number of Stations per Line')
plt.xlabel('Line')
plt.ylabel('Number of Stations')
plt.xticks(rotation=45)
plt.grid(True)
plt.show()
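Since metro lines are categories rather than points on a continuous axis, a bar chart may read better than a line plot. Here is a sketch, assuming Matplotlib and pandas are installed; it uses a small stand-in DataFrame (counts from the SQL output above) and the non-interactive Agg backend so it also runs headless:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this and use plt.show() interactively
import matplotlib.pyplot as plt
import pandas as pd

# Small stand-in for stations_per_line_pd (counts from the SQL output above).
data = pd.DataFrame({"Line": ["Blue line", "Red line", "Aqua line"],
                     "count": [49, 29, 21]})

fig, ax = plt.subplots(figsize=(10, 6))
ax.bar(data["Line"], data["count"])
ax.set_title("Number of Stations per Line")
ax.set_xlabel("Line")
ax.set_ylabel("Number of Stations")
ax.tick_params(axis="x", labelrotation=45)
fig.savefig("stations_per_line.png")
```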
Summary
So, these were some essential fundamentals of PySpark. You can find many more examples of PySpark functions in the official documentation here. As a data professional, you should reach for PySpark when your dataset is too large to be processed efficiently with Pandas. It is particularly relevant for big data analytics, where datasets can range from gigabytes to petabytes.
I hope you liked this article on a practical guide to PySpark to work with data. Feel free to ask valuable questions in the comments section below. You can follow me on Instagram for many more resources.