Geospatial Clustering with Python

Geospatial clustering is the process of grouping geographical points, such as delivery locations, store addresses, or sensor coordinates, into clusters based on their physical proximity to one another on the Earth’s surface. If you want to learn about geospatial clustering as a Data Scientist, this article is for you. In this article, I’ll take you through a practical guide to geospatial clustering with Python.

What is Geospatial Clustering?

At its core, geospatial clustering is unsupervised learning applied to latitude and longitude data. Whatever the application, the goal is the same: to find meaningful groupings or patterns in spatial data so that location-based decisions become smarter.

Let’s look at some applications of geospatial clustering:

  1. Logistics: When you want to create delivery zones.
  2. Retail: When you want to identify dense areas of customer activity to open a new store.
  3. Urban planning: When you want to detect high-demand zones for public transport.
  4. Crime analysis: When you want to find crime hotspots.

All these use cases have two things in common:

  1. You’re working with location data.
  2. You want to uncover natural groupings.

Geospatial Clustering with Python

Now, let’s see how to perform geospatial clustering using Python. The dataset I will be using for this task contains delivery pickup and drop-off locations. You can download this dataset from here.

Now, let’s import the necessary Python libraries and the dataset:

import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from geopy.distance import geodesic

data = pd.read_csv("/content/deliverytime.txt")
data.head()
Figure: First rows of the dataset for geospatial clustering. The dataset contains more columns than shown.

Now, we will calculate the real-world distance between the pickup point and the delivery location using the geodesic formula:

def calculate_distance(row):
    return geodesic(
        (row['Restaurant_latitude'], row['Restaurant_longitude']),
        (row['Delivery_location_latitude'], row['Delivery_location_longitude'])
    ).km

data['Distance_km'] = data.apply(calculate_distance, axis=1)

Here, we defined a function, calculate_distance, that takes a row of the dataset and computes the geographic (real-world) distance in kilometres between the restaurant and delivery coordinates using the geodesic method from the geopy library. We then used .apply() with axis=1 to apply this function row-wise and create a new column Distance_km, containing the distance for each delivery.
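
As a point of comparison, here is a minimal sketch of the haversine formula, which approximates the same distance by treating the Earth as a sphere. geopy's geodesic uses an ellipsoidal model instead (WGS-84 by default), so the two results typically differ by a fraction of a percent. The function name haversine_km, the radius constant, and the sample coordinates are assumptions made for illustration and are not part of the dataset or of geopy.

```python
import math

# Mean Earth radius in kilometres; using a single radius is the spherical approximation
EARTH_RADIUS_KM = 6371.0088

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Hypothetical restaurant and delivery coordinates, used only for illustration
restaurant = (12.9716, 77.5946)
customer = (13.0827, 80.2707)
print(round(haversine_km(*restaurant, *customer), 1), "km")
```

Because the formula uses only elementary trigonometry, it can also be rewritten with NumPy to operate on whole columns at once, which may be faster than a row-wise .apply() on large datasets.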

Now, let’s visualize all delivery locations across India on an interactive map using Plotly:

import plotly.graph_objects as go

fig = go.Figure()

fig.add_trace(go.Scattergeo(
    lon=data['Delivery_location_longitude'],
    lat=data['Delivery_location_latitude'],
    mode='markers',
    marker=dict(color='blue', size=6, opacity=0.7),
    name='Delivery Locations',
    hovertemplate='Lat: %{lat:.4f}<br>Lon: %{lon:.4f}<extra>Delivery</extra>'
))

fig.update_layout(
    title='📦 Mapping Our Reach — Delivery Locations Across India',
    geo=dict(
        scope='asia',
        showland=True,
        landcolor='rgb(229, 229, 229)',
        showcountries=True,
        countrycolor='rgb(200, 200, 200)',
        showlakes=False,
        lonaxis=dict(range=[68, 98]),  # focus on India
        lataxis=dict(range=[6, 38])
    ),
    margin=dict(l=0, r=0, t=60, b=0),
    showlegend=False
)

fig.show()
Figure: Delivery locations across India (interactive Plotly map).

The graph shows that delivery activity is concentrated predominantly in the southern and central regions of India, with notable clusters around states like Karnataka, Tamil Nadu, and Maharashtra. There’s also a moderate spread into central and eastern parts, but relatively fewer delivery points in the northern and northeastern zones, indicating potential regions for service expansion or underutilization.

Performing K-Means Clustering

Now, let’s perform K-Means clustering on delivery locations and visualize the clusters along with their geographic centroids:

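from sklearn.cluster import KMeans

# Feature matrix: latitude first, longitude second (the centroid columns follow this order)
X = data[['Delivery_location_latitude', 'Delivery_location_longitude']]
k = 3  # number of clusters
kmeans = KMeans(n_clusters=k, random_state=42)  # fixed seed so repeated runs give the same clusters
data['Cluster'] = kmeans.fit_predict(X)
centroids = kmeans.cluster_centers_  # array of shape (k, 2): [latitude, longitude]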

fig = go.Figure()

for cluster_label in sorted(data['Cluster'].unique()):
    cluster_data = data[data['Cluster'] == cluster_label]
    fig.add_trace(go.Scattergeo(
        lon=cluster_data['Delivery_location_longitude'],
        lat=cluster_data['Delivery_location_latitude'],
        mode='markers',
        name=f'Cluster {cluster_label}',
        marker=dict(size=6, opacity=0.7),
        hovertemplate='<b>Cluster:</b> %{text}<br>Lat: %{lat:.4f}<br>Lon: %{lon:.4f}<extra></extra>',
        text=[f"{cluster_label}"] * len(cluster_data)
    ))

fig.add_trace(go.Scattergeo(
    lon=centroids[:, 1],
    lat=centroids[:, 0],
    mode='markers',
    name='Centroids',
    marker=dict(size=15, symbol='x', color='red', line=dict(width=2, color='black')),
    hovertemplate='<b>Centroid</b><br>Lat: %{lat:.4f}<br>Lon: %{lon:.4f}<extra></extra>'
))

fig.update_layout(
    title=f'📍 Geo-Spatial Clustering of Delivery Locations (k = {k})',
    geo=dict(
        scope='asia',
        showland=True,
        landcolor="rgb(229, 229, 229)",
        showcountries=True,
        countrycolor="rgb(204, 204, 204)",
        lonaxis=dict(range=[68, 98]),
        lataxis=dict(range=[6, 38]),
    ),
    legend_title='Clusters',
    margin=dict(l=0, r=0, t=60, b=0)
)

fig.show()
Figure: Geo-spatial clustering of delivery locations (k = 3).
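
To make the clustering step less of a black box, the following is a minimal sketch of the assign-and-update loop (Lloyd's algorithm) that K-Means is built on. It is not scikit-learn's implementation, which adds k-means++ initialization and other optimizations. Like the KMeans call above, it treats latitude and longitude as plain Euclidean coordinates in degrees. The function name, the sample points, and the starting centroids are all hypothetical.

```python
def kmeans_sketch(points, centroids, iterations=10):
    """Plain Lloyd's algorithm on (lat, lon) tuples; assumes iterations >= 1."""
    for _ in range(iterations):
        # Assignment step: each point is labelled with the index of its nearest centroid
        labels = [
            min(
                range(len(centroids)),
                key=lambda c: (p[0] - centroids[c][0]) ** 2 + (p[1] - centroids[c][1]) ** 2,
            )
            for p in points
        ]
        # Update step: each centroid moves to the mean of the points assigned to it
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                mean_lat = sum(m[0] for m in members) / len(members)
                mean_lon = sum(m[1] for m in members) / len(members)
                new_centroids.append((mean_lat, mean_lon))
            else:
                new_centroids.append(centroids[c])  # an empty cluster keeps its old centroid
        centroids = new_centroids
    return labels, centroids

# Hypothetical delivery points as (lat, lon) and hand-picked starting centroids
sample_points = [(12.9, 77.6), (13.1, 80.3), (19.1, 72.9), (18.5, 73.9)]
start_centroids = [(12.0, 78.0), (19.0, 73.0)]
sample_labels, final_centroids = kmeans_sketch(sample_points, start_centroids)
print(sample_labels)
print(final_centroids)
```

Passing the starting centroids explicitly keeps the sketch deterministic; in scikit-learn, the random_state argument plays that role.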

Cluster 0 (blue) represents the Central Delivery Zone, covering areas like Maharashtra and Madhya Pradesh, while Cluster 2 (green) forms the Southern Delivery Zone, focused around Tamil Nadu and Karnataka. However, Cluster 1 includes points that lie outside Indian geographic boundaries, indicating outliers or invalid coordinates likely due to GPS errors or data entry issues.
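
A quick coordinate sanity check can confirm which points fall outside India without relying on cluster labels. The sketch below uses a rough bounding box borrowed from the longitude and latitude ranges of the maps in this article (68 to 98 and 6 to 38); that box is an assumption for illustration, not an official boundary, and the sample points are hypothetical.

```python
# Rough bounding box taken from the map ranges used in this article (assumed, not official)
LAT_RANGE = (6.0, 38.0)
LON_RANGE = (68.0, 98.0)

def in_bounding_box(lat, lon, lat_range=LAT_RANGE, lon_range=LON_RANGE):
    """Return True if the point lies inside the box (edges included)."""
    return lat_range[0] <= lat <= lat_range[1] and lon_range[0] <= lon <= lon_range[1]

# Hypothetical (lat, lon) points: two plausible locations and two invalid ones
points = [(12.97, 77.59), (0.0, 0.0), (-27.16, -78.04), (19.07, 72.87)]
valid_points = [p for p in points if in_bounding_box(*p)]
print(valid_points)  # [(12.97, 77.59), (19.07, 72.87)]
```

On the DataFrame itself, the same test could be expressed with pandas' Series.between on the Delivery_location_latitude and Delivery_location_longitude columns, combining the two boolean masks with &.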

Now, let’s remove the outlier cluster and label valid delivery segments for optimized logistics planning:

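# Drop the outlier cluster; .copy() avoids pandas' SettingWithCopyWarning
# when a new column is added below
filtered_data = data[data['Cluster'] != 1].copy()
filtered_centroids = centroids[[0, 2]]  # keep only the centroids of Cluster 0 and 2

# Map cluster IDs to descriptive zone names
cluster_labels = {
    0: "Central Delivery Zone",
    2: "Southern Delivery Zone"
}
filtered_data['Optimized_Zone'] = filtered_data['Cluster'].map(cluster_labels)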

Here, we filtered out Cluster 1, which represents outliers located outside India’s geographical boundaries. Then, we renamed the remaining valid clusters (Cluster 0 as “Central Delivery Zone” and Cluster 2 as “Southern Delivery Zone”) to give business context to the spatial segments. This final step transforms raw geospatial clusters into meaningful delivery zones that can be used for route optimization, staffing, and strategic planning.
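
One caveat: K-Means cluster IDs are arbitrary, so a different random_state or library version can assign different numbers to the same groups, which makes hardcoded IDs fragile. A possible alternative, sketched below, names the zones from the latitude of each centroid instead. The helper name_zones_by_latitude and the example centroid values are hypothetical and only illustrate the idea.

```python
def name_zones_by_latitude(centroids_by_id, names_south_to_north):
    """Map cluster IDs to zone names, ordered from southernmost to northernmost centroid.

    centroids_by_id: dict of cluster ID -> (lat, lon)
    names_south_to_north: zone names listed from south to north
    """
    ordered_ids = sorted(centroids_by_id, key=lambda cid: centroids_by_id[cid][0])
    return dict(zip(ordered_ids, names_south_to_north))

# Hypothetical centroids for the two valid clusters, as (lat, lon)
example_centroids = {0: (21.5, 76.0), 2: (12.8, 78.2)}
zone_names = name_zones_by_latitude(
    example_centroids,
    ["Southern Delivery Zone", "Central Delivery Zone"],
)
print(zone_names)  # {2: 'Southern Delivery Zone', 0: 'Central Delivery Zone'}
```

The resulting dictionary has the same shape as cluster_labels, so it could be passed to .map() in the same way.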

Summary

So, geospatial clustering is unsupervised learning applied to latitude and longitude data, with the goal of finding meaningful groupings or patterns in spatial data to make location-based decisions smarter. In this article, we computed delivery distances, mapped the delivery locations, grouped them with K-Means, and turned the valid clusters into named delivery zones. I hope you liked this article on geospatial clustering using Python. Feel free to ask questions in the comments section below. You can follow me on Instagram for many more resources.

Aman Kharwal

AI/ML Engineer | Published Author. My aim is to decode data science for the real world in the most simple words.