Data Collection with an API using Python

APIs, or Application Programming Interfaces, are sets of protocols and tools that allow different software applications to communicate with each other. Data collection using APIs involves accessing these interfaces provided by online services, platforms, or data providers to retrieve structured data. Using an API for data collection is a powerful way to obtain real-time or historical data for data analysis, machine learning models, or any data-driven application. So, if you want to learn how to create a dataset by collecting data with an API, this article is for you. In this article, I’ll take you through the task of Data Collection with an API using Python.

Data Collection with an API

Most of the time, a data engineer is responsible for working with APIs to collect data and create datasets according to the needs of the business. Below is the process you can follow for the task of data collection with an API:

  1. Clearly outline what data is needed, the purpose of the data collection, and how it will be used in your analysis or modelling.
  2. Read API documentation to know what data you can get, in what format you can get the data, and how you can get it.
  3. Register or sign up to use the API, if necessary, to obtain API keys.
  4. Use programming languages that support HTTP requests, like Python with libraries such as requests or urllib for making API calls.
  5. Develop a script that makes requests to the API endpoints you identified. Handle pagination and iterate over pages of data if the API splits the data across multiple responses.
  6. Code the script to parse the received data (usually in JSON or XML format) and convert it into a usable format like a DataFrame in Python using Pandas.
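Steps 5 and 6 can be sketched in a few lines. The example below is a minimal sketch that is not tied to any real service: it assumes a hypothetical API whose JSON responses carry an `items` list and a `next` URL (None on the last page), and `fetch_page` is a stand-in for whatever function performs the HTTP request and decodes the JSON.

```python
def collect_all_items(first_url, fetch_page):
    """Follow 'next' links until the API reports no further pages.

    fetch_page is any callable that takes a URL and returns the decoded
    JSON response as a dict (assumed shape: {'items': [...], 'next': url or None}).
    """
    items = []
    url = first_url
    while url:
        page = fetch_page(url)
        items.extend(page.get('items', []))
        url = page.get('next')
    return items


def to_rows(items, fields):
    """Keep only the chosen fields so every row has the same columns."""
    return [{field: item.get(field) for field in fields} for item in items]


# Stand-in for a real HTTP call: two made-up pages keyed by URL.
FAKE_PAGES = {
    'page1': {'items': [{'id': 1, 'name': 'a', 'extra': 'x'}], 'next': 'page2'},
    'page2': {'items': [{'id': 2, 'name': 'b'}], 'next': None},
}

rows = to_rows(collect_all_items('page1', FAKE_PAGES.get), ['id', 'name'])
print(rows)  # [{'id': 1, 'name': 'a'}, {'id': 2, 'name': 'b'}]
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame(rows)` to get a DataFrame.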

So, in this article, I will be using the Spotify API to collect real-time music data from Spotify and create a dataset of tracks with their audio features and popularity.

Data Collection with Spotify API using Python

So, to collect data from the Spotify API, you first need to know what data you can get and in what format the data comes. You can learn everything about it from the Spotify Web API documentation.

Now, follow the process mentioned below to sign up for using the API for data collection:

  1. Create a Spotify Developer account at Spotify for Developers.
  2. Go to Create an App and get your Client ID and Client Secret.
  3. If it asks for a website, feel free to use statso.io if you don’t have a website.

Feel free to reach me on Instagram or LinkedIn if you face any errors in the signup process.

Now, let’s start with the data collection task with the Spotify API. I’ll first write code to authenticate with the Spotify API and obtain an access token using the Client Credentials Flow:

import requests
import base64

# replace with your own client id and client secret
CLIENT_ID = 'Your Client ID'
CLIENT_SECRET = 'Your Client Secret'

# Base64 encode the client id and client secret
client_credentials = f"{CLIENT_ID}:{CLIENT_SECRET}"
client_credentials_base64 = base64.b64encode(client_credentials.encode())

# request the access token
token_url = 'https://accounts.spotify.com/api/token'
headers = {
    'Authorization': f'Basic {client_credentials_base64.decode()}'
}
data = {
    'grant_type': 'client_credentials'
}
response = requests.post(token_url, data=data, headers=headers)

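if response.status_code == 200:
    access_token = response.json()['access_token']
    print("Access token obtained successfully.")
else:
    # stop here and report the HTTP status so the cause is easier to diagnose
    raise SystemExit(f"Error obtaining access token (HTTP {response.status_code}).")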

The access token obtained is crucial as it is used in subsequent requests to the Spotify API to authenticate and authorize those requests. Without this token, your application will not be able to interact with Spotify’s data and services under the Client Credentials flow. This flow is specifically used for server-to-server interactions where no user authorization is required, which is suitable for accessing publicly available information such as music data, playlists, etc.
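To make the role of the token concrete, here is a minimal sketch of how it is attached to a later request. It builds (but does not send) a request for a playlist using only the standard library. The helper name `build_playlist_request` is my own, and the token value is a placeholder.

```python
from urllib.request import Request

API_BASE = 'https://api.spotify.com/v1'


def build_playlist_request(playlist_id, access_token):
    """Prepare a GET request for a playlist, authenticated with a Bearer token."""
    url = f'{API_BASE}/playlists/{playlist_id}'
    # the token travels in the Authorization header of every API request
    headers = {'Authorization': f'Bearer {access_token}'}
    return Request(url, headers=headers, method='GET')


# placeholder token for illustration; use the access_token obtained above
request = build_playlist_request('1gfWsOG1WAoxNeUMMktZbq', 'PLACEHOLDER_TOKEN')
print(request.full_url)
print(request.get_header('Authorization'))
```

Spotipy sends the same kind of header for you when you pass `auth=access_token`. The token response also contains an `expires_in` field, so a long-running script needs to request a new token once the old one lapses.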

Now, install Spotipy, a lightweight open-source Python library for the Spotify Web API. You can install it in your Python environment by executing the command below in your terminal or command prompt:

  • pip install spotipy

Now, I’ll write a Python function to extract detailed information about each track in any Spotify playlist:

import pandas as pd
import spotipy

def get_trending_playlist_data(playlist_id, access_token):
    # set up spotipy with the access token
    sp = spotipy.Spotify(auth=access_token)

    # get the tracks from the playlist
    playlist_tracks = sp.playlist_tracks(playlist_id, fields='items(track(id,name,artists,album(id,name)))')

    # extract relevant information and store in a list of dictionaries
    music_data = []
    for item in playlist_tracks['items']:
        track = item.get('track')
        # skip entries without a usable track (for example, removed or local tracks)
        if not track or not track.get('id'):
            continue

        track_name = track['name']
        artists = ', '.join([artist['name'] for artist in track['artists']])
        album_name = track['album']['name']
        album_id = track['album']['id']
        track_id = track['id']

        # get audio features for the track (an empty dict if they cannot be fetched)
        try:
            audio_features = sp.audio_features(track_id)[0] or {}
        except Exception:
            audio_features = {}

        # get release date of the album
        try:
            release_date = sp.album(album_id)['release_date'] if album_id else None
        except Exception:
            release_date = None

        # get popularity and other details of the track
        # (kept in its own variable so the loop variable is not overwritten)
        try:
            track_details = sp.track(track_id)
        except Exception:
            track_details = {}

        # add additional track information to the track data
        track_data = {
            'Track Name': track_name,
            'Artists': artists,
            'Album Name': album_name,
            'Album ID': album_id,
            'Track ID': track_id,
            'Popularity': track_details.get('popularity'),
            'Release Date': release_date,
            'Duration (ms)': audio_features.get('duration_ms'),
            'Explicit': track_details.get('explicit'),
            'External URLs': track_details.get('external_urls', {}).get('spotify'),
            'Danceability': audio_features.get('danceability'),
            'Energy': audio_features.get('energy'),
            'Key': audio_features.get('key'),
            'Loudness': audio_features.get('loudness'),
            'Mode': audio_features.get('mode'),
            'Speechiness': audio_features.get('speechiness'),
            'Acousticness': audio_features.get('acousticness'),
            'Instrumentalness': audio_features.get('instrumentalness'),
            'Liveness': audio_features.get('liveness'),
            'Valence': audio_features.get('valence'),
            'Tempo': audio_features.get('tempo'),
            # add more attributes as needed (go through the documentation to know what more you can add)
        }

        music_data.append(track_data)

    # create a pandas dataframe from the list of dictionaries
    df = pd.DataFrame(music_data)

    return df

Now, let’s call our function get_trending_playlist_data with a specific Spotify playlist ID and the access token obtained earlier:

# you can add the playlist id of any playlist on Spotify here
playlist_id = '1gfWsOG1WAoxNeUMMktZbq'

# call the function to get the music data from the playlist and store it in a DataFrame
music_df = get_trending_playlist_data(playlist_id, access_token)

print(music_df)
    Track Name                                   \
0   Bijlee Bijlee
1   Expert Jatt
2   Kaun Nachdi (From "Sonu Ke Titu Ki Sweety")
3   Na Na Na Na
4   Patiala Peg
..  ...
95
96  Move Your Lakk
97  Patola (From "Blackmail")
98  Ban Ja Rani (From "Tumhari Sulu")
99  Hauli Hauli (From "De De Pyaar De")

    Artists                                  \
0   Harrdy Sandhu
1   Nawab
2   Guru Randhawa, Neeti Mohan
3   J Star
4   Diljit Dosanjh
..  ...
95
96  Diljit Dosanjh, Badshah, Sonakshi Sinha
97  Guru Randhawa, Preet Hundal
98  Guru Randhawa
99  Garry Sandhu, Neha Kakkar, Mellow D

    Album Name                         Album ID                \
0   Bijlee Bijlee                      3tG0IGB24sRhGFLs5F1Km8
1   Expert Jatt                        2gibg5SCTep0wsIMefGzkd
2   High Rated Gabru - Guru Randhawa   6EDbwGsQNQRLf73c7QwZ2f
3   Na Na Na Na                        4xBqgoiRSOMU1VlKuntVQW
4   Do Gabru - Diljit Dosanjh & Akhil  1uxDllRe9CPhdr8rhz2QCZ
..  ...                                ...
95                                     2jw92hf4mnISbYywvU3Anj
96  Move Your Lakk                     0V06TMGQQQkvKxNmFlKyEj
97  Patola (From "Blackmail")          2XAAIDEpPb57NsKgAHLGVQ
98  High Rated Gabru - Guru Randhawa   6EDbwGsQNQRLf73c7QwZ2f
99  Dance Syndrome                     6e1XB070vlPaxGDAsi8AF6

    Track ID                Popularity  Release Date  Duration (ms)  Explicit  \
0   1iZLpuGMr4tn1F5bZu32Kb  70          2021-10-30    168450         False
1   7rr6n1NFIcQXCsi43P0YNl  65          2018-01-18    199535         False
2   3s7m0jmCXGcM8tmlvjCvAa  64          2019-03-02    183373         False
3   5GjxbFTZAMhrVfVrNrrwrG  52          2015          209730         False
4   6TikcWOLRsPq66GBx2jk67  46          2018-07-10    188314         False
..  ...                     ...         ...           ...            ...
95  3OZr3vo7SmYpn5XqeQEAOM  0           0000          203207         False
96  3aYMKdSitJeHUCZO8Wt6fw  51          2017-03-29    194568         False
97  17LZzRCY0iFWlDDuAG7BlM  57          2018-03-05    184410         False
98  7cQtGVoPCK9DlspeYjdHOA  60          2019-03-02    225938         False
99  4XyKoSEWrkQjI4AekJYWNx  39          2019-09-03    209393         False

    External URLs                                      ...  Energy  Key  \
0   https://open.spotify.com/track/1iZLpuGMr4tn1F5...  ...  0.670   1
1   https://open.spotify.com/track/7rr6n1NFIcQXCsi...  ...  0.948   6
2   https://open.spotify.com/track/3s7m0jmCXGcM8tm...  ...  0.830   4
3   https://open.spotify.com/track/5GjxbFTZAMhrVfV...  ...  0.863   3
4   https://open.spotify.com/track/6TikcWOLRsPq66G...  ...  0.811   5
..  ...                                                ...  ...     ...
95  https://open.spotify.com/track/3OZr3vo7SmYpn5X...  ...  0.842   6
96  https://open.spotify.com/track/3aYMKdSitJeHUCZ...  ...  0.816   2
97  https://open.spotify.com/track/17LZzRCY0iFWlDD...  ...  0.901   3
98  https://open.spotify.com/track/7cQtGVoPCK9Dlsp...  ...  0.692   9
99  https://open.spotify.com/track/4XyKoSEWrkQjI4A...  ...  0.982   1

    Loudness  Mode  Speechiness  Acousticness  Instrumentalness  Liveness  \
0   -5.313    0     0.1430       0.26900       0.000000          0.0733
1   -2.816    0     0.1990       0.29800       0.000000          0.0784
2   -3.981    0     0.0455       0.03570       0.000000          0.0419
3   -3.760    1     0.0413       0.37600       0.000014          0.0916
4   -3.253    0     0.1840       0.02590       0.000000          0.3110
..  ...       ...   ...          ...           ...               ...
95  -4.109    1     0.0745       0.00814       0.000013          0.2120
96  -5.595    1     0.1480       0.03790       0.000153          0.1230
97  -2.719    0     0.0508       0.12600       0.000000          0.0692
98  -4.718    0     0.0577       0.20600       0.000000          0.1240
99  -3.376    1     0.0788       0.02120       0.000032          0.3370

    Valence  Tempo
0   0.643    100.004
1   0.647    172.038
2   0.753    127.999
3   0.807    95.000
4   0.835    175.910
..  ...      ...
95  0.915    156.051
96  0.744    99.992
97  0.914    87.998
98  0.487    102.015
99  0.571    94.990

[100 rows x 21 columns]
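
A single playlist request returns at most 100 items, so a longer playlist needs several requests. The sketch below pages through a playlist with the limit and offset parameters of Spotipy's playlist_items method. The helper name get_all_playlist_items is hypothetical, and sp is assumed to be a spotipy.Spotify client created as in the function above. A tiny fake client with made-up tracks stands in for Spotipy here so the example runs without network access.

```python
def get_all_playlist_items(sp, playlist_id, page_size=100):
    """Collect every item of a playlist by requesting it page by page."""
    items = []
    offset = 0
    while True:
        page = sp.playlist_items(
            playlist_id,
            fields='items(track(id,name))',
            limit=page_size,
            offset=offset,
        )
        batch = page.get('items', [])
        items.extend(batch)
        # a short (or empty) page means the end of the playlist was reached
        if len(batch) < page_size:
            break
        offset += page_size
    return items


class FakeSpotify:
    """Stand-in for spotipy.Spotify that serves made-up tracks."""

    def __init__(self, total=250):
        self.tracks = [{'track': {'id': str(i), 'name': f'Track {i}'}} for i in range(total)]

    def playlist_items(self, playlist_id, fields=None, limit=100, offset=0):
        return {'items': self.tracks[offset:offset + limit]}


all_items = get_all_playlist_items(FakeSpotify(), 'any-playlist-id')
print(len(all_items))  # 250
```

With a real client, the returned items can be processed in the same loop that get_trending_playlist_data uses for a single page.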

To get the playlist ID of any other playlist on Spotify, copy the link to the playlist. The playlist ID is the part of the URL that comes after /playlist/ and before any ? character. For example, in https://open.spotify.com/playlist/1gfWsOG1WAoxNeUMMktZbq the playlist ID is 1gfWsOG1WAoxNeUMMktZbq.
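
If you collect data from many playlists, you can pull the ID out of the link in code instead of copying it by hand. This is a minimal sketch using the standard library; the function name extract_playlist_id is my own, and the ?si=... value in the example link is made up.

```python
from urllib.parse import urlparse


def extract_playlist_id(playlist_url):
    """Return the ID that follows '/playlist/' in a Spotify playlist link."""
    path_parts = urlparse(playlist_url).path.strip('/').split('/')
    if 'playlist' in path_parts:
        index = path_parts.index('playlist')
        if index + 1 < len(path_parts):
            return path_parts[index + 1]
    raise ValueError(f'Not a playlist URL: {playlist_url}')


# the query string after "?" is not part of the ID and is ignored
url = 'https://open.spotify.com/playlist/1gfWsOG1WAoxNeUMMktZbq?si=example'
print(extract_playlist_id(url))  # 1gfWsOG1WAoxNeUMMktZbq
```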

Now, here’s how you can add this data to a CSV file:

music_df.to_csv("musicdata.csv")

Interacting with other APIs follows the same process: reading the documentation thoroughly is half of the work, and writing the data collection script is the other half.

Summary

So, this is how you can collect data from an API using Python. Using an API for data collection is a powerful way to obtain real-time or historical data for data analysis, machine learning models, or any data-driven application.

I hope you liked this article on Data Collection with an API using Python. Feel free to ask valuable questions in the comments section below. You can follow me on Instagram for many more resources.

Aman Kharwal

AI/ML Engineer | Published Author. My aim is to decode data science for the real world in the most simple words.


2 Comments

  1. I have a doubt: can we use a similar technique to gather data for other applications or apps?

    • This is an example of collecting data from Spotify. To collect data through other APIs, you first need to read the documentation and understand what features you can extract and what functions and methods the API provides. Yes, the logic to collect and create the data remains the same; only the part that interacts with the API changes from one API to another.

      Some other APIs you should try are:

      DMRC’s API to collect Delhi Metro data
      Reddit API
