Music Popularity Prediction with Python

Music popularity prediction involves developing machine learning models to estimate the popularity of music tracks based on their audio features. Predicting the popularity of music can help music streaming platforms understand user preferences, optimize playlists and enhance recommendation systems to improve user engagement and satisfaction. So, if you want to learn how to train a Machine Learning model for music popularity prediction, this article is for you. In this article, I’ll take you through the task of music popularity prediction with Machine Learning using Python.

Music Popularity Prediction: Overview

Music popularity prediction means using regression techniques to forecast the popularity of songs from their music features and metadata. A good model estimates how a song is likely to perform in terms of streams, downloads, and chart positions, which helps music producers, artists, and marketers make informed decisions.

To get started with music popularity prediction, we need a dataset of various songs with their musical features and historical data on how much popularity the songs got. I found an ideal dataset for this task which includes 227 music tracks, each described by their music features along with additional metadata like track name, artists, album name, and release date. You can download the dataset from here.

Music Popularity Prediction with Python

Now, let’s get started with the task of music popularity prediction by importing the necessary Python libraries and the dataset:

import pandas as pd

spotify_data = pd.read_csv("Spotify_data.csv")

print(spotify_data.head())
   Unnamed: 0                  Track Name  \
0           0                 Not Like Us
1           1                     Houdini
2           2  BAND4BAND (feat. Lil Baby)
3           3          I Don't Wanna Wait
4           4                       Pedro

                                   Artists                  Album Name  \
0                           Kendrick Lamar                 Not Like Us
1                                   Eminem                     Houdini
2                    Central Cee, Lil Baby  BAND4BAND (feat. Lil Baby)
3                David Guetta, OneRepublic          I Don't Wanna Wait
4  Jaxomy, Agatino Romero, Raffaella Carrà                       Pedro

                 Album ID                Track ID  Popularity  Release Date  \
0  5JjnoGJyOxfSZUZtk2rRwZ  6AI3ezQ4o3HUoP6Dhudph3          96    2024-05-04
1  6Xuu2z00jxRPZei4IJ9neK  2HYFX63wP3otVIvopRS99Z          94    2024-05-31
2  4AzPr5SUpNF553eC1d3aRy  7iabz12vAuVQYyekFIWJxD          91    2024-05-23
3  0wCLHkBRKcndhMQQpeo8Ji  331l3xABO0HMr1Kkyh2LZq          90    2024-04-05
4  5y6RXjI5VPR0RyInghTbf1  48lxT5qJF0yYyf2z4wB4xW          89    2024-03-29

   Duration (ms)  Explicit  ...  Energy  Key  Loudness  Mode  Speechiness  \
0         274192      True  ...   0.472    1    -7.001     1       0.0776
1         227239      True  ...   0.887    9    -2.760     0       0.0683
2         140733      True  ...   0.764   11    -5.241     1       0.2040
3         149668     False  ...   0.714    1    -4.617     0       0.0309
4         144846     False  ...   0.936    9    -6.294     1       0.3010

   Acousticness  Instrumentalness  Liveness  Valence    Tempo
0        0.0107          0.000000    0.1410    0.214  101.061
1        0.0292          0.000002    0.0582    0.889  127.003
2        0.3590          0.000000    0.1190    0.886  140.113
3        0.0375          0.000000    0.2320    0.554  129.976
4        0.0229          0.000001    0.3110    0.844  151.019

[5 rows x 22 columns]

The dataset has an unnamed column that only repeats the row index, so I’ll drop it and move forward:

spotify_data.drop(columns=['Unnamed: 0'], inplace=True)

Now, let’s have a look at the column info before moving forward:

spotify_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 227 entries, 0 to 226
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   Track Name        227 non-null    object
 1   Artists           227 non-null    object
 2   Album Name        227 non-null    object
 3   Album ID          227 non-null    object
 4   Track ID          227 non-null    object
 5   Popularity        227 non-null    int64
 6   Release Date      227 non-null    object
 7   Duration (ms)     227 non-null    int64
 8   Explicit          227 non-null    bool
 9   External URLs     227 non-null    object
 10  Danceability      227 non-null    float64
 11  Energy            227 non-null    float64
 12  Key               227 non-null    int64
 13  Loudness          227 non-null    float64
 14  Mode              227 non-null    int64
 15  Speechiness       227 non-null    float64
 16  Acousticness      227 non-null    float64
 17  Instrumentalness  227 non-null    float64
 18  Liveness          227 non-null    float64
 19  Valence           227 non-null    float64
 20  Tempo             227 non-null    float64
dtypes: bool(1), float64(9), int64(4), object(7)
memory usage: 35.8+ KB

Now, let’s get started with EDA. As popularity is the target variable, I’ll have a look at the relationship between five key music features (energy, valence, danceability, loudness, and acousticness) and popularity:

import matplotlib.pyplot as plt
import seaborn as sns
features = ['Energy', 'Valence', 'Danceability', 'Loudness', 'Acousticness']
for feature in features:
    plt.figure(figsize=(8, 5))
    sns.scatterplot(data=spotify_data, x=feature, y='Popularity')
    plt.title(f'Popularity vs {feature}')
    plt.show()
[Figures: scatter plots of Popularity vs Energy, Valence, Danceability, Loudness, and Acousticness]

From these visualizations, we can observe that higher energy levels and danceability tend to correlate positively with higher popularity scores. Conversely, increased acousticness and lower loudness levels generally correspond with lower popularity, suggesting that more energetic and less acoustic tracks are favoured. Valence shows a weaker, less clear relationship with popularity, indicating that the emotional positivity of a track alone doesn’t strongly predict its popularity.

Now, let’s have a look at the correlation between all the features:

numeric_columns = spotify_data.select_dtypes(include=['float64', 'int64']).columns
numeric_data = spotify_data[numeric_columns]

corr_matrix = numeric_data.corr()

plt.figure(figsize=(12, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Matrix')
plt.show()
[Figure: correlation matrix heatmap]

From the above correlation matrix, we can see that popularity has a moderate positive correlation with loudness (0.31) and danceability (0.25), indicating that louder and more danceable tracks tend to be more popular. There is a moderate negative correlation between popularity and acousticness (-0.43), suggesting that tracks with higher acousticness are generally less popular. Energy also has a positive correlation with popularity (0.25).
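Reading individual values off a heatmap can be tedious, so here is a minimal sketch of how the correlations with the target could be ranked directly. The helper name `rank_correlations` and the toy values are assumptions for illustration, not part of the original analysis or the Spotify dataset:

```python
import pandas as pd


def rank_correlations(df: pd.DataFrame, target: str) -> pd.Series:
    """Pearson correlation of every numeric column with `target`,
    ordered from the strongest to the weakest absolute correlation."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# hypothetical toy values, only to make the sketch runnable on its own
toy = pd.DataFrame({
    "Popularity":   [90, 80, 70, 60, 50],
    "Loudness":     [-3.0, -4.0, -5.5, -6.0, -8.0],
    "Acousticness": [0.05, 0.10, 0.30, 0.45, 0.60],
    "Track Name":   ["a", "b", "c", "d", "e"],  # non-numeric, ignored
})

ranked = rank_correlations(toy, "Popularity")
print(ranked)
```

With the article's data, the equivalent call would be `rank_correlations(spotify_data, 'Popularity')`. Sorting by absolute value keeps strong negative correlations, such as acousticness, near the top of the list.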

Now, let’s have a look at the distribution of all the music features:

for feature in features:
    plt.figure(figsize=(8, 5))
    sns.histplot(spotify_data[feature], kde=True)
    plt.title(f'Distribution of {feature}')
    plt.show()
[Figures: histograms showing the distribution of Energy, Valence, Danceability, Loudness, and Acousticness]

The distribution of energy is roughly bell-shaped, which indicates a balanced range of energy levels in the tracks. Valence and danceability also follow a similar distribution, with most tracks having mid-range values, which suggests an even mix of emotionally positive and danceable tracks. Loudness has a near-normal distribution centred around -6 dB, which reflects typical volume levels in the dataset. Acousticness, however, is skewed towards lower values, indicating that most tracks are not highly acoustic.
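The shapes described above can also be checked numerically. The following is a small sketch that uses the pandas `skew()` method, where values near 0 suggest a roughly symmetric distribution and large positive values suggest a long right tail. The helper name `skew_report` and the toy values are assumptions for illustration only:

```python
import pandas as pd


def skew_report(df: pd.DataFrame, columns: list) -> pd.Series:
    """Sample skewness of the given columns, sorted ascending."""
    return df[columns].skew().sort_values()


# hypothetical toy values, not taken from the Spotify dataset
toy = pd.DataFrame({
    "Energy":       [0.40, 0.50, 0.60, 0.60, 0.70, 0.80],  # symmetric
    "Acousticness": [0.01, 0.02, 0.02, 0.03, 0.05, 0.80],  # right-skewed
})

report = skew_report(toy, ["Energy", "Acousticness"])
print(report)
```

With the article's data, the call would be `skew_report(spotify_data, features)`.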

Feature Selection and Model Training

Based on the correlation analysis and visualizations, the following features can be used to train a music popularity prediction model. The first five were explored in the plots above, while tempo, speechiness, and liveness are additional audio features from the dataset:

  • Energy
  • Valence
  • Danceability
  • Loudness
  • Acousticness
  • Tempo
  • Speechiness
  • Liveness

These features capture various audio characteristics that influence the popularity of music tracks.

The next step is to train a Machine Learning model to predict the popularity of music using the features we have selected. So, let’s split and scale the data and then train the model using the random forest regression algorithm, tuning its hyperparameters with a grid search:

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# select the features and target variable
features = ['Energy', 'Valence', 'Danceability', 'Loudness', 'Acousticness', 'Tempo', 'Speechiness', 'Liveness']
X = spotify_data[features]
y = spotify_data['Popularity']

# hold out 20% of the tracks for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# normalize the features (the scaler is fitted on the training split only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# define the parameter grid for Random Forest
# ('auto' was removed from max_features in scikit-learn 1.3;
# 1.0 uses all features, which is what 'auto' meant for regressors)
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_features': [1.0, 'sqrt', 'log2'],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# search the grid with 5-fold cross-validation and refit the best model
grid_search_rf = GridSearchCV(RandomForestRegressor(random_state=42), param_grid, refit=True, verbose=2, cv=5)
grid_search_rf.fit(X_train_scaled, y_train)

# best hyperparameters and the corresponding fitted model
best_params_rf = grid_search_rf.best_params_
best_rf_model = grid_search_rf.best_estimator_

# predict popularity for the held-out test set
y_pred_best_rf = best_rf_model.predict(X_test_scaled)

Note: I selected the random forest algorithm after going through various algorithms. The random forest algorithm resulted in better performance in comparison to the other algorithms after hyperparameter tuning.
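The comparison itself is not shown in this article, so here is one possible way such a comparison could be run, using cross-validated R² scores. The candidate models, the helper name `compare_models`, and the synthetic data from `make_regression` are all assumptions for illustration and are not the algorithms or results from the original experiment:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def compare_models(X, y, cv=5):
    """Mean cross-validated R^2 for a few candidate regressors."""
    candidates = {
        # the pipelines refit the scaler inside every CV fold
        "linear": make_pipeline(StandardScaler(), LinearRegression()),
        "ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
        "random_forest": RandomForestRegressor(n_estimators=100, random_state=42),
    }
    return {
        name: cross_val_score(model, X, y, cv=cv, scoring="r2").mean()
        for name, model in candidates.items()
    }


# synthetic stand-in data; replace with X_train and y_train from the article
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=42)

scores = compare_models(X_demo, y_demo)
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: mean R^2 = {score:.3f}")
```

Which model comes out on top depends on the data. The synthetic data here is linear by construction, so it says nothing about which algorithm suits the Spotify dataset.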

Now, let’s have a look at the actual vs predicted results of the test data:

# make predictions
y_pred_best_rf = best_rf_model.predict(X_test_scaled)

plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred_best_rf, alpha=0.7)
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='red', linewidth=2)
plt.xlabel('Actual Popularity')
plt.ylabel('Predicted Popularity')
plt.title('Actual vs Predicted Popularity (Best Random Forest Model)')
plt.show()
[Figure: scatter plot of actual vs predicted popularity]

The red line represents perfect predictions, where the predicted popularity would exactly match the actual popularity. Most of the points are clustered around this line, which indicates that the model is making reasonably accurate predictions. However, there are some deviations, particularly at lower popularity values, which suggest areas where the model’s predictions are less precise.
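A scatter plot gives a visual impression, and numeric metrics can complement it. Below is a minimal sketch that computes MAE, RMSE, and R² with scikit-learn. The helper name `regression_report` and the four toy values are assumptions for illustration and are not results from the model above:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_report(y_true, y_pred):
    """Return MAE, RMSE and R^2 for a set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "mae": mean_absolute_error(y_true, y_pred),
        "rmse": float(np.sqrt(mse)),
        "r2": r2_score(y_true, y_pred),
    }


# hypothetical values, only to make the sketch runnable on its own
actual = [90, 80, 70, 60]
predicted = [85, 80, 75, 60]

report = regression_report(actual, predicted)
print(report)
```

With the article's variables, the call would be `regression_report(y_test, y_pred_best_rf)`. MAE and RMSE are in the same units as the popularity score, which makes them easy to interpret alongside the plot.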

Summary

So, this is how we can train a Machine Learning model for the task of Music popularity prediction with Python. Predicting the popularity of music can help music streaming platforms understand user preferences, optimize playlists and enhance recommendation systems to improve user engagement and satisfaction.

I hope you liked this article on music popularity prediction with Machine Learning using Python.

Aman Kharwal
AI/ML Engineer | Published Author. My aim is to decode data science for the real world in the most simple words.
