Audio data is ubiquitous today, from music streaming platforms to virtual assistants. Analyzing and processing audio requires a solid understanding of data manipulation and visualization techniques. So, if you want to know about audio data processing and analysis, this article is for you. In this article, I’ll take you through the task of audio data processing and analysis with Python.
Audio Data Processing and Analysis with Python
The audio data I will use for this task is the NSynth (Neural Synthesizer) dataset, created by Google, which is a large-scale dataset for audio synthesis research. It consists of over 300,000 musical notes, each with a unique combination of instrument, pitch, and timbre. The dataset includes recordings from a diverse range of instruments, categorized into families like strings, brass, and mallets. Each audio sample is labelled with metadata, such as pitch, velocity, and instrument family.
So, let’s get started with the task of audio data processing and analysis by importing the dataset:
import tensorflow as tf
import tensorflow_datasets as tfds
# load the NSynth dataset
dataset, info = tfds.load('nsynth', split='train', with_info=True)
print(info)
Before diving into analysis, it’s crucial to understand the structure of the dataset. So, let’s have a quick look at the keys in the dataset:
# inspect the keys of one sample
for sample in dataset.take(1):
    print("Available keys:")
    for key in sample.keys():
        print(key)
Available keys:
audio
id
instrument
pitch
qualities
velocity
Preprocessing Audio Data
To analyze the dataset, we’ll extract the audio and use the pitch as an alternate label:
# extract audio and an alternate label (e.g., pitch)
def preprocess_nsynth(sample):
    audio = sample['audio']
    label = sample['pitch']  # use pitch as the label
    return audio, label
# Apply preprocessing
processed_dataset = dataset.map(preprocess_nsynth)
# take a single sample
for audio, label in processed_dataset.take(1):
    print(f"Audio Shape: {audio.shape}")
    print(f"Label (Pitch): {label.numpy()}")
Audio Shape: (64000,)
Label (Pitch): 106
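The pitch label is a MIDI note number, so a value like 106 maps to a specific frequency. A small helper sketch for the standard equal-temperament conversion (this helper is not part of the dataset API):

```python
def midi_to_hz(note: int) -> float:
    # equal-temperament conversion with A4 (MIDI note 69) tuned to 440 Hz
    return 440.0 * 2.0 ** ((note - 69) / 12.0)

print(midi_to_hz(106))  # pitch 106 is roughly 3729.31 Hz (A#7)
```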
After preprocessing, you can convert the audio tensor to a NumPy array and play it using the IPython Audio display:
from IPython.display import Audio

audio_np = audio.numpy()
Audio(audio_np, rate=16000)  # NSynth audio is sampled at 16 kHz
Analyzing Audio Data
Now, let’s move to audio data analysis. I’ll start with visualizing the waveform to better understand the audio structure:
import plotly.graph_objects as go
fig = go.Figure()
fig.add_trace(go.Scatter(
    y=audio_np,
    mode='lines',
    line=dict(color='black'),
    name="Waveform"
))
fig.update_layout(
    title="Waveform",
    xaxis_title="Time (samples)",
    yaxis_title="Amplitude",
    template="plotly_white",
    width=800,
    height=400
)
fig.show()
The waveform graph displays a decaying amplitude over time, starting with a high magnitude and gradually tapering off to zero. This indicates that the audio signal begins with a strong onset, followed by a rapid decay in energy. Such behaviour is typical in audio signals like percussive notes or short instrumental sounds, where the initial strike produces high energy that dissipates quickly.
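That decay can be made quantitative by computing a short-time RMS energy envelope. Below is a minimal NumPy sketch; the frame and hop sizes are arbitrary choices, and a synthetic decaying sine stands in for `audio_np`:

```python
import numpy as np

def rms_envelope(signal, frame_length=2048, hop_length=512):
    # root-mean-square energy of successive overlapping frames
    n_frames = 1 + (len(signal) - frame_length) // hop_length
    return np.array([
        np.sqrt(np.mean(signal[i * hop_length: i * hop_length + frame_length] ** 2))
        for i in range(n_frames)
    ])

# synthetic percussive-like note: a sine with an exponential decay
sr = 16000
t = np.linspace(0, 4, 4 * sr, endpoint=False)
tone = np.sin(2 * np.pi * 400 * t) * np.exp(-2 * t)

env = rms_envelope(tone)
print(env[0], env[-1])  # energy starts high and tapers toward zero
```

A plot of `env` would trace exactly the strong-onset-then-decay shape described above.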
Analyzing the Spectrogram
Now, let’s analyze the spectrogram, which provides a time-frequency representation of audio:
import librosa
import numpy as np
# compute the STFT
spectrogram = librosa.stft(audio_np, n_fft=512, hop_length=256)
spectrogram_db = librosa.amplitude_to_db(abs(spectrogram))
time = np.linspace(0, len(audio_np) / 16000, spectrogram_db.shape[1])
frequencies = np.linspace(0, 16000 / 2, spectrogram_db.shape[0])
fig = go.Figure(data=go.Heatmap(
    z=spectrogram_db,
    x=time,
    y=frequencies,
    colorscale='Viridis',
    colorbar=dict(title='Amplitude (dB)')
))
fig.update_layout(
    title="Spectrogram",
    xaxis_title="Time (seconds)",
    yaxis_title="Frequency (Hz)",
    yaxis=dict(type="log"),
    template="plotly"
)
fig.show()
The spectrogram reveals that the audio signal primarily contains a single prominent frequency component around 400 Hz, which remains consistent throughout its duration. The amplitude of this frequency is high, as indicated by the bright colour, while other frequencies show minimal or no energy. Additionally, there are faint low-frequency components near the start of the signal, which suggests a brief presence of lower-pitched content. This pattern suggests a sustained note, likely from a single instrument, with little harmonic variation or timbral complexity.
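A dominant-frequency reading like this can be cross-checked by picking the peak of the magnitude spectrum. A minimal NumPy sketch, with a synthetic 400 Hz tone standing in for `audio_np`:

```python
import numpy as np

def dominant_frequency(signal, sr):
    # peak bin of the real-FFT magnitude spectrum, mapped back to Hz
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    return freqs[np.argmax(spectrum)]

sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
tone = np.sin(2 * np.pi * 400 * t)
print(dominant_frequency(tone, sr))  # → 400.0
```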
Analyzing Instrument Distribution
Now, let’s understand the distribution of instrument families in the dataset:
from collections import Counter
# count instrument occurrences
instrument_counts = Counter()
for sample in dataset.take(1000):
    instrument = sample['instrument']['family'].numpy()
    instrument_counts[instrument] += 1
# map numeric IDs to instrument family names
instrument_families = ["Bass", "Brass", "Flute", "Guitar", "Keyboard", "Mallet", "Organ", "Reed", "String", "Synth Lead", "Synth Pad", "Vocal"]
mapped_family_counts = {instrument_families[family_id]: count for family_id, count in instrument_counts.items()}
import plotly.express as px
fig = px.bar(
    x=list(mapped_family_counts.keys()),
    y=list(mapped_family_counts.values()),
    labels={'x': 'Instrument Family', 'y': 'Count'},
    title="Distribution of Instrument Families",
    template="plotly"
)
fig.show()
The Bass family has the highest count, standing out significantly with around 250 occurrences, followed by the Keyboard and Mallet families, which have moderate representation. In contrast, instrument families like Synth Lead, Flute, and Brass have the lowest counts, which indicates a smaller presence in the dataset. This distribution suggests that the dataset emphasizes bass and keyboard instruments, while certain families are underrepresented, which could impact tasks like classification or model training.
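One common mitigation when training a classifier on such a skewed sample is inverse-frequency class weighting. A small sketch of the idea (the counts below are illustrative, not the actual dataset figures):

```python
# hypothetical counts per instrument family, mirroring the skew seen above
counts = {"Bass": 250, "Keyboard": 120, "Mallet": 100, "Flute": 20, "Brass": 15}

total = sum(counts.values())
n_classes = len(counts)

# weight each class by total / (n_classes * count): rare classes get larger weights
class_weights = {family: total / (n_classes * c) for family, c in counts.items()}
print(class_weights)
```

A dictionary like this can be passed to Keras via the `class_weight` argument of `model.fit` (keyed by numeric class ID rather than name).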
Mel Spectrogram Analysis
Now, let’s analyze the Mel spectrogram, which translates audio frequencies into the Mel scale, to simulate human perception of sound:
mel_spectrogram = librosa.feature.melspectrogram(y=audio_np, sr=16000, n_mels=128, hop_length=256)
mel_spectrogram_db = librosa.power_to_db(mel_spectrogram, ref=np.max)
fig = go.Figure(data=go.Heatmap(
    z=mel_spectrogram_db,
    x=time,
    # mel bins are not linearly spaced in Hz, so use their actual center frequencies
    y=librosa.mel_frequencies(n_mels=128, fmax=8000),
    colorscale='Viridis',
    colorbar=dict(title="Amplitude (dB)")
))
fig.show()
The Mel spectrogram highlights that the audio signal has a prominent frequency component at 6,000 Hz, which remains sustained throughout the clip. This strong frequency presence, represented by the yellow-green band, indicates a stable tone with high energy concentrated at this frequency. Lower frequencies below 1,000 Hz display much weaker energy, which suggests minimal low-pitched content.
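The mel scale itself is just a formula: in the common HTK variant, mel = 2595 · log10(1 + f/700), roughly linear below 1 kHz and logarithmic above. A quick sketch (note that librosa defaults to the Slaney variant, so its numbers differ slightly):

```python
import math

def hz_to_mel(f):
    # HTK-style mel scale: approximately linear below 1 kHz, logarithmic above
    return 2595.0 * math.log10(1.0 + f / 700.0)

print(hz_to_mel(1000))  # ~1000 mel by construction
print(hz_to_mel(6000))  # high frequencies are compressed onto fewer mels
```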
MFCC Analysis
Now, let’s analyze the MFCCs (Mel-Frequency Cepstral Coefficients), which are commonly used in audio feature extraction:
# use the same hop_length as the spectrogram so the time axis matches
mfccs = librosa.feature.mfcc(y=audio_np, sr=16000, n_mfcc=13, hop_length=256)
fig = go.Figure(data=go.Heatmap(
    z=mfccs,
    x=time,
    y=np.arange(1, mfccs.shape[0] + 1),
    colorscale='Viridis',
    colorbar=dict(title="MFCC Value")
))
fig.show()
The MFCC graph highlights the spectral features of the audio signal over time. The first MFCC coefficient (bottom row) shows a significantly lower magnitude compared to the others, indicating it captures the signal’s overall energy. The remaining coefficients exhibit relatively uniform values across time, which suggests the audio signal has a stable frequency content without major timbral variations.
Transforming Audio Data
Now, let’s see how to apply audio transformations like pitch shifting and time stretching:
# apply pitch shift (+2 semitones)
audio_pitch_shifted = librosa.effects.pitch_shift(audio_np, sr=16000, n_steps=2)
# apply time stretching (speed up by 1.5x)
audio_time_stretched = librosa.effects.time_stretch(audio_np, rate=1.5)
# plot all three waveforms
fig = go.Figure()
fig.add_trace(go.Scatter(y=audio_np, mode='lines', name='Original'))
fig.add_trace(go.Scatter(y=audio_pitch_shifted, mode='lines', name='Pitch Shifted'))
fig.add_trace(go.Scatter(y=audio_time_stretched, mode='lines', name='Time Stretched'))
fig.show()

The original waveform (blue) maintains its natural decay, while the pitch-shifted version (red) closely follows the same shape but with slight variations due to the pitch adjustment. The time-stretched version (green) has a broader waveform, indicating a slower playback speed. These transformations highlight the ability to manipulate audio for pitch and duration while preserving its overall structure, which is essential for tasks like audio augmentation and synthesis.
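Pitch shifting and time stretching are often combined with other augmentations, such as additive noise at a controlled signal-to-noise ratio. A self-contained NumPy sketch of that idea (a synthetic tone stands in for `audio_np`):

```python
import numpy as np

def add_noise(signal, snr_db, seed=None):
    # mix in white noise scaled so the result has the requested SNR in dB
    rng = np.random.default_rng(seed)
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise

sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
clean = np.sin(2 * np.pi * 400 * t)
noisy = add_noise(clean, snr_db=20, seed=0)
print(noisy.shape)  # same length as the input: (16000,)
```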
Summary
In this article, we explored the fundamentals of audio data processing and analysis using the NSynth dataset. From visualizing waveforms and spectrograms to understanding instrument distributions and extracting features like MFCCs, we uncovered valuable insights into audio structures and frequencies. Additionally, we applied transformations like pitch shifting and time stretching to demonstrate audio augmentation techniques.
I hope you liked this article on audio data processing and analysis with Python. Feel free to ask valuable questions in the comments section below. You can follow me on Instagram for many more resources.





