Time series data is a sequence of data points collected or recorded at regular time intervals. Some common examples of time series include daily stock prices, monthly weather data, and yearly sales figures. If you want to learn how to work with time series data, this article is for you. In this article, I’ll take you through a complete guide to Time Series Data Manipulation using Python.
Time Series Data Manipulation using Python
Manipulating time series data involves several steps, including loading the data, parsing time information, handling missing values, resampling for different time periods, and potentially transforming or aggregating the data.
Let’s get started with Time Series Data Manipulation using Python by importing a time series dataset:
import pandas as pd
# Load the dataset
time_series_data = pd.read_csv("INR-USD.csv")
# Display the first few rows of the dataset
print(time_series_data.head())
Date Open High Low Close Adj Close Volume
0 2003-12-01 45.709000 45.728001 45.449001 45.480000 45.480000 0.0
1 2003-12-08 45.474998 45.507999 45.352001 45.451000 45.451000 0.0
2 2003-12-15 45.450001 45.500000 45.332001 45.455002 45.455002 0.0
3 2003-12-22 45.417000 45.549000 45.296001 45.507999 45.507999 0.0
4 2003-12-29 45.439999 45.645000 45.421001 45.560001 45.560001 0.0
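As a side note, pandas can also parse dates while loading: the parse_dates and index_col arguments of read_csv combine the loading and parsing steps into one call. A minimal sketch on a hypothetical two-row CSV (not the real INR-USD file):

```python
import pandas as pd
from io import StringIO

# Hypothetical two-row CSV standing in for INR-USD.csv
csv_text = """Date,Close
2003-12-01,45.48
2003-12-08,45.45"""

# parse_dates converts 'Date' to datetime at load time;
# index_col makes it the index in the same call
df = pd.read_csv(StringIO(csv_text), parse_dates=["Date"], index_col="Date")
print(df.index.dtype)  # datetime64[ns]
```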
The dataset consists of columns like ‘Date’, ‘Open’, ‘High’, ‘Low’, ‘Close’, ‘Adj Close’, and ‘Volume’. These columns are typical in financial time series datasets, where:
- Date: Represents the time at which the data was recorded.
- Open: The price at the beginning of the trading period.
- High: The highest price during the trading period.
- Low: The lowest price during the trading period.
- Close: The price at the end of the trading period.
- Adj Close: The closing price after adjustments for all applicable splits and dividend distributions.
- Volume: Trade Volume.
The next step is to ensure that the ‘Date’ column is correctly recognized as a datetime type. This is crucial for time series analysis, as it allows the data to be manipulated and queried based on time. Let’s parse the ‘Date’ column and set it as the index of the DataFrame:
# Parsing the 'Date' column and setting it as the index
time_series_data['Date'] = pd.to_datetime(time_series_data['Date'])
time_series_data.set_index('Date', inplace=True)
# Display the DataFrame with the parsed dates
print(time_series_data.head())
Open High Low Close Adj Close Volume
Date
2003-12-01 45.709000 45.728001 45.449001 45.480000 45.480000 0.0
2003-12-08 45.474998 45.507999 45.352001 45.451000 45.451000 0.0
2003-12-15 45.450001 45.500000 45.332001 45.455002 45.455002 0.0
2003-12-22 45.417000 45.549000 45.296001 45.507999 45.507999 0.0
2003-12-29 45.439999 45.645000 45.421001 45.560001 45.560001 0.0
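With a DatetimeIndex in place, pandas supports partial-string indexing: you can select rows by year, by month, or by an explicit date range directly with .loc. A small sketch on a hypothetical miniature series:

```python
import pandas as pd

# Hypothetical miniature series standing in for the INR-USD data
dates = pd.to_datetime(["2003-12-01", "2003-12-08", "2004-01-05", "2004-01-12"])
df = pd.DataFrame({"Close": [45.48, 45.45, 45.60, 45.70]}, index=dates)

# Partial-string indexing: all rows from 2004
rows_2004 = df.loc["2004"]
print(len(rows_2004))  # 2

# An explicit date-range slice (inclusive on both ends)
window = df.loc["2003-12-01":"2004-01-05"]
print(len(window))  # 3
```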
Now that the ‘Date’ column is correctly parsed and set as the index, the next step is to inspect the data for any missing or anomalous values (using descriptive statistics):
# Checking for missing values
missing_values = time_series_data.isnull().sum()
print(missing_values)
Open 3
High 3
Low 3
Close 3
Adj Close 3
Volume 3
dtype: int64
basic_statistics = time_series_data.describe()
print(basic_statistics)
Open High Low Close Adj Close Volume
count 1013.000000 1013.000000 1013.000000 1013.000000 1013.000000 1013.0
mean 58.035208 58.506681 57.654706 58.056509 58.056509 0.0
std 12.614635 12.716632 12.565279 12.657407 12.657407 0.0
min 38.995998 39.334999 38.979000 39.044998 39.044998 0.0
25% 45.508999 45.775002 45.231998 45.498001 45.498001 0.0
50% 59.702999 60.342999 59.209999 59.840000 59.840000 0.0
75% 68.508499 69.099998 68.250000 68.538002 68.538002 0.0
max 82.917999 83.386002 82.563004 82.932999 82.932999 0.0
There are three missing values in each column (the ‘Date’ column, now the index, is complete). The descriptive statistics do not immediately suggest the presence of extreme anomalies or outliers.
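If you want more than a visual scan of the statistics, a simple interquartile-range (IQR) rule of thumb can flag candidate outliers. A sketch on a toy column with hypothetical values:

```python
import pandas as pd

# Toy 'Close' column with hypothetical values, one of them extreme
close = pd.Series([45.5, 45.4, 45.6, 45.5, 82.9])

# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = close.quantile(0.25), close.quantile(0.75)
iqr = q3 - q1
outliers = close[(close < q1 - 1.5 * iqr) | (close > q3 + 1.5 * iqr)]
print(len(outliers))  # 1
```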
There are several strategies to handle missing values, including:
- Dropping Rows: If the dataset is large and the number of rows with missing data is small, we can choose to remove the rows with missing values.
- Imputation: Replace missing values with a substitute value, like the mean, median, or value estimated by a model.
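Both strategies are one-liners in pandas; here is a sketch on a toy column with hypothetical values:

```python
import pandas as pd
import numpy as np

# Toy column with missing values (stand-in for the real dataset)
s = pd.DataFrame({"Close": [45.5, np.nan, 46.0, np.nan, 47.0]})

# Option 1: drop rows with missing values
dropped = s.dropna()
print(len(dropped))  # 3

# Option 2: impute with the column mean
filled = s.fillna(s.mean())
print(filled["Close"].isnull().sum())  # 0
```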
Given the small number of missing values in this dataset, either approach could be suitable. For demonstration, I’ll choose to fill the missing values with the mean of their respective columns. So, let’s proceed with handling the missing values and then move to the next steps of resampling and aggregating the data:
# Handling missing values by imputing with the mean
time_series_data_filled = time_series_data.fillna(time_series_data.mean())

# Verify that there are no more missing values
missing_values_after = time_series_data_filled.isnull().sum()
print(missing_values_after)
Open 0
High 0
Low 0
Close 0
Adj Close 0
Volume 0
dtype: int64
The missing values have been successfully filled with the mean of their respective columns, and there are no more missing values in the dataset.
Resampling and Aggregating
Resampling is a common method in time series analysis, particularly when dealing with financial data. It involves changing the frequency of your time series observations.
For instance, you can aggregate daily data into monthly or yearly data. It is often done for two reasons: to make the dataset more manageable or to observe longer-term trends. There are two types of resampling:
- Downsampling: Decreasing the frequency of the samples, e.g., from days to months.
- Upsampling: Increasing the frequency of the samples, e.g., from months to days.
In each case, you might need to aggregate the data using methods such as taking the mean, sum, or median. For this dataset, let’s demonstrate downsampling by aggregating the weekly data into monthly data and computing the mean for each month.
After resampling, we can also look at specific time periods more closely, if needed. Let’s perform the resampling:
# Resampling the data from weekly to monthly frequency
# (pandas 2.2+ prefers the 'ME' alias over 'M' for month-end)
monthly_resampled_data = time_series_data_filled.resample('M').mean()
print(monthly_resampled_data.head())
Open High Low Close Adj Close Volume
Date
2003-12-31 45.498200 45.586000 45.370001 45.49080 45.49080 0.0
2004-01-31 45.365999 45.433750 45.172999 45.23675 45.23675 0.0
2004-02-29 45.167251 45.285999 45.106251 45.27450 45.27450 0.0
2004-03-31 45.026801 45.073800 44.558600 44.75880 44.75880 0.0
2004-04-30 43.767750 44.049250 43.630000 43.90800 43.90800 0.0
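The code above downsamples; the opposite direction, upsampling, introduces new timestamps whose values start out missing and must be filled, for example by interpolation. A sketch on a hypothetical monthly series:

```python
import pandas as pd

# Hypothetical monthly series (month-start dates) to upsample
idx = pd.date_range("2004-01-01", periods=3, freq="MS")
monthly = pd.Series([45.2, 45.3, 44.8], index=idx)

# Upsample to daily frequency; the new rows start out as NaN
daily = monthly.resample("D").asfreq()
print(len(daily))            # 61
print(daily.isnull().sum())  # 58

# Fill the gaps, e.g. by linear interpolation
daily_filled = daily.interpolate()
print(daily_filled.isnull().sum())  # 0
```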
Feature Extraction from Time Series
Extracting relevant features from time series data is a crucial step, particularly if you’re preparing the data for time series forecasting or understanding underlying patterns.
Feature extraction involves creating new variables that capture important aspects of the time series. These features can be based on the time stamps themselves or derived from the data values. Some common time series features include:
- Date-Time Features: Extract components such as year, month, day of the week, etc., from the date-time index.
- Lag Features: Creating features that represent values from previous time steps.
- Rolling Window Statistics: Calculating statistics (e.g., mean, median, standard deviation) over a moving window of time.
I’ll demonstrate these techniques using the monthly resampled data. Let’s start with extracting date-time features:
# Extracting date-time features
monthly_resampled_data['Year'] = monthly_resampled_data.index.year
monthly_resampled_data['Month'] = monthly_resampled_data.index.month
monthly_resampled_data['Quarter'] = monthly_resampled_data.index.quarter
monthly_resampled_data['WeekOfYear'] = monthly_resampled_data.index.isocalendar().week
# Creating lag features for 'Close' price
monthly_resampled_data['Close_Lag1'] = monthly_resampled_data['Close'].shift(1)
monthly_resampled_data['Close_Lag2'] = monthly_resampled_data['Close'].shift(2)
# Rolling window statistics for 'Close' price (using a window of 3 months)
monthly_resampled_data['Close_Rolling_Mean'] = monthly_resampled_data['Close'].rolling(window=3).mean()
monthly_resampled_data['Close_Rolling_Std'] = monthly_resampled_data['Close'].rolling(window=3).std()
# Displaying the DataFrame with the new features
print(monthly_resampled_data[['Year', 'Month', 'Quarter', 'WeekOfYear',
'Close_Lag1', 'Close_Lag2',
'Close_Rolling_Mean', 'Close_Rolling_Std']].head())
Year Month Quarter WeekOfYear Close_Lag1 Close_Lag2 \
Date
2003-12-31 2003 12 4 1 NaN NaN
2004-01-31 2004 1 1 5 45.49080 NaN
2004-02-29 2004 2 1 9 45.23675 45.49080
2004-03-31 2004 3 1 14 45.27450 45.23675
2004-04-30 2004 4 2 18 44.75880 45.27450
Close_Rolling_Mean Close_Rolling_Std
Date
2003-12-31 NaN NaN
2004-01-31 NaN NaN
2004-02-29 45.334017 0.137084
2004-03-31 45.090017 0.287462
2004-04-30 44.647100 0.690064
The additional features have been successfully extracted and added to the dataset. These features provide a richer dataset for analysis or modelling. For example:
- Date-Time Features can be useful to identify seasonal patterns or trends over different time scales (yearly, quarterly, etc.).
- Lag Features are crucial in forecasting models where past values are strong indicators of future ones.
- Rolling Window Statistics offer insights into the short-term variability and trends of the data.
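One practical caveat: shifting and rolling windows leave NaNs in the first rows, which most models cannot consume, so it is common to drop those rows before fitting. A sketch on a hypothetical ‘Close’ series:

```python
import pandas as pd

# Hypothetical 'Close' series to show the NaNs created by lag/rolling features
close = pd.Series([45.0, 45.5, 46.0, 46.5, 47.0])
features = pd.DataFrame({
    "Close": close,
    "Close_Lag1": close.shift(1),                          # NaN in row 0
    "Close_Rolling_Mean": close.rolling(window=3).mean(),  # NaN in rows 0-1
})

# Drop the incomplete leading rows before modelling
model_ready = features.dropna()
print(len(model_ready))  # 3
```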
So, this is how we can perform Time Series Data Manipulation using Python. If you want to learn how to perform Time Series Forecasting, you can find a detailed article on it here.
Summary
Manipulating time series data involves several steps, including loading the data, parsing time information, handling missing values, resampling for different time periods, and potentially transforming or aggregating the data. I hope you liked this article on a guide to Time Series Data Manipulation using Python. Feel free to ask your questions in the comments section below.