Time series data is a sequence of data points collected or recorded at regular time intervals. Some common examples of time series include daily stock prices, monthly weather data, and yearly sales figures. If you want to learn how to work with time series data, this article is for you. In this article, I’ll take you through a complete guide to Time Series Data Manipulation using Python.
Time Series Data Manipulation using Python
Manipulating time series data involves several steps, including loading the data, parsing time information, handling missing values, resampling for different time periods, and potentially transforming or aggregating the data.
Let’s get started with Time Series Data Manipulation using Python by importing a time series dataset:
import pandas as pd
# Load the dataset
time_series_data = pd.read_csv("INR-USD.csv")
# Display the first few rows of the dataset
print(time_series_data.head())
Date Open High Low Close Adj Close Volume
0 2003-12-01 45.709000 45.728001 45.449001 45.480000 45.480000 0.0
1 2003-12-08 45.474998 45.507999 45.352001 45.451000 45.451000 0.0
2 2003-12-15 45.450001 45.500000 45.332001 45.455002 45.455002 0.0
3 2003-12-22 45.417000 45.549000 45.296001 45.507999 45.507999 0.0
4 2003-12-29 45.439999 45.645000 45.421001 45.560001 45.560001 0.0
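As a side note, pandas can also parse dates while loading: the parse_dates and index_col arguments of read_csv combine the loading and parsing steps into one call. A minimal sketch on a hypothetical two-row CSV (not the real INR-USD file):

```python
import pandas as pd
from io import StringIO

# Hypothetical two-row CSV standing in for INR-USD.csv
csv_text = """Date,Close
2003-12-01,45.48
2003-12-08,45.45"""

# parse_dates converts 'Date' to datetime at load time;
# index_col makes it the index in the same call
df = pd.read_csv(StringIO(csv_text), parse_dates=["Date"], index_col="Date")
print(df.index.dtype)  # datetime64[ns]
```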
The dataset consists of columns like ‘Date’, ‘Open’, ‘High’, ‘Low’, ‘Close’, ‘Adj Close’, and ‘Volume’. These columns are typical in financial time series datasets, where:
- Date: Represents the time at which the data was recorded.
- Open: The price at the beginning of the trading period.
- High: The highest price during the trading period.
- Low: The lowest price during the trading period.
- Close: The price at the end of the trading period.
- Adj Close: The closing price after adjustments for all applicable splits and dividend distributions.
- Volume: Trade Volume.
The next step is to ensure that the ‘Date’ column is correctly recognized as a datetime type. This is crucial for time series analysis, as it allows the data to be manipulated and queried based on time. Let’s parse the ‘Date’ column and set it as the index of the DataFrame:
# Parsing the 'Date' column and setting it as the index
time_series_data['Date'] = pd.to_datetime(time_series_data['Date'])
time_series_data.set_index('Date', inplace=True)
# Display the DataFrame with the parsed dates
print(time_series_data.head())
Open High Low Close Adj Close Volume
Date
2003-12-01 45.709000 45.728001 45.449001 45.480000 45.480000 0.0
2003-12-08 45.474998 45.507999 45.352001 45.451000 45.451000 0.0
2003-12-15 45.450001 45.500000 45.332001 45.455002 45.455002 0.0
2003-12-22 45.417000 45.549000 45.296001 45.507999 45.507999 0.0
2003-12-29 45.439999 45.645000 45.421001 45.560001 45.560001 0.0
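With a DatetimeIndex in place, pandas supports partial-string indexing: you can select rows by year, by month, or by an explicit date range directly with .loc. A small sketch on a hypothetical miniature series:

```python
import pandas as pd

# Hypothetical miniature series standing in for the INR-USD data
dates = pd.to_datetime(["2003-12-01", "2003-12-08", "2004-01-05", "2004-01-12"])
df = pd.DataFrame({"Close": [45.48, 45.45, 45.60, 45.70]}, index=dates)

# Partial-string indexing: all rows from 2004
rows_2004 = df.loc["2004"]
print(len(rows_2004))  # 2

# An explicit date-range slice (inclusive on both ends)
window = df.loc["2003-12-01":"2004-01-05"]
print(len(window))  # 3
```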
Now that the ‘Date’ column is correctly parsed and set as the index, the next step is to inspect the data for any missing or anomalous values (using descriptive statistics):
# Checking for missing values
missing_values = time_series_data.isnull().sum()
print(missing_values)
Open 3
High 3
Low 3
Close 3
Adj Close 3
Volume 3
dtype: int64
basic_statistics = time_series_data.describe()
print(basic_statistics)
Open High Low Close Adj Close Volume
count 1013.000000 1013.000000 1013.000000 1013.000000 1013.000000 1013.0
mean 58.035208 58.506681 57.654706 58.056509 58.056509 0.0
std 12.614635 12.716632 12.565279 12.657407 12.657407 0.0
min 38.995998 39.334999 38.979000 39.044998 39.044998 0.0
25% 45.508999 45.775002 45.231998 45.498001 45.498001 0.0
50% 59.702999 60.342999 59.209999 59.840000 59.840000 0.0
75% 68.508499 69.099998 68.250000 68.538002 68.538002 0.0
max 82.917999 83.386002 82.563004 82.932999 82.932999 0.0
There are three missing values in each column (the ‘Date’ column, now the index, is complete). The descriptive statistics do not immediately suggest the presence of extreme anomalies or outliers.
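If you want more than a visual scan of the statistics, a simple interquartile-range (IQR) rule of thumb can flag candidate outliers. A sketch on a toy column with hypothetical values:

```python
import pandas as pd

# Toy 'Close' column with hypothetical values, one of them extreme
close = pd.Series([45.5, 45.4, 45.6, 45.5, 82.9])

# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = close.quantile(0.25), close.quantile(0.75)
iqr = q3 - q1
outliers = close[(close < q1 - 1.5 * iqr) | (close > q3 + 1.5 * iqr)]
print(len(outliers))  # 1
```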
There are several strategies to handle missing values, including:
- Dropping Rows: If the dataset is large and the number of rows with missing data is small, we can choose to remove the rows with missing values.
- Imputation: Replace missing values with a substitute value, like the mean, median, or value estimated by a model.
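Both strategies are one-liners in pandas; here is a sketch on a toy column with hypothetical values:

```python
import pandas as pd
import numpy as np

# Toy column with missing values (stand-in for the real dataset)
s = pd.DataFrame({"Close": [45.5, np.nan, 46.0, np.nan, 47.0]})

# Option 1: drop rows with missing values
dropped = s.dropna()
print(len(dropped))  # 3

# Option 2: impute with the column mean
filled = s.fillna(s.mean())
print(filled["Close"].isnull().sum())  # 0
```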
Given the small number of missing values in this dataset, either approach could be suitable. For demonstration, I’ll choose to fill the missing values with the mean of their respective columns. So, let’s proceed with handling the missing values and then move to the next steps of resampling and aggregating the data:
# Handling missing values by imputing with the mean
time_series_data_filled = time_series_data.fillna(time_series_data.mean())

# Verify that there are no more missing values
missing_values_after = time_series_data_filled.isnull().sum()
print(missing_values_after)
Open 0
High 0
Low 0
Close 0
Adj Close 0
Volume 0
dtype: int64
The missing values have been successfully filled with the mean of their respective columns, and there are no more missing values in the dataset.
Resampling and Aggregating
Resampling is a common method in time series analysis, particularly when dealing with financial data. It involves changing the frequency of your time series observations.
For instance, you can aggregate daily data into monthly or yearly data. It is often done for two reasons: to make the dataset more manageable or to observe longer-term trends. There are two types of resampling:
- Downsampling: Decreasing the frequency of the samples, e.g., from days to months.
- Upsampling: Increasing the frequency of the samples, e.g., from months to days.
In each case, you might need to aggregate the data using methods such as taking the mean, sum, or median. For this dataset, let’s demonstrate downsampling by aggregating the weekly data into monthly data and computing the mean for each month.
After resampling, we can also look at specific time periods more closely, if needed. Let’s perform the resampling:
# Resampling the data from weekly to monthly frequency
# (pandas 2.2+ prefers the 'ME' alias over 'M' for month-end)
monthly_resampled_data = time_series_data_filled.resample('M').mean()
print(monthly_resampled_data.head())
Open High Low Close Adj Close Volume
Date
2003-12-31 45.498200 45.586000 45.370001 45.49080 45.49080 0.0
2004-01-31 45.365999 45.433750 45.172999 45.23675 45.23675 0.0
2004-02-29 45.167251 45.285999 45.106251 45.27450 45.27450 0.0
2004-03-31 45.026801 45.073800 44.558600 44.75880 44.75880 0.0
2004-04-30 43.767750 44.049250 43.630000 43.90800 43.90800 0.0
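The code above downsamples; the opposite direction, upsampling, introduces new timestamps whose values start out missing and must be filled, for example by interpolation. A sketch on a hypothetical monthly series:

```python
import pandas as pd

# Hypothetical monthly series (month-start dates) to upsample
idx = pd.date_range("2004-01-01", periods=3, freq="MS")
monthly = pd.Series([45.2, 45.3, 44.8], index=idx)

# Upsample to daily frequency; the new rows start out as NaN
daily = monthly.resample("D").asfreq()
print(len(daily))            # 61
print(daily.isnull().sum())  # 58

# Fill the gaps, e.g. by linear interpolation
daily_filled = daily.interpolate()
print(daily_filled.isnull().sum())  # 0
```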
Feature Extraction from Time Series
Extracting relevant features from time series data is a crucial step, particularly if you’re preparing the data for time series forecasting or understanding underlying patterns.
Feature extraction involves creating new variables that capture important aspects of the time series. These features can be based on the time stamps themselves or derived from the data values. Some common time series features include:
- Date-Time Features: Extract components such as year, month, day of the week, etc., from the date-time index.
- Lag Features: Creating features that represent values from previous time steps.
- Rolling Window Statistics: Calculating statistics (e.g., mean, median, standard deviation) over a moving window of time.
I’ll demonstrate these techniques using the monthly resampled data. Let’s start with extracting date-time features:
# Extracting date-time features
monthly_resampled_data['Year'] = monthly_resampled_data.index.year
monthly_resampled_data['Month'] = monthly_resampled_data.index.month
monthly_resampled_data['Quarter'] = monthly_resampled_data.index.quarter
monthly_resampled_data['WeekOfYear'] = monthly_resampled_data.index.isocalendar().week
# Creating lag features for 'Close' price
monthly_resampled_data['Close_Lag1'] = monthly_resampled_data['Close'].shift(1)
monthly_resampled_data['Close_Lag2'] = monthly_resampled_data['Close'].shift(2)
# Rolling window statistics for 'Close' price (using a window of 3 months)
monthly_resampled_data['Close_Rolling_Mean'] = monthly_resampled_data['Close'].rolling(window=3).mean()
monthly_resampled_data['Close_Rolling_Std'] = monthly_resampled_data['Close'].rolling(window=3).std()
# Displaying the DataFrame with the new features
print(monthly_resampled_data[['Year', 'Month', 'Quarter', 'WeekOfYear',
'Close_Lag1', 'Close_Lag2',
'Close_Rolling_Mean', 'Close_Rolling_Std']].head())
Year Month Quarter WeekOfYear Close_Lag1 Close_Lag2 \
Date
2003-12-31 2003 12 4 1 NaN NaN
2004-01-31 2004 1 1 5 45.49080 NaN
2004-02-29 2004 2 1 9 45.23675 45.49080
2004-03-31 2004 3 1 14 45.27450 45.23675
2004-04-30 2004 4 2 18 44.75880 45.27450
Close_Rolling_Mean Close_Rolling_Std
Date
2003-12-31 NaN NaN
2004-01-31 NaN NaN
2004-02-29 45.334017 0.137084
2004-03-31 45.090017 0.287462
2004-04-30 44.647100 0.690064
The additional features have been successfully extracted and added to the dataset. These features provide a richer dataset for analysis or modelling. For example:
- Date-Time Features can be useful to identify seasonal patterns or trends over different time scales (yearly, quarterly, etc.).
- Lag Features are crucial in forecasting models where past values are strong indicators of future ones.
- Rolling Window Statistics offer insights into the short-term variability and trends of the data.
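One practical caveat: shifting and rolling windows leave NaNs in the first rows, which most models cannot consume, so it is common to drop those rows before fitting. A sketch on a hypothetical ‘Close’ series:

```python
import pandas as pd

# Hypothetical 'Close' series to show the NaNs created by lag/rolling features
close = pd.Series([45.0, 45.5, 46.0, 46.5, 47.0])
features = pd.DataFrame({
    "Close": close,
    "Close_Lag1": close.shift(1),                          # NaN in row 0
    "Close_Rolling_Mean": close.rolling(window=3).mean(),  # NaN in rows 0-1
})

# Drop the incomplete leading rows before modelling
model_ready = features.dropna()
print(len(model_ready))  # 3
```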
So, this is how we can perform Time Series Data Manipulation using Python. If you want to learn how to perform Time Series Forecasting, you can find a detailed article on it here.
Summary
Manipulating time series data involves several steps, including loading the data, parsing time information, handling missing values, resampling for different time periods, and potentially transforming or aggregating the data. I hope you liked this article on a guide to Time Series Data Manipulation using Python. Feel free to ask your questions in the comments section below.