Time Series Forecasting with ARIMA

Time Series Forecasting means analyzing and modeling time-series data to predict future values and support decisions. Some of the applications of Time Series Forecasting are weather forecasting, sales forecasting, business forecasting, and stock price forecasting. The ARIMA model is a popular statistical technique used for Time Series Forecasting. If you want to learn Time Series Forecasting with ARIMA, this article is for you. In this article, I will take you through the task of Time Series Forecasting with ARIMA using the Python programming language.

What is ARIMA?

ARIMA stands for Autoregressive Integrated Moving Average. It is an algorithm used for forecasting time series data. ARIMA models have three parameters, written as ARIMA(p, d, q). Here p, d, and q are defined as:

p is the number of lagged values of the series (the label column) that the model uses as predictors. It captures the autoregressive part of ARIMA.
d is the number of times the data needs to be differenced to produce a stationary signal. If the data is already stationary, the value of d should be 0; if it has a trend, d is usually 1. d captures the integrated part of ARIMA.
q is the number of lagged forecast errors that the model uses as predictors. It captures the moving average part of ARIMA.
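
To make the three orders concrete, here is a minimal sketch of how an ARIMA(1, 1, 1) one-step forecast can be assembled by hand. The toy series, the constant c, and the coefficients phi and theta are made-up values for illustration only, not estimates from real data.

```python
import numpy as np

# Toy price series and made-up coefficients (illustrative assumptions only).
y = np.array([100.0, 102.0, 101.0, 104.0, 107.0])
c, phi, theta = 0.5, 0.6, 0.3   # constant, AR(1) and MA(1) coefficients

# d = 1: difference once so the model works on period-to-period changes.
dy = np.diff(y)

# Run the ARMA(1, 1) recursion on the differences to recover past errors.
errors = np.zeros_like(dy)
for t in range(1, len(dy)):
    predicted_change = c + phi * dy[t - 1] + theta * errors[t - 1]
    errors[t] = dy[t] - predicted_change

# p = 1 uses the last change, q = 1 uses the last forecast error.
next_change = c + phi * dy[-1] + theta * errors[-1]

# "Integrated": add the predicted change back onto the last observed level.
next_level = y[-1] + next_change
print(round(next_change, 4), round(next_level, 4))
```

In a real model the coefficients are estimated from the data; the point here is only which past quantities p, d, and q bring into the forecast.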

I hope you have now understood the ARIMA model. In the section below, I will take you through the task of Time Series Forecasting of stock prices with ARIMA using the Python programming language.

Time Series Forecasting with ARIMA

Now let’s start with the task of Time Series Forecasting with ARIMA. I will first collect Google stock price data using the Yahoo Finance API through the yfinance package. If you have never used the Yahoo Finance API, you can learn more about it here.

Now here’s how to collect data about Google’s stock price:

import pandas as pd
import yfinance as yf
from datetime import date, timedelta

# Download window: the last 365 calendar days up to today
today = date.today()
end_date = today.strftime("%Y-%m-%d")
start_date = (today - timedelta(days=365)).strftime("%Y-%m-%d")

data = yf.download('GOOG', 
                      start=start_date, 
                      end=end_date, 
                      progress=False)
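# Newer yfinance releases may return different columns (for example, no
# "Adj Close"); adjust the selection below if a KeyError is raised.
data["Date"] = data.index
data = data[["Date", "Open", "High", "Low", "Close", "Adj Close", "Volume"]]
data.reset_index(drop=True, inplace=True)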
print(data.tail())

          Date         Open         High          Low        Close  \
247 2022-06-13  2148.919922  2184.370117  2131.760986  2137.530029   
248 2022-06-14  2137.800049  2169.149902  2127.040039  2143.879883   
249 2022-06-15  2177.989990  2241.260010  2162.375000  2207.810059   
250 2022-06-16  2162.989990  2185.810059  2115.850098  2132.719971   
251 2022-06-17  2130.699951  2184.989990  2112.571045  2157.310059   

       Adj Close   Volume  
247  2137.530029  1837800  
248  2143.879883  1274000  
249  2207.810059  1659600  
250  2132.719971  1765700  
251  2157.310059  2163500

We only need the Date and Close columns for the rest of the task, so let’s select those two columns and move further:

data = data[["Date", "Close"]]
print(data.head())

        Date        Close
0 2021-06-21  2529.100098
1 2021-06-22  2539.989990
2 2021-06-23  2529.229980
3 2021-06-24  2545.639893
4 2021-06-25  2539.899902

Now let’s visualize the close prices of Google before moving forward:

import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
plt.figure(figsize=(15, 10))
plt.plot(data["Date"], data["Close"])

Using ARIMA for Time Series Forecasting

Before using the ARIMA model, we have to figure out whether our data is stationary and whether it has a seasonal pattern. The line chart of the closing stock prices above shows that the price level drifts over time, so our dataset is not stationary. To examine this more carefully, we can use the seasonal decomposition method, which splits the time series into trend, seasonal, and residual components:

from statsmodels.tsa.seasonal import seasonal_decompose
# `period` replaces the older `freq` argument in current statsmodels releases
result = seasonal_decompose(data["Close"], 
                            model='multiplicative', period=30)
fig = result.plot()  
fig.set_size_inches(15, 10)

Figure: seasonal decomposition of the closing prices

So our data is not stationary, and the decomposition shows a repeating seasonal component. A Seasonal ARIMA (SARIMA) model is therefore the better choice for Time Series Forecasting on this data. But before using the SARIMA model, we will use the ARIMA model, so that you can learn to use both.

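To use ARIMA or SARIMA, we need to find the p, d, and q values. In this article, we read the value of p from the autocorrelation plot of the Close column and the value of q from the partial autocorrelation plot. Note that the textbook convention is the other way around (the partial autocorrelation plot suggests p and the autocorrelation plot suggests q, both drawn for the differenced series), so treat the values below as rough starting points. The value of d is the number of times the series has to be differenced to become stationary, which in practice is usually 0 or 1. As our data is not stationary, we use 1 as the d value.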

Now here’s how to find the value of p:

pd.plotting.autocorrelation_plot(data["Close"])

In the above autocorrelation plot, the curve is moving down after the 5th line of the first boundary. That is how we decide the value of p here. Hence the value of p is 5. Now let’s find the value of q (moving average):

from statsmodels.graphics.tsaplots import plot_pacf
plot_pacf(data["Close"], lags = 100)

In the above partial autocorrelation plot, only two points lie far away from the rest. That is how we decide the value of q here. Hence the value of q is 2. Now let’s build an ARIMA model:

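p, d, q = 5, 1, 2
# Legacy API: statsmodels.tsa.arima_model was removed in statsmodels 0.13.
# On newer releases, import ARIMA from statsmodels.tsa.arima.model instead
# and call model.fit() without the disp argument.
from statsmodels.tsa.arima_model import ARIMA
model = ARIMA(data["Close"], order=(p, d, q))
fitted = model.fit(disp=-1)
print(fitted.summary())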

                             ARIMA Model Results                              
==============================================================================
Dep. Variable:                D.Close   No. Observations:                  251
Model:                 ARIMA(5, 1, 2)   Log Likelihood               -1328.041
Method:                       css-mle   S.D. of innovations             48.034
Date:                Tue, 21 Jun 2022   AIC                           2674.083
Time:                        06:12:58   BIC                           2705.812
Sample:                             1   HQIC                          2686.851
                                                                              
=================================================================================
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
const            -1.5031      2.251     -0.668      0.505      -5.914       2.908
ar.L1.D.Close     0.0443      0.243      0.182      0.856      -0.432       0.520
ar.L2.D.Close     0.7582      0.204      3.712      0.000       0.358       1.158
ar.L3.D.Close    -0.0690      0.079     -0.870      0.385      -0.224       0.086
ar.L4.D.Close    -0.0623      0.069     -0.901      0.369      -0.198       0.073
ar.L5.D.Close     0.0992      0.075      1.327      0.186      -0.047       0.246
ma.L1.D.Close    -0.0923      0.234     -0.394      0.694      -0.552       0.367
ma.L2.D.Close    -0.7388      0.191     -3.877      0.000      -1.112      -0.365
                                    Roots                                    
=============================================================================
                  Real          Imaginary           Modulus         Frequency
-----------------------------------------------------------------------------
AR.1            1.1301           -0.0000j            1.1301           -0.0000
AR.2           -1.4091           -0.2578j            1.4325           -0.4712
AR.3           -1.4091           +0.2578j            1.4325            0.4712
AR.4            1.1583           -1.7339j            2.0852           -0.1563
AR.5            1.1583           +1.7339j            2.0852            0.1563
MA.1            1.1026           +0.0000j            1.1026            0.0000
MA.2           -1.2276           +0.0000j            1.2276            0.5000
-----------------------------------------------------------------------------

Here’s how to predict the values using the ARIMA model:

predictions = fitted.predict()
print(predictions)

2     -2.108482
3     -0.789990
4     -3.688940
5     -0.777623
6     -2.472432
         ...   
247    2.866723
248    2.486679
249    7.659670
250    5.277199
251    8.960482
Length: 250, dtype: float64
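
The numbers above are on the scale of D.Close, the first difference of the closing price, so each one is a predicted day-to-day change rather than a price. In the legacy API, fitted.predict(typ='levels') returns prices directly. The sketch below shows the arithmetic behind that conversion on made-up numbers.

```python
import pandas as pd

# Toy closing prices and toy predicted changes (made-up numbers).
close = pd.Series([100.0, 102.0, 101.0, 104.0, 107.0])
predicted_change = pd.Series([1.5, -0.5, 2.0, 2.5], index=[1, 2, 3, 4])

# One-step-ahead level = previous actual close + predicted change.
previous_close = close.shift(1).loc[predicted_change.index]
predicted_level = previous_close + predicted_change
print(predicted_level)
```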

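The predicted values do not look like stock prices because, with d = 1, the legacy predict() method returns predicted day-to-day changes (the differenced series D.Close) rather than price levels. In addition, a plain ARIMA model has no seasonal terms, so it cannot capture a seasonal pattern in the data. So, here’s how to build a SARIMA model: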

import statsmodels.api as sm

# The seasonal order reuses (p, d, q) with a seasonal period of 12 observations
model = sm.tsa.statespace.SARIMAX(data['Close'],
                                  order=(p, d, q),
                                  seasonal_order=(p, d, q, 12))
model = model.fit()
print(model.summary())

                                 Statespace Model Results                                 
==========================================================================================
Dep. Variable:                              Close   No. Observations:                  252
Model:             SARIMAX(5, 1, 2)x(5, 1, 2, 12)   Log Likelihood               -1280.516
Date:                            Tue, 21 Jun 2022   AIC                           2591.032
Time:                                    06:15:00   BIC                           2643.179
Sample:                                         0   HQIC                          2612.046
                                            - 252                                         
Covariance Type:                              opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1         -0.0803      3.857     -0.021      0.983      -7.639       7.479
ar.L2          0.9622      3.583      0.269      0.788      -6.060       7.984
ar.L3         -0.0029      0.182     -0.016      0.987      -0.360       0.354
ar.L4          0.0123      0.193      0.064      0.949      -0.365       0.390
ar.L5          0.0586      0.249      0.236      0.814      -0.429       0.546
ma.L1          0.0256      3.032      0.008      0.993      -5.918       5.969
ma.L2         -0.9726      2.979     -0.327      0.744      -6.811       4.866
ar.S.L12       0.2082      0.783      0.266      0.790      -1.327       1.743
ar.S.L24       0.1491      0.086      1.738      0.082      -0.019       0.317
ar.S.L36      -0.0226      0.182     -0.124      0.901      -0.379       0.334
ar.S.L48      -0.1415      0.089     -1.595      0.111      -0.315       0.032
ar.S.L60      -0.0981      0.132     -0.744      0.457      -0.356       0.160
ma.S.L12      -1.2637      0.717     -1.762      0.078      -2.669       0.142
ma.S.L24       0.2782      0.759      0.367      0.714      -1.210       1.766
sigma2      2203.0788   1934.635      1.139      0.255   -1588.737    5994.894
===================================================================================
Ljung-Box (Q):                       29.16   Jarque-Bera (JB):                21.53
Prob(Q):                              0.90   Prob(JB):                         0.00
Heteroskedasticity (H):               2.69   Skew:                             0.15
Prob(H) (two-sided):                  0.00   Kurtosis:                         4.44
===================================================================================

Now let’s predict future stock prices using the SARIMA model. The end index of predict is inclusive, so the call below returns 11 values, one for each of the next 11 trading days:

predictions = model.predict(len(data), len(data)+10)
print(predictions)

252    2155.450727
253    2174.383879
254    2138.454522
255    2118.298381
256    2117.235728
257    2112.857380
258    2099.387811
259    2085.703155
260    2117.912628
261    2133.935300
262    2168.589946
dtype: float64
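
The forecast above is indexed by row position (252, 253, ...) because the Date column was not used as the index. One way to attach calendar dates is sketched below, assuming that every weekday is a trading day; exchange holidays are ignored. The three values are copied from the output above and rounded.

```python
import pandas as pd

# A few forecast values with an integer index, as in the output above.
predictions = pd.Series([2155.45, 2174.38, 2138.45], index=[252, 253, 254])
last_date = pd.Timestamp("2022-06-17")   # last date in the downloaded sample

# Business days following the last observation (holidays are not skipped).
future_dates = pd.bdate_range(start=last_date + pd.offsets.BDay(1),
                              periods=len(predictions))
dated = pd.Series(predictions.to_numpy(), index=future_dates)
print(dated)
```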

Here’s how you can plot the predictions:

data["Close"].plot(legend=True, label="Training Data", figsize=(15, 10))
predictions.plot(legend=True, label="Predictions")

Figure: training data and SARIMA predictions

So this is how you can use ARIMA or SARIMA models for Time Series Forecasting using Python.

Summary

ARIMA stands for Autoregressive Integrated Moving Average. It is an algorithm used for forecasting time series data. Plain ARIMA suits data without a seasonal pattern, with differencing taking care of trends; if the data is seasonal, we need to use Seasonal ARIMA (SARIMA). I hope you liked this article about Time Series Forecasting with ARIMA using Python. Feel free to ask questions in the comments section below.