Granger Causality Test in Python

Granger Causality test is a statistical test that is used to determine if a given time series and it’s lags is helpful in explaining the value of another series. You can implement this in Python using the statsmodels package.

That is, the Granger Causality can be used to check if a given series is a leading indicator of a series we want to forecast/predict.

How to implement Granger Causality Test in Python

You can implement Granger Causality test using the grangercausalitytests function present in the statsmodels library.

Let’s start by importing the packages.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
%matplotlib inline

Read the dataset: BJsales

df = pd.read_csv("BJsales.csv")
(150, 2)
BJSales Bjsales_lead
0 200.1 10.01
1 199.5 10.07
2 199.4 10.32
3 198.9 9.75
4 199.0 10.33

Visualize the Lead effect in the time series

x = df.index
y1 = df['BJSales']
y2 = df['Bjsales_lead']

# Plot Line1 (Left Y Axis)
fig, ax1 = plt.subplots(1,1,figsize=(16,9), dpi= 80)
ax1.plot(x, y1, color='tab:red')

# Plot Line2 (Right Y Axis)
ax2 = ax1.twinx()  # instantiate a second axes that shares the same x-axis
ax2.plot(x, y2, color='tab:blue')

# Decorations
# ax1 (left Y axis)
ax1.set_xlabel('Time', fontsize=20)
ax1.tick_params(axis='x', rotation=0, labelsize=12)
ax1.set_ylabel('BJSales', color='tab:red', fontsize=20)
ax1.tick_params(axis='y', rotation=0, labelcolor='tab:red' )

# ax2 (right Y axis)
ax2.set_ylabel("BJSales Lead", color='tab:blue', fontsize=20)
ax2.tick_params(axis='y', labelcolor='tab:blue')
ax2.set_xticks(np.arange(0, len(x), 60))
ax2.set_xticklabels(x[::60], rotation=90, fontdict={'fontsize':10})
ax2.set_title("Visualizing Leading Indicator Phenomenon", fontsize=22)

leading_indicator effect in timeseries

The above chart is a template borrowed from the Top 50 Matplotlib Visualizations.

The Granger Causality Test Function in Python Statsmodels

from statsmodels.tsa.stattools import grangercausalitytests

Let’s look at the help of grangercausalitytests

Signature: grangercausalitytests(x, maxlag, addconst=True, verbose=None)
Four tests for granger non causality of 2 time series.

All four tests give similar results. `params_ftest` and `ssr_ftest` are
equivalent based on F test which is identical to lmtest:grangertest in R.

x : array_like
    The data for testing whether the time series in the second column Granger
    causes the time series in the first column. Missing values are not
maxlag : {int, Iterable[int]}
    If an integer, computes the test for all lags up to maxlag. If an
    iterable, computes the tests only for the lags in maxlag.
addconst : bool
    Include a constant in the model.
verbose : bool
    Print results. Deprecated

    .. deprecated: 0.14

       verbose is deprecated and will be removed after 0.15 is released

    All test results, dictionary keys are the number of lags. For each
    lag the values are a tuple, with the first element a dictionary with
    test statistic, pvalues, degrees of freedom, the second element are
    the OLS estimation results for the restricted model, the unrestricted
    model and the restriction (contrast) matrix for the parameter f_test.

The Null hypothesis for grangercausalitytests is that the time series in
the second column, x2, does NOT Granger cause the time series in the first
column, x1. Grange causality means that past values of x2 have a
statistically significant effect on the current value of x1, taking past
values of x1 into account as regressors. We reject the null hypothesis
that x2 does not Granger cause x1 if the pvalues are below a desired size
of the test.

The null hypothesis for all four test is that the coefficients
corresponding to past values of the second time series are zero.

`params_ftest`, `ssr_ftest` are based on F distribution

`ssr_chi2test`, `lrtest` are based on chi-square distribution

>>> import statsmodels.api as sm
>>> from statsmodels.tsa.stattools import grangercausalitytests
>>> import numpy as np
>>> data = sm.datasets.macrodata.load_pandas()
>>> data =[["realgdp", "realcons"]].pct_change().dropna()

All lags up to 4
>>> gc_res = grangercausalitytests(data, 4)

Only lag 4
>>> gc_res = grangercausalitytests(data, [4])    

Now, define the maximum number of lags to include in the test.

maxlag = 12
test   = 'ssr_chi2test'

As stated, the Null hypothesis is that the BJSales Lead, does NOT Granger cause BJSales.

Run the Granger Causality Test

The grangers_causation_matrix function defined below is designed to perform the test on multiple combinations of variables in one shot. That is, it tries to perform the Granger Causality test for all pairs of variables present in the input dataset.

def grangers_causation_matrix(data, variables, test='ssr_chi2test', verbose=False):    
    """Check Granger Causality of all possible combinations of the Time series.
    The rows are the response variable, columns are predictors. The values in the table 
    are the P-Values. P-Values lesser than the significance level (0.05), implies 
    the Null Hypothesis that the coefficients of the corresponding past values is 
    zero, that is, the X does not cause Y can be rejected.

    data      : pandas dataframe containing the time series variables
    variables : list containing names of the time series variables.
    df = pd.DataFrame(np.zeros((len(variables), len(variables))), columns=variables, index=variables)
    for c in df.columns:
        for r in df.index:
            test_result = grangercausalitytests(data[[r, c]], maxlag=maxlag, verbose=False)
            p_values = [round(test_result[i+1][0][test][1],4) for i in range(maxlag)]
            if verbose: print(f'Y = {r}, X = {c}, P Values = {p_values}')
            min_p_value = np.min(p_values)
            df.loc[r, c] = min_p_value
    df.columns = [var + '_x' for var in variables]
    df.index = [var + '_y' for var in variables]
    return df

The output contains the p-values of the Granger causality test, with the X and Y variables shown in the columns and rows respectively.

grangers_causation_matrix(df, variables = df.columns)  
BJSales_x Bjsales_lead_x
BJSales_y 1.0000 0.0
Bjsales_lead_y 0.0002 1.0

How to interpret the p-values?

Assuming a significance level of 0.05, if the p-value is lesser than 0.05, then we do NOT reject the null hypothesis that X does NOT granger cause Y.

So, in the above table, the p-value for X=Bjsales_lead and Y=BJSales is 0.0. So we reject the null hypothesis and conclude that X (BJSales_Lead) granger causes Y (BJSales).

That means, BJSales_lead will likely be helpful in predicting the BJSales, which confirms what we saw in the dual times series chart earlier.

Course Preview

Machine Learning A-Z™: Hands-On Python & R In Data Science

Free Sample Videos:

Machine Learning A-Z™: Hands-On Python & R In Data Science

Machine Learning A-Z™: Hands-On Python & R In Data Science

Machine Learning A-Z™: Hands-On Python & R In Data Science

Machine Learning A-Z™: Hands-On Python & R In Data Science

Machine Learning A-Z™: Hands-On Python & R In Data Science