Cook’s Distance for Detecting Influential Observations

Cook’s distance measures the influence each observation exerts on a fitted regression model. Because it is computed from a regression fit, it is affected only by the X variables included in the model.

Written by Selva Prabhakaran | 4 min read

What is Cook’s Distance?

Cook’s distance measures the influence exerted by each data point (row / observation) on the predicted outcome.

Cook’s distance for an observation ‘i’ measures the change in Ŷ (the fitted values) across all observations when the model is fit with and without observation ‘i’, which tells us how much observation ‘i’ impacts the fitted values.

In other words, you build the model with and without observation ‘i’ in the dataset and see how much the predicted values of all the other observations change.

Mathematically, Cook’s distance D_i for observation ‘i’ is computed as:

$$D_i = \frac{\sum_{j=1}^{n} \left(\hat{Y}_j - \hat{Y}_{j(i)}\right)^2}{p \cdot MSE}$$

where,

  • Ŷ_j is the jth fitted value when all the observations are included.
  • Ŷ_j(i) is the jth fitted value when the fit excludes observation i.
  • MSE is the mean squared error of the regression model.
  • p is the number of coefficients in the regression model.
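
To make the definition concrete, here is a minimal sketch (on hypothetical toy data, not the dataset used below) that computes D_i for a single observation by literally refitting the model without it and applying the formula above:

python
import numpy as np
import statsmodels.api as sm

# hypothetical toy data: one predictor plus an intercept
rng = np.random.default_rng(0)
x = sm.add_constant(rng.normal(size=20))
y = x @ np.array([1.0, 2.0]) + rng.normal(size=20)

full = sm.OLS(y, x).fit()
p = x.shape[1]                 # number of coefficients (incl. intercept)
mse = full.mse_resid           # MSE of the full fit

i = 0                          # observation whose influence we measure
reduced = sm.OLS(np.delete(y, i), np.delete(x, i, axis=0)).fit()

yhat     = full.predict(x)     # fitted values with observation i
yhat_noi = reduced.predict(x)  # fitted values without observation i

D_i = np.sum((yhat - yhat_noi) ** 2) / (p * mse)
print(D_i)                     # matches full.get_influence().cooks_distance[0][i]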

Pros and Cons

Pros:

  1. Considers all the variables in the model when calling out influential points.
  2. Formula-based and intuitive to understand.

Cons:

  1. Computing Cook’s distance for a row conceptually requires refitting the model without that row, which makes the method computationally expensive to extend to algorithms other than linear regression. For linear regression itself, a closed-form shortcut based on leverages avoids the refits, as the sketch below shows.
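
In practice, you rarely refit the model n times. For ordinary least squares there is a well-known closed-form identity based on leverages that gives the same numbers in a single pass; a minimal sketch on hypothetical toy data:

python
import numpy as np
import statsmodels.api as sm

# hypothetical toy data: two predictors plus an intercept
rng = np.random.default_rng(1)
X = sm.add_constant(rng.normal(size=(30, 2)))
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=30)

res = sm.OLS(y, X).fit()
infl = res.get_influence()

h = infl.hat_matrix_diag       # leverages h_ii from the hat matrix
e = res.resid                  # residuals of the full fit
p = X.shape[1]                 # number of coefficients

# Cook's distance via the leverage identity: no refitting needed
D = (e ** 2 / (p * res.mse_resid)) * (h / (1 - h) ** 2)

# agrees with statsmodels' built-in computation
assert np.allclose(D, infl.cooks_distance[0])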

Let’s compute Cook’s distance in Python.

It can be computed directly with the statsmodels package. But first, let’s build the regression model.

Step 1: Prepare Data

python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

python
filename = "https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data"

headers = ["symboling","normalized-losses","make","fuel-type","aspiration", "num-of-doors","body-style",
"drive-wheels","engine-location","wheel-base", "length","width","height","curb-weight","engine-type",
"num-of-cylinders", "engine-size","fuel-system","bore","stroke","compression-ratio","horsepower",
"peak-rpm","city-mpg","highway-mpg","price"]

df = pd.read_csv(filename, names = headers)
df.head()
| | symboling | normalized-losses | make | fuel-type | aspiration | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | … | engine-size | fuel-system | bore | stroke | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | … | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
| 1 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | … | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |
| 2 | 1 | ? | alfa-romero | gas | std | two | hatchback | rwd | front | 94.5 | … | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500 |
| 3 | 2 | 164 | audi | gas | std | four | sedan | fwd | front | 99.8 | … | 109 | mpfi | 3.19 | 3.40 | 10.0 | 102 | 5500 | 24 | 30 | 13950 |
| 4 | 2 | 164 | audi | gas | std | four | sedan | 4wd | front | 99.4 | … | 136 | mpfi | 3.19 | 3.40 | 8.0 | 115 | 5500 | 18 | 22 | 17450 |

5 rows × 26 columns

Data cleaning

python
# keep a subset of numeric columns
df = df.loc[:, ['symboling', 'wheel-base', 'engine-size', 'bore', 'stroke',
                'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg',
                'highway-mpg', 'price']]

# missing values are coded as '?'; convert them to NaN and drop those rows
df = df.replace('?', np.nan)
df = df.dropna()
df.reset_index(drop = True, inplace = True)
df.head()
| | symboling | wheel-base | engine-size | bore | stroke | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 88.6 | 130 | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
| 1 | 3 | 88.6 | 130 | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |
| 2 | 1 | 94.5 | 152 | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500 |
| 3 | 2 | 99.8 | 109 | 3.19 | 3.40 | 10.0 | 102 | 5500 | 24 | 30 | 13950 |
| 4 | 2 | 99.4 | 136 | 3.19 | 3.40 | 8.0 | 115 | 5500 | 18 | 22 | 17450 |
Step 2: Train the linear regression model

python
import statsmodels.api as sm

#define response variable
y = df['price']

#define explanatory variables
x = df[['symboling', 'wheel-base', 'engine-size', 'bore', 'stroke',
'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg',
'highway-mpg']]

Build the linear regression model.

python
#add constant to predictor variables
x = sm.add_constant(x)

#fit linear regression model
model = sm.OLS(y.astype(float), x.astype(float)).fit()
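
Not strictly required, but a quick look at the fit helps catch problems before interpreting influence scores; summary() is part of the statsmodels results API:

python
# quick sanity check: coefficients, R-squared, p-values
print(model.summary())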

Step 3: Calculate Cook’s distance

python
# suppress scientific notation in printed numpy arrays (np was imported earlier)
np.set_printoptions(suppress=True)

# create an influence object from the fitted model
influence = model.get_influence()

Get the influence (Cook’s distance) scores

python
# obtain Cook's distance for each observation;
# cooks_distance returns a tuple of two arrays: (distances, p-values)
cooks = influence.cooks_distance

# display Cook's distances
print(cooks)

(array([0.00005047, 0.00794651, 0.00235295, 0.00115412, 0.00229406,
0.00001475, 0.00040342, 0.00195747, 0.03696988, 0.00405958,
0.00509101, 0.00570949, 0.00620111, 0.01803001, 0.00335913,
0.07252824, 0.02556543, 0.04585015, 0.00000023, 0.00011847,
0.00000804, 0.00001075, 0.00001616, 0.00002787, 0.00000096,
0.00017893, 0.00003149, 0.00153593, 0.00321647, 0.05605546,
0.00023964, 0.0002776 , 0.00017649, 0.00000042, 0.00005717,
0.00005307, 0.00111793, 0.00011928, 0.00024373, 0.00013439,
0.00343908, 0.00001979, 0.00038317, 0.00006676, 0.00001466,
0.01387144, 0.67290488, 0.00029421, 0.00003223, 0.00022211,
0.00018456, 0.00050611, 0.00026183, 0.00039154, 0.000035 ,
0.0000165 , 0.00224472, 0.00031293, 0.00134812, 0.01269743,
0.00220525, 0.01360418, 0.01752224, 0.03048843, 0.00608054,
0.0820174 , 0.00601548, 0.06747548, 0.00295536, 0.00032691,
0.00016934, 0.00003394, 0.00003416, 0.00056306, 0.00151128,
0.00449227, 0.00008387, 0.00036013, 0.0020883 , 0.00080219,
0.00052908, 0.00073918, 0.00012555, 0.0017916 , 0.00000065,
0.00000839, 0.00006592, 0.00005771, 0.00016434, 0.00009382,
0.00022227, 0.00031158, 0.00052463, 0.00020926, 0.02253932,
0.01824963, 0.02049076, 0.007906 , 0.00873687, 0.00393047,
0.00092289, 0.01260195, 0.0056554 , 0.02883812, 0.00116899,
0.00057631, 0.00373692, 0.00311193, 0.00334457, 0.00000108,
0.00000342, 0.00000767, 0.00001616, 0.00002787, 0.00000096,
0.00000048, 0.00152386, 0.00403854, 0.01077807, 0.03652447,
0.06101313, 0.1287357 , 0.00197229, 0.00068141, 0.04555418,
0.00074255, 0.00214108, 0.00346266, 0.00702787, 0.00057118,
0.00005023, 0.00305842, 0.00066841, 0.00057935, 0.00003855,
0.00019247, 0.00160212, 0.00039846, 0.00206929, 0.00586846,
0.00000056, 0.00001257, 0.00003468, 0.00032255, 0.00158363,
0.00318834, 0.00000002, 0.00001426, 0.00186318, 0.00721941,
0.00078872, 0.00046616, 0.00120741, 0.00042025, 0.00052502,
0.0057842 , 0.00466251, 0.00925114, 0.00592447, 0.00508653,
0.00268166, 0.00212835, 0.00277433, 0.00000584, 0.0008918 ,
0.00041052, 0.00181117, 0.00261463, 0.02421767, 0.02716114,
0.01624373, 0.01152021, 0.00650464, 0.00052591, 0.00533062,
0.00039766, 0.00025162, 0.00000374, 0.00016362, 0.00156563,
0.00007099, 0.00105815, 0.01155108, 0.00050471, 0.00767723,
0.00446398, 0.00002331, 0.00001264, 0.00493016, 0.00522982,
0.00044824, 0.00037132, 0.00378943, 0.0084261 , 0.01092127]), array([1. , 1. , 1. , 1. , 1. ,
1. , 1. , 1. , 0.99999948, 1. ,
1. , 1. , 1. , 0.99999999, 1. ,
0.99998228, 0.99999993, 0.99999838, 1. , 1. ,
1. , 1. , 1. , 1. , 1. ,
1. , 1. , 1. , 1. , 0.99999534,
1. , 1. , 1. , 1. , 1. ,
1. , 1. , 1. , 1. , 1. ,
1. , 1. , 1. , 1. , 1. ,
1. , 0.76302816, 1. , 1. , 1. ,
1. , 1. , 1. , 1. , 1. ,
1. , 1. , 1. , 1. , 1. ,
1. , 1. , 0.99999999, 0.99999981, 1. ,
0.99996672, 1. , 0.99998779, 1. , 1. ,
1. , 1. , 1. , 1. , 1. ,
1. , 1. , 1. , 1. , 1. ,
1. , 1. , 1. , 1. , 1. ,
1. , 1. , 1. , 1. , 1. ,
1. , 1. , 1. , 1. , 0.99999996,
0.99999999, 0.99999998, 1. , 1. , 1. ,
1. , 1. , 1. , 0.99999986, 1. ,
1. , 1. , 1. , 1. , 1. ,
1. , 1. , 1. , 1. , 1. ,
1. , 1. , 1. , 1. , 0.99999951,
0.99999276, 0.99968315, 1. , 1. , 0.99999843,
1. , 1. , 1. , 1. , 1. ,
1. , 1. , 1. , 1. , 1. ,
1. , 1. , 1. , 1. , 1. ,
1. , 1. , 1. , 1. , 1. ,
1. , 1. , 1. , 1. , 1. ,
1. , 1. , 1. , 1. , 1. ,
1. , 1. , 1. , 1. , 1. ,
1. , 1. , 1. , 1. , 1. ,
1. , 1. , 1. , 0.99999995, 0.9999999 ,
0.99999999, 1. , 1. , 1. , 1. ,
1. , 1. , 1. , 1. , 1. ,
1. , 1. , 1. , 1. , 1. ,
1. , 1. , 1. , 1. , 1. ,
1. , 1. , 1. , 1. , 1. ]))
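
Note that cooks_distance returns a pair of arrays: the first holds each row’s Cook’s distance and the second the corresponding p-values. The steps below index cooks[0] for the distances; you can also unpack the pair for readability:

python
# first array: Cook's distance per row; second array: p-values
cooks_d, cooks_pvals = cooks
print(cooks_d[:5])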

Step 4: Visualize Cook’s distance and find influential points

Rows whose Cook’s distance is greater than 4 times the mean Cook’s distance are flagged as influential points (a common rule of thumb).

python
# Draw plot
plt.figure(figsize = (12, 8))
plt.scatter(df.index, cooks[0])
plt.plot(df.index, cooks[0], color='black')
plt.xlabel('Row Number', fontsize = 12)
plt.ylabel("Cook's Distance", fontsize = 12)
plt.title('Influential Points', fontsize = 22)
plt.show()

Mean Cook’s distance

python
mean_cooks = np.mean(cooks[0])
mean_cooks

0.00990911857380304

python
# threshold line: 4 x mean Cook's distance, repeated once per row for plotting
mean_cooks_list = [4 * mean_cooks for i in df.index]

Visualize the influence scores against the threshold

python
# Draw plot
plt.figure(figsize = (12, 8))
plt.scatter(df.index, cooks[0])
plt.plot(df.index, mean_cooks_list, color="red")
plt.xlabel('Row Number', fontsize = 12)
plt.ylabel("Cook's Distance", fontsize = 12)
plt.title('Influential Points', fontsize = 22)
plt.show()

python
# Influential points: rows whose Cook's distance exceeds 4 x the mean
influential_points = df.index[cooks[0] > 4 * mean_cooks]
influential_points

Int64Index([15, 17, 29, 46, 65, 67, 120, 121, 124], dtype='int64')

python
df.iloc[influential_points, :]
| | symboling | wheel-base | engine-size | bore | stroke | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 15 | 0 | 103.5 | 209 | 3.62 | 3.39 | 8.0 | 182 | 5400 | 16 | 22 | 41315 |
| 17 | 2 | 88.4 | 61 | 2.91 | 3.03 | 9.5 | 48 | 5100 | 47 | 53 | 5151 |
| 29 | 2 | 86.6 | 92 | 2.91 | 3.41 | 9.6 | 58 | 4800 | 49 | 54 | 6479 |
| 46 | 0 | 102.0 | 326 | 3.54 | 2.76 | 11.5 | 262 | 5000 | 13 | 17 | 36000 |
| 65 | 3 | 96.6 | 234 | 3.46 | 3.10 | 8.3 | 155 | 4750 | 16 | 18 | 35056 |
| 67 | 1 | 112.0 | 304 | 3.80 | 3.35 | 8.0 | 184 | 4500 | 14 | 16 | 45400 |
| 120 | 3 | 89.5 | 194 | 3.74 | 2.90 | 9.5 | 207 | 5900 | 17 | 25 | 34028 |
| 121 | 3 | 89.5 | 194 | 3.74 | 2.90 | 9.5 | 207 | 5900 | 17 | 25 | 37028 |
| 124 | 3 | 99.1 | 121 | 2.54 | 2.07 | 9.3 | 110 | 5250 | 21 | 28 | 15040 |

Let’s look at the non-influential points as well.

When you compare the two groups, several features of the influential data points sit toward the extremes; a quick comparison follows the table below.

python
# Non-influential points: rows at or below the threshold
non_influential_points = df.index[cooks[0] < 4 * mean_cooks]
non_influential_points
df.iloc[non_influential_points, :].head()
| | symboling | wheel-base | engine-size | bore | stroke | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 88.6 | 130 | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
| 1 | 3 | 88.6 | 130 | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |
| 2 | 1 | 94.5 | 152 | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500 |
| 3 | 2 | 99.8 | 109 | 3.19 | 3.40 | 10.0 | 102 | 5500 | 24 | 30 | 13950 |
| 4 | 2 | 99.4 | 136 | 3.19 | 3.40 | 8.0 | 115 | 5500 | 18 | 22 | 17450 |
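
One quick way to check this, sketched below with the index arrays computed above and a few hypothetically chosen columns, is to compare summary statistics of the two groups:

python
# compare a few numeric features across influential vs. other rows
num = df[['price', 'horsepower', 'engine-size']].astype(float)
print(num.loc[influential_points].describe())       # influential rows
print(num.loc[non_influential_points].describe())   # all other rows

In the influential group, columns like price, horsepower, and engine-size tend to sit near the extremes of their ranges, consistent with the table above.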