
How to detect outliers with z-score

Z-score, also called the standard score, is used to scale the features in a dataset. It can also be used to detect outliers.

Written by Selva Prabhakaran | 6 min read

Z-score, also called the standard score, is used to scale the features in a dataset for machine learning model training. It can also be used to detect outliers. In this article, we will first see how to compute Z-scores and then use them to detect outliers.

How is Z-score used in machine learning?

Different variables/features in a dataset have different ranges of values.

For example, a feature like 'Age' may vary from 1 to 90, whereas 'Income' may go all the way to tens of thousands. Looking at a particular value (or observation) of any given variable, it is difficult to say how far it is from the mean without actually computing the mean.

This is where Z-score helps.

It is used to standardize the variable, so that just by knowing the value of a particular observation, you get a sense of how far away it is from the mean.

More specifically, the Z-score tells how many standard deviations away a data point is from the mean.

The process of transforming a feature to its z-scores is called ‘Standardization’.

Z Score Formula

The formula for Z-score is as follows:

$$ Z = \frac{x - \text{mean}}{\text{std. deviation}} $$

If the z-score of a data point is more than 3 in absolute value, it indicates that the data point is quite different from the other data points. Such a data point can be an outlier.

Z-score can be both positive and negative. The farther it is from 0, the higher the chance of a given data point being an outlier. Typically, a Z-score beyond +/- 3 is considered extreme.
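To make this concrete, here is a minimal numpy sketch (the data is made up for illustration; it is not from the dataset used later in this article):

```python
import numpy as np

rng = np.random.default_rng(0)

# 200 hypothetical ages centred around 30, plus one injected extreme value
ages = np.append(rng.normal(loc=30, scale=5, size=200), 120)

# Z-score: how many standard deviations each point is from the mean
z = (ages - ages.mean()) / ages.std()

# The injected extreme value gets a very large z-score and is flagged
flagged = ages[np.abs(z) > 3]
print(flagged)
```

The injected value of 120 sits many standard deviations above the mean, so it shows up in `flagged`.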


For a normal distribution, it is estimated that:
a) 68% of the data points lie within +/- 1 standard deviation of the mean
b) 95% of the data points lie within +/- 2 standard deviations
c) 99.7% of the data points lie within +/- 3 standard deviations

Common data that approximately follow a normal distribution:
1. Heights of people
2. Income
3. Blood pressure
4. Salaries

(In practice, income and salaries are often right-skewed, so the normal approximation is rough for them.)
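The 68-95-99.7 rule is easy to verify empirically with simulated data (a quick numpy sketch, independent of the dataset used below):

```python
import numpy as np

# Simulate a large standard-normal sample and check the empirical shares
rng = np.random.default_rng(42)
x = rng.normal(loc=0, scale=1, size=100_000)

for k in (1, 2, 3):
    share = np.mean(np.abs(x) <= k)
    print(f"within +/-{k} std: {share:.1%}")
```

With 100,000 samples, the printed shares land very close to 68%, 95%, and 99.7%.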

Load the dataset (don't run this if you followed along previously)

python
# Import libraries 
# Data Manipulation
import numpy as np 
import pandas as pd
from pandas import DataFrame

# Data Visualization
import seaborn as sns
import matplotlib.pyplot as plt

# Maths
import math

# Set pandas options to show more rows and columns
pd.set_option('display.max_rows', 800)
pd.set_option('display.max_columns', 500)
%matplotlib inline

Read the data

python
# Read data in form of a csv file
df = pd.read_csv("../00_Datasets/Churn_Modelling_m.csv")

# First 5 rows of the dataset
df.head()
| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619.0 | France | Female | 42.0 | 2 | 0.00 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608.0 | Spain | Female | 41.0 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502.0 | France | NaN | NaN | 8 | 159660.80 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699.0 | France | NaN | 39.0 | 1 | 0.00 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850.0 | Spain | Female | 43.0 | 2 | NaN | 1 | 1 | 1 | 79084.10 | 0 |

Histogram

python
plt.hist(df.CreditScore, bins=20, rwidth=0.8)
plt.xlabel('CreditScore')
plt.ylabel('Count')
plt.title('Histogram - CreditScore')
plt.show()

Standard Deviation and Mean

python
np.nanstd(df.CreditScore.values.tolist())

Output:
96.6527190618191

python
np.nanmean(df.CreditScore.values)

Output:
650.5254525452546

Check for any infinity values.

python
np.isinf(df[['CreditScore']]).values.sum()
Output:
0

Let’s compute the Z-score. First, compute the mean and standard deviation.

python
# Compute Z Score
cr_mean = np.nanmean(df.CreditScore.values.tolist())
cr_std = np.nanstd(df.CreditScore.values.tolist())

print("Mean Credit Score is: ", cr_mean)
print("Std Credit Score is: ", cr_std)
Output:
Mean Credit Score is:  650.5254525452546
Std Credit Score is:  96.6527190618191

Calculate Z Score

From each observation, subtract the mean and divide by the standard deviation.

python
df['zscore_CreditScore'] = (df.CreditScore  - cr_mean ) / cr_std
df[["Surname", "CreditScore", "zscore_CreditScore"]].head()
| | Surname | CreditScore | zscore_CreditScore |
|---|---|---|---|
| 0 | Hargrave | 619.0 | -0.326172 |
| 1 | Hill | 608.0 | -0.439982 |
| 2 | Onio | 502.0 | -1.536692 |
| 3 | Boni | 699.0 | 0.501533 |
| 4 | Mitchell | 850.0 | 2.063828 |
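The same column of z-scores can also be obtained in one call with scipy (assuming scipy is installed; `nan_policy='omit'` skips NaN entries, similar to the nan-aware numpy functions used above). A small stand-in array is used here for illustration:

```python
import numpy as np
from scipy import stats

# Small stand-in series; in the article this would be df.CreditScore
scores = np.array([619.0, 608.0, 502.0, 699.0, 850.0])

z_manual = (scores - np.nanmean(scores)) / np.nanstd(scores)
z_scipy = stats.zscore(scores, nan_policy='omit')

# Both use the population standard deviation (ddof=0), so they agree
print(np.allclose(z_manual, z_scipy))  # True
```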

Extract the outliers/extreme values based on Z-score

Generally, we consider the values outside of +3 and -3 standard deviations to be extreme values. Let’s extract them.

python
# Extreme values based on credit score.
df[(df.zscore_CreditScore<-3) | (df.zscore_CreditScore>3)]
| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | zscore_CreditScore |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1405 | 1406 | 15612494 | Panicucci | 359.0 | France | Female | 44.0 | 6 | 128747.69 | 1 | 1 | 0 | 146955.71 | 1 | -3.016216 |
| 1631 | 1632 | 15685372 | Azubuike | 350.0 | Spain | Male | 54.0 | 1 | 152677.48 | 1 | 1 | 1 | 191973.49 | 1 | -3.109333 |
| 1838 | 1839 | 15758813 | Campbell | 350.0 | Germany | Male | 39.0 | 0 | 109733.20 | 2 | 0 | 0 | 123602.11 | 1 | -3.109333 |
| 1962 | 1963 | 15692416 | Aikenhead | 358.0 | Spain | Female | 52.0 | 8 | 143542.36 | 3 | 1 | 0 | 141959.11 | 1 | -3.026562 |
| 2473 | 2474 | 15679249 | Chou | 351.0 | Germany | Female | 57.0 | 4 | 163146.46 | 1 | 1 | 0 | 169621.69 | 1 | -3.098986 |
| 8723 | 8724 | 15803202 | Onyekachi | 350.0 | France | Male | 51.0 | 1 | 0.00 | 1 | 1 | 1 | 125823.79 | 1 | -3.109333 |
| 8762 | 8763 | 15765173 | Lin | 350.0 | France | Female | 60.0 | 3 | 0.00 | 1 | 0 | 0 | 113796.15 | 1 | -3.109333 |
| 9624 | 9625 | 15668309 | Maslow | 350.0 | France | Female | 40.0 | 0 | 111098.85 | 1 | 1 | 1 | 172321.21 | 1 | -3.109333 |

Treat Outliers

Find the Credit score value corresponding to z = 3 and -3. These will be the upper and lower caps.

python
z_3 = (3 * cr_std)+ (cr_mean)
print(z_3)

z_minus3 = (cr_mean) - (3 * cr_std)
print(z_minus3)
Output:
940.4836097307118
360.56729535979724

Replace the values by capping with the upper and lower limits.

python
## Cap Outliers (kept commented out so the later sections work on the original values)
# df.loc[df.zscore_CreditScore < -3, 'CreditScore'] = z_minus3
# df.loc[df.zscore_CreditScore > 3, 'CreditScore'] = z_3

What are different ways to treat outliers?

It is not always necessary to 'treat' outliers. If you feel that the outliers are valid data points and you want the ML algorithm to model and predict them, then there is no need to 'treat' them.

However, if you don't want your model to make such extreme predictions, then you should go ahead and treat them.

There are different ways to treat the outliers:

  1. Remove the observations containing the outliers.

  2. Quantile-based capping of the extreme values. For example, all values greater than the 99th percentile can be replaced with the 99th-percentile value, or all values with a z-score greater than 3 can be replaced with the value at z = 3.

  3. Treat the value as a missing value and use the different imputation methods.

Remove the outlier observations

In the previous section, you computed the z-score. All you have to do is remove the points whose z-score is more than 3 or less than -3; in other words, keep only the points whose z-score lies between -3 and 3.

python
# Keep only the rows that are NOT outliers
new_df = df[~((df.zscore_CreditScore < -3) | (df.zscore_CreditScore > 3))]
new_df.head()

| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | zscore_CreditScore |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619.0 | France | Female | 42.0 | 2 | 0.00 | 1 | 1 | 1 | 101348.88 | 1 | -0.326172 |
| 1 | 2 | 15647311 | Hill | 608.0 | Spain | Female | 41.0 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 | -0.439982 |
| 2 | 3 | 15619304 | Onio | 502.0 | France | NaN | NaN | 8 | 159660.80 | 3 | 1 | 0 | 113931.57 | 1 | -1.536692 |
| 3 | 4 | 15701354 | Boni | 699.0 | France | NaN | 39.0 | 1 | 0.00 | 2 | 0 | 0 | 93826.63 | 0 | 0.501533 |
| 4 | 5 | 15737888 | Mitchell | 850.0 | Spain | Female | 43.0 | 2 | NaN | 1 | 1 | 1 | 79084.10 | 0 | 2.063828 |

Quantile based capping

Cap the outliers with quantile values. Generally we go ahead with the 5th and 95th percentiles (or the 10th and 90th, as done here). You can change these as per your requirement.

python
# Computing 10th, 90th percentiles and replacing the outliers
lower_cap_percentile = np.nanpercentile(df['CreditScore'], 10)
upper_cap_percentile = np.nanpercentile(df['CreditScore'], 90)

print("10 percentile :", lower_cap_percentile)
print("90 percentile :", upper_cap_percentile)
Output:
10 percentile : 521.0
90 percentile : 778.0

Let's print the original values for row numbers 1406, 1632, and 1839, which contain outliers. Later we will print them again after the outlier treatment.

python
# original values
mask = df.RowNumber.isin([1406, 1632, 1839])
df.loc[mask, 'CreditScore']
Output:
1405    359.0
1631    350.0
1838    350.0
Name: CreditScore, dtype: float64

Do Outlier Capping

That is, values lower than lower_cap_percentile are replaced with lower_cap_percentile, and values higher than upper_cap_percentile are replaced with upper_cap_percentile.

python
# Outlier capping: cap both tails at the 10th and 90th percentiles
new_col = np.where(df['CreditScore'] < lower_cap_percentile, lower_cap_percentile, df['CreditScore'])
new_col = np.where(new_col > upper_cap_percentile, upper_cap_percentile, new_col)
df['CreditScore_capped'] = new_col
df.loc[mask, ['CreditScore', 'CreditScore_capped']]

| | CreditScore | CreditScore_capped |
|---|---|---|
| 1405 | 359.0 | 521.0 |
| 1631 | 350.0 | 521.0 |
| 1838 | 350.0 | 521.0 |

As you can see the outlier values have now been capped with the lower limit.
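pandas also offers a one-liner for the same capping via `Series.clip`, which caps both tails at once. A small stand-in series is used here; the cap values stand in for the 10th/90th percentiles computed earlier:

```python
import pandas as pd

s = pd.Series([350.0, 521.0, 619.0, 778.0, 850.0])

# Stand-ins for the 10th/90th percentile caps computed earlier
lower, upper = 521.0, 778.0
capped = s.clip(lower=lower, upper=upper)

print(capped.tolist())  # [521.0, 521.0, 619.0, 778.0, 778.0]
```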

Imputation based Approaches to treat outliers

There are more sophisticated methods of outlier treatment, especially when you think the recorded outlier value is an error and you want to replace it with what would have been an appropriate value.

You can use multivariate prediction approaches such as MICE. These have been discussed in detail in the missing value imputation methods (video) and MICE (video).
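As a minimal sketch of the simplest variant of this idea (treat extreme values as missing, then impute; the median below is a crude stand-in for the richer MICE-style methods, and the data is simulated for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# 300 plausible credit scores plus one injected erroneous value
s = pd.Series(np.append(rng.normal(loc=650, scale=97, size=300), 120.0))

z = (s - s.mean()) / s.std()

# Flag extreme values as missing, then impute them with the median of the rest
flagged = s.mask(z.abs() > 3)          # extremes become NaN
treated = flagged.fillna(flagged.median())

print(treated.iloc[-1])  # the injected 120.0 is replaced by the sample median
```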
