The Z-score, also called the standard score, is used to scale the features in a dataset for machine learning model training. It can also be used to detect outliers. In this tutorial, we will first see how to compute Z-scores and then use them to detect outliers.
How is Z-score used in machine learning?
Different variables (features) in a dataset have different ranges of values.
For example, a feature like ‘Age’ may vary from 1 to 90, whereas ‘Income’ may go all the way to tens of thousands. Looking at a particular value (or observation) of a given variable, it is difficult to say how far it is from the mean without actually computing the mean.
This is where Z-score helps.
It is used to standardize the variable, so that just by looking at the standardized value of an observation, you get a sense of how far away it is from the mean.
More specifically, ‘Z score’ tells how many standard deviations away a data point is from the mean.
The process of transforming a feature to its z-scores is called ‘Standardization’.
Z Score Formula
The formula for Z-score is as follows:
$$ z = \frac{x - \mu}{\sigma} $$
where $x$ is the observed value, $\mu$ is the mean of the feature, and $\sigma$ is its standard deviation.
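As a quick illustration of the formula, here is a minimal sketch on a small made-up array. The values are hypothetical and are not taken from the dataset used later; the standard deviation is NumPy's default population version.

```python
import numpy as np

# Hypothetical ages, for illustration only
ages = np.array([22.0, 25.0, 31.0, 38.0, 44.0])

mean = ages.mean()   # 32.0
std = ages.std()     # population standard deviation (ddof=0)

# Z-score: how many standard deviations each value is from the mean
z_scores = (ages - mean) / std
print(z_scores)
```

After standardization the values have mean 0 and standard deviation 1, which is what makes features on different scales comparable.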
If the absolute value of the z-score of a data point is more than 3, it indicates that the data point is quite different from the other data points. Such a data point can be an outlier.
Z-score can be both positive and negative. The farther away from 0, the higher the chance of a given data point being an outlier. Typically, a Z-score greater than 3 or less than -3 is considered extreme.
source: pinterest graphic
A normal distribution is shown below, and for such a distribution approximately:
a) 68% of the data points lie within ±1 standard deviation of the mean
b) 95% of the data points lie within ±2 standard deviations of the mean
c) 99.7% of the data points lie within ±3 standard deviations of the mean
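These percentages can be checked numerically. The sketch below uses the identity P(|Z| < k) = erf(k / √2) for a standard normal variable; the helper name `prob_within` is just an illustrative choice.

```python
import math

def prob_within(k):
    """P(|Z| < k) for a standard normal variable Z."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within +/-{k} std: {prob_within(k):.4f}")
# within +/-1 std: 0.6827
# within +/-2 std: 0.9545
# within +/-3 std: 0.9973
```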
Common data that approximately follow a normal distribution:
1. Heights of people
2. Blood pressure
3. Income and salaries (only roughly; these are usually right-skewed, so check the histogram first)
Load the dataset (don't run this if you followed along previously)
# Import libraries
# Data Manipulation
import numpy as np
import pandas as pd
from pandas import DataFrame
# Data Visualization
import seaborn as sns
import matplotlib.pyplot as plt
# Maths
import math
# Set pandas options to show more rows and columns
pd.set_option('display.max_rows', 800)
pd.set_option('display.max_columns', 500)
%matplotlib inline
Read the data
# Read data in form of a csv file
df = pd.read_csv("../00_Datasets/Churn_Modelling_m.csv")
# First 5 rows of the dataset
df.head()
| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 15634602 | Hargrave | 619.0 | France | Female | 42.0 | 2 | 0.00 | 1 | 1 | 1 | 101348.88 | 1 |
1 | 2 | 15647311 | Hill | 608.0 | Spain | Female | 41.0 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
2 | 3 | 15619304 | Onio | 502.0 | France | NaN | NaN | 8 | 159660.80 | 3 | 1 | 0 | 113931.57 | 1 |
3 | 4 | 15701354 | Boni | 699.0 | France | NaN | 39.0 | 1 | 0.00 | 2 | 0 | 0 | 93826.63 | 0 |
4 | 5 | 15737888 | Mitchell | 850.0 | Spain | Female | 43.0 | 2 | NaN | 1 | 1 | 1 | 79084.10 | 0 |
Histogram
plt.hist(df.CreditScore, bins=20, rwidth=0.8)
plt.xlabel('CreditScore')
plt.ylabel('Count')
plt.title('Histogram - CreditScore')
plt.show()
Standard Deviation and Mean
np.nanstd(df.CreditScore.values.tolist())
96.6527190618191
np.nanmean(df.CreditScore.values)
650.5254525452546
Check for any infinity values.
np.isinf(df[['CreditScore']]).values.sum()
0
Let’s compute the Z-score. First, compute the mean and standard deviation.
# Compute Z Score
cr_mean = np.nanmean(df.CreditScore.values.tolist())
cr_std = np.nanstd(df.CreditScore.values.tolist())
print("Mean Credit Score is: ", cr_mean)
print("Std Credit Score is: ", cr_std)
Mean Credit Score is: 650.5254525452546
Std Credit Score is: 96.6527190618191
Calculate Z Score
From each observation, subtract the mean and divide by the standard deviation.
df['zscore_CreditScore'] = (df.CreditScore - cr_mean ) / cr_std
df[["Surname", "CreditScore", "zscore_CreditScore"]].head()
| | Surname | CreditScore | zscore_CreditScore |
---|---|---|---|
0 | Hargrave | 619.0 | -0.326172 |
1 | Hill | 608.0 | -0.439982 |
2 | Onio | 502.0 | -1.536692 |
3 | Boni | 699.0 | 0.501533 |
4 | Mitchell | 850.0 | 2.063828 |
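One detail worth knowing if you compute the same thing with pandas methods: `np.nanstd` defaults to the population standard deviation (`ddof=0`), while `Series.std()` defaults to the sample version (`ddof=1`). Below is a minimal sketch of a reusable helper on a small made-up series; the function name `zscore` and the data are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def zscore(s: pd.Series) -> pd.Series:
    """Standardize a series; NaNs are skipped in the stats and stay NaN."""
    return (s - s.mean()) / s.std(ddof=0)   # ddof=0 matches np.nanstd

# Hypothetical scores with one missing value
scores = pd.Series([619.0, 608.0, np.nan, 699.0, 850.0])
z = zscore(scores)
print(z)
```

With a few thousand rows the difference between `ddof=0` and `ddof=1` is tiny, but being explicit keeps the results consistent with the NumPy version used above.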
Extract the outliers/extreme values based on Z-score
Generally, we consider the values outside of +3 and -3 standard deviations to be extreme values. Let’s extract them.
# Extreme values based on credit score.
df[(df.zscore_CreditScore<-3) | (df.zscore_CreditScore>3)]
| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | zscore_CreditScore |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1405 | 1406 | 15612494 | Panicucci | 359.0 | France | Female | 44.0 | 6 | 128747.69 | 1 | 1 | 0 | 146955.71 | 1 | -3.016216 |
1631 | 1632 | 15685372 | Azubuike | 350.0 | Spain | Male | 54.0 | 1 | 152677.48 | 1 | 1 | 1 | 191973.49 | 1 | -3.109333 |
1838 | 1839 | 15758813 | Campbell | 350.0 | Germany | Male | 39.0 | 0 | 109733.20 | 2 | 0 | 0 | 123602.11 | 1 | -3.109333 |
1962 | 1963 | 15692416 | Aikenhead | 358.0 | Spain | Female | 52.0 | 8 | 143542.36 | 3 | 1 | 0 | 141959.11 | 1 | -3.026562 |
2473 | 2474 | 15679249 | Chou | 351.0 | Germany | Female | 57.0 | 4 | 163146.46 | 1 | 1 | 0 | 169621.69 | 1 | -3.098986 |
8723 | 8724 | 15803202 | Onyekachi | 350.0 | France | Male | 51.0 | 10 | 0.00 | 1 | 1 | 1 | 125823.79 | 1 | -3.109333 |
8762 | 8763 | 15765173 | Lin | 350.0 | France | Female | 60.0 | 3 | 0.00 | 1 | 0 | 0 | 113796.15 | 1 | -3.109333 |
9624 | 9625 | 15668309 | Maslow | 350.0 | France | Female | 40.0 | 0 | 111098.85 | 1 | 1 | 1 | 172321.21 | 1 | -3.109333 |
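The two-sided condition can also be written with the absolute value of the z-score. Here is a small sketch on synthetic data; the frame `demo` and its values are made up for illustration.

```python
import pandas as pd

# Synthetic example: 20 typical values plus one extreme value
demo = pd.DataFrame({"score": [650.0] * 10 + [660.0] * 10 + [100.0]})
demo["z"] = (demo["score"] - demo["score"].mean()) / demo["score"].std(ddof=0)

# |z| > 3 is equivalent to (z < -3) | (z > 3)
outliers = demo[demo["z"].abs() > 3]
print(outliers)
```

Only the last row (score 100.0) is flagged here; rows with a missing z-score would not be flagged, since comparisons with NaN evaluate to False.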
Treat Outliers
Find the Credit score value corresponding to z = 3 and -3. These will be the upper and lower caps.
z_3 = (3 * cr_std)+ (cr_mean)
print(z_3)
z_minus3 = (cr_mean) - (3 * cr_std)
print(z_minus3)
940.4836097307118
360.56729535979724
Replace the values by capping with the upper and lower limits.
## Cap outliers (kept commented out so that the later sections still work on the original values)
# df.loc[df.zscore_CreditScore < -3, 'CreditScore'] = z_minus3
# df.loc[df.zscore_CreditScore > 3, 'CreditScore'] = z_3
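Another way to cap at the ±3 limits is `Series.clip`, which handles both ends in one call. The following is a hedged sketch on synthetic data; the series and variable names are illustrative and are not the tutorial's dataframe.

```python
import pandas as pd

# Synthetic scores: mostly typical values with one very low and one very high
s = pd.Series([650.0] * 30 + [100.0, 1200.0])

mean, std = s.mean(), s.std(ddof=0)
lower, upper = mean - 3 * std, mean + 3 * std

# Values below `lower` become `lower`, values above `upper` become `upper`
capped = s.clip(lower=lower, upper=upper)
print(capped.min(), capped.max())
```

`clip` returns a new series, so the original column stays untouched unless you assign the result back.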
What are different ways to treat outliers?
It is not always a requirement to ‘treat’ outliers. If you feel that the outliers are valid data points and you want the ML algorithm to model and predict them, then there is no need to ‘treat’ them.
However, if you don’t want your model to make such extreme predictions, then you should go ahead and treat them.
There are different ways to treat the outliers:
- Remove the observations containing the outliers.
- Cap the extreme values, based on quantiles or z-scores. For example, all values greater than the 99th percentile can be replaced with the 99th percentile value, or all values with a z-score greater than 3 can be replaced with the value corresponding to a z-score of 3.
- Treat the value as a missing value and use any of the imputation methods.
Remove the outlier observations
In the previous section, you computed the z-score. All you have to do is remove the points which have a z-score more than 3 or less than -3. In other words, keep the points which have a z-score between -3 and 3.
# Keep only the rows within 3 standard deviations of the mean.
# Rows with a missing CreditScore have a NaN z-score and are dropped as well.
new_df = df[(df.zscore_CreditScore >= -3) & (df.zscore_CreditScore <= 3)]
new_df.head()
| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | zscore_CreditScore |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 15634602 | Hargrave | 619.0 | France | Female | 42.0 | 2 | 0.00 | 1 | 1 | 1 | 101348.88 | 1 | -0.326172 |
1 | 2 | 15647311 | Hill | 608.0 | Spain | Female | 41.0 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 | -0.439982 |
2 | 3 | 15619304 | Onio | 502.0 | France | NaN | NaN | 8 | 159660.80 | 3 | 1 | 0 | 113931.57 | 1 | -1.536692 |
3 | 4 | 15701354 | Boni | 699.0 | France | NaN | 39.0 | 1 | 0.00 | 2 | 0 | 0 | 93826.63 | 0 | 0.501533 |
4 | 5 | 15737888 | Mitchell | 850.0 | Spain | Female | 43.0 | 2 | NaN | 1 | 1 | 1 | 79084.10 | 0 | 2.063828 |
Quantile based capping
Cap the outliers with quantile values. Commonly, the 5% and 95% quantiles (or the 10% and 90% quantiles) are used, and you can change these as per your requirement.
# Computing 10th, 90th percentiles and replacing the outliers
lower_cap_percentile = np.nanpercentile(df['CreditScore'], 10)
upper_cap_percentile = np.nanpercentile(df['CreditScore'], 90)
print("10 percentile :", lower_cap_percentile)
print("90 percentile :", upper_cap_percentile)
10 percentile : 521.0
90 percentile : 778.0
Let’s print the original values for row numbers 1406, 1632, and 1839, all three of which are outliers. Later, we will print them again after outlier treatment.
# original values
mask = df.RowNumber.isin([1406, 1632, 1839])
df.loc[mask, 'CreditScore']
1405 359.0
1631 350.0
1838 350.0
Name: CreditScore, dtype: float64
Do Outlier Capping
That is, values lower than lower_cap_percentile will be replaced with lower_cap_percentile.
# Outlier capping
new_col = np.where(df['CreditScore'] < lower_cap_percentile, lower_cap_percentile, df['CreditScore'])
df['CreditScore_capped'] = new_col
df[['CreditScore', 'CreditScore_capped']][mask]
| | CreditScore | CreditScore_capped |
---|---|---|
1405 | 359.0 | 521.0 |
1631 | 350.0 | 521.0 |
1838 | 350.0 | 521.0 |
As you can see the outlier values have now been capped with the lower limit.
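The code above caps only the low end. To cap both tails at once, one option is `Series.clip` with the two percentile values. Here is a minimal sketch on made-up data; the series and variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Made-up scores including a missing value
scores = pd.Series([350.0, 520.0, 600.0, 650.0, 700.0, 780.0, 850.0, np.nan])

# nanpercentile ignores the missing value when computing the caps
low = np.nanpercentile(scores, 10)
high = np.nanpercentile(scores, 90)

# Cap both tails; NaN stays NaN
capped = scores.clip(lower=low, upper=high)
print(capped.tolist())
```

Values already inside the two caps pass through unchanged, so only the smallest and largest entries are altered in this example.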
Imputation based Approaches to treat outliers
There are more sophisticated ways to treat outliers, especially when you think the recorded outlier value is an error and you want to replace it with what would have been an appropriate value.
You can use multivariate prediction approaches such as MICE. These have been discussed in detail in the missing value imputation methods (video) and MICE (video).
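As a rough sketch of the "treat as missing, then impute" idea, the snippet below blanks out extreme values and then fills them. The median fill is only a simple stand-in chosen for brevity; it is not MICE, and a multivariate imputer would take its place in practice. The data and names are made up.

```python
import pandas as pd

# Made-up scores with one implausible entry (e.g. a data-entry error)
s = pd.Series([640.0] * 10 + [660.0] * 10 + [9999.0])

z = (s - s.mean()) / s.std(ddof=0)

# Step 1: mark outliers as missing
s_missing = s.mask(z.abs() > 3)

# Step 2: impute; the median is used here purely as a placeholder method
s_imputed = s_missing.fillna(s_missing.median())
print(s_imputed.tail(3))
```

Computing the median after masking matters: the replacement value is then estimated without the erroneous entry influencing it.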