# PySpark Variance Inflation Factor (VIF) – Understanding VIF and how it can help you improve your regression models

The concept of VIF is critical for understanding multicollinearity in regression models. Let’s break the concept down into simple terms, explain how to calculate VIF, and discuss its practical uses.

## What is Variance Inflation Factor (VIF)?

VIF is a measure that helps us understand the extent of multicollinearity in a multiple regression model. Multicollinearity occurs when two or more independent variables in the model are highly correlated with each other. This correlation can cause problems in interpreting the model, as it becomes difficult to determine the individual impact of each variable on the dependent variable.

VIF quantifies the severity of multicollinearity by measuring how much the variance of a regression coefficient is increased due to the presence of multicollinearity. In simpler terms, VIF tells us how much the uncertainty of a coefficient is inflated because of the correlation between independent variables.
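The "inflation" can be made explicit with the standard ordinary least squares result for the variance of a single coefficient. The notation below is introduced here for illustration and is not used elsewhere in this article:

$$
\operatorname{Var}(\hat{\beta}_j) = \frac{\sigma^2}{(n-1)\, s_j^2} \cdot \frac{1}{1 - R_j^2}
$$

Here $\sigma^2$ is the error variance, $n$ is the number of observations, $s_j^2$ is the sample variance of predictor $X_j$, and $R_j^2$ is the R-squared obtained by regressing $X_j$ on the other predictors. The second factor is the VIF of $X_j$: it equals 1 when $X_j$ is uncorrelated with the other predictors and grows without bound as $R_j^2$ approaches 1.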

## How to Calculate VIF?

Calculating VIF involves the following steps:

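1. Run a multiple regression model with all the independent variables. This is the model whose multicollinearity you want to diagnose; the VIF values themselves depend only on the independent variables, not on this fit.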

2. For each independent variable, run a separate regression model using it as the dependent variable and the remaining independent variables as predictors.

3. Calculate the R-squared value for each of these separate regression models.

4. Compute VIF for each independent variable using the following formula:

VIF = 1 / (1 – R-squared)
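To get a feel for the formula, here is a minimal sketch in plain Python that converts a few R-squared values into VIF values. The helper name `vif_from_r2` is hypothetical, and rejecting an R-squared of exactly 1 (perfect collinearity, where the VIF is undefined) is a design choice made for this example.

```python
def vif_from_r2(r_squared):
    """Return VIF = 1 / (1 - R^2) for an auxiliary-regression R-squared."""
    # R-squared of exactly 1 would mean division by zero (perfect collinearity).
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("r_squared must be in the interval [0, 1)")
    return 1.0 / (1.0 - r_squared)


for r2 in (0.0, 0.5, 0.8, 0.9):
    print(f"R-squared = {r2:.1f} -> VIF = {vif_from_r2(r2):.1f}")
# R-squared = 0.0 -> VIF = 1.0
# R-squared = 0.5 -> VIF = 2.0
# R-squared = 0.8 -> VIF = 5.0
# R-squared = 0.9 -> VIF = 10.0
```

An R-squared of 0.8 already corresponds to a VIF of 5, and 0.9 corresponds to a VIF of 10.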

Let’s take a look at an example to better understand the process.

Imagine we have a dataset with three independent variables: X1, X2, and X3. We want to calculate the VIF for each variable.

1. First, run a multiple regression with X1, X2, and X3 as independent variables and the dependent variable, Y.

2. Next, run three separate regression models:

a. X1 as the dependent variable, with X2 and X3 as predictors.

b. X2 as the dependent variable, with X1 and X3 as predictors.

c. X3 as the dependent variable, with X1 and X2 as predictors.

3. Calculate the R-squared values for each of these separate regression models.

4. Compute the VIF for each independent variable using the formula above.
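The X1, X2, X3 walk-through above can be sketched in plain Python, without Spark, to show what each auxiliary regression does. This is an illustration under stated assumptions, not the method used later in this article: the data is synthetic (X3 is deliberately built as a noisy copy of X1), and the helper names `solve_linear_system`, `r_squared`, and `vif_table` are hypothetical.

```python
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix so row operations act on both sides at once.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def r_squared(target, predictors):
    """R-squared of a least squares fit of target on predictors plus an intercept."""
    n = len(target)
    rows = [[1.0] + [p[i] for p in predictors] for i in range(n)]
    k = len(rows[0])
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(row[a] * row[b] for row in rows) for b in range(k)] for a in range(k)]
    xty = [sum(row[a] * y for row, y in zip(rows, target)) for a in range(k)]
    beta = solve_linear_system(xtx, xty)
    fitted = [sum(bj * xj for bj, xj in zip(beta, row)) for row in rows]
    mean_y = sum(target) / n
    ss_res = sum((y - f) ** 2 for y, f in zip(target, fitted))
    ss_tot = sum((y - mean_y) ** 2 for y in target)
    return 1.0 - ss_res / ss_tot


def vif_table(columns):
    """Map each column name to its VIF, using the other columns as predictors."""
    result = {}
    for name, values in columns.items():
        others = [v for other, v in columns.items() if other != name]
        result[name] = 1.0 / (1.0 - r_squared(values, others))
    return result


random.seed(0)
x1 = [random.gauss(0, 1) for _ in range(200)]
x2 = [random.gauss(0, 1) for _ in range(200)]
x3 = [a + 0.1 * random.gauss(0, 1) for a in x1]  # nearly a copy of X1

vifs = vif_table({"X1": x1, "X2": x2, "X3": x3})
for name, value in vifs.items():
    print(name, round(value, 2))
```

Because X3 is almost a copy of X1, both should receive a large VIF, while the independent X2 should stay close to 1.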

## What is the Use of VIF?

The primary use of VIF is to identify and mitigate multicollinearity in regression models. High VIF values indicate that an independent variable is highly correlated with other independent variables in the model.

As a rule of thumb, a VIF value above 10 suggests severe multicollinearity, while a value below 5 is generally considered acceptable.
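The rule of thumb can be written as a small helper, shown here as a sketch. The function name `describe_vif` and the example VIF values are hypothetical, and the "borderline" label for values between 5 and 10 is an assumption of this example, since the rule of thumb above does not name that range.

```python
def describe_vif(vif, acceptable_below=5.0, severe_above=10.0):
    """Label a VIF value using the 5 / 10 rule of thumb."""
    if vif > severe_above:
        return "severe"
    if vif < acceptable_below:
        return "acceptable"
    # Values from 5 to 10 inclusive: label chosen for this example only.
    return "borderline"


# Hypothetical VIF values, for illustration only.
example_vifs = {"X1": 1.3, "X2": 7.0, "X3": 31.0}
for name, vif in example_vifs.items():
    print(name, vif, describe_vif(vif))
# X1 1.3 acceptable
# X2 7.0 borderline
# X3 31.0 severe
```

The cutoffs are keyword arguments so that a stricter or looser convention can be substituted.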

By calculating VIF, you can:

1. Identify the variables that contribute the most to multicollinearity.

2. Decide whether to remove, combine, or transform variables to reduce multicollinearity.

3. Improve the interpretability and reliability of your regression models.

## 1. Import required libraries and initialize SparkSession

First, let’s import the necessary libraries and create a SparkSession, the entry point to use PySpark.

    import findspark
    findspark.init()

    from pyspark import SparkFiles
    from pyspark.sql import SparkSession
    from pyspark.ml.regression import LinearRegression
    from pyspark.ml.feature import VectorAssembler
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("VIF Calculation").getOrCreate()


## 2. Preparing the Sample Data

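To demonstrate how to calculate VIF, we’ll use the Iris sample dataset. First, let’s load the data into a DataFrame and list the predictor columns to check:

    # Sample dataset
    url = "https://raw.githubusercontent.com/selva86/datasets/master/Iris.csv"

    # Make the file available to Spark, then read it as a DataFrame
    spark.sparkContext.addFile(url)
    df = spark.read.csv(SparkFiles.get("Iris.csv"), header=True, inferSchema=True)

    df.show(5)

    columns = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]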

Output:

    +---+-------------+------------+-------------+------------+-----------+
    | Id|SepalLengthCm|SepalWidthCm|PetalLengthCm|PetalWidthCm|    Species|
    +---+-------------+------------+-------------+------------+-----------+
    |  1|          5.1|         3.5|          1.4|         0.2|Iris-setosa|
    |  2|          4.9|         3.0|          1.4|         0.2|Iris-setosa|
    |  3|          4.7|         3.2|          1.3|         0.2|Iris-setosa|
    |  4|          4.6|         3.1|          1.5|         0.2|Iris-setosa|
    |  5|          5.0|         3.6|          1.4|         0.2|Iris-setosa|
    +---+-------------+------------+-------------+------------+-----------+
    only showing top 5 rows


## 3. VIF Function

Let’s create a PySpark function to calculate VIF for the defined variables and eliminate variables iteratively based on a given VIF threshold.

    def calculate_vif(data, features, vif_threshold=5):
        """
        Calculate Variance Inflation Factor (VIF) for the defined variables
        and eliminate variables iteratively based on VIF threshold.

        :param data: A PySpark DataFrame containing the predictor variables
        :param features: A list of column names for the predictor variables
        :param vif_threshold: The VIF threshold for eliminating variables (default is 5)
        :return: 1. A list of remaining features after eliminating variables based on the VIF threshold
                 2. A DataFrame of remaining features and their VIF values
        """
        remaining_features = features[:]
        vif_values = []

        while True:
            vif_values.clear()
            for feature in remaining_features:
                # Assemble all other remaining features into a vector
                predictors = [c for c in remaining_features if c != feature]
                assembler = VectorAssembler(inputCols=predictors, outputCol="features")
                feature_data = assembler.transform(data)

                # Fit a linear regression with the current feature as the label
                lr = LinearRegression(featuresCol="features", labelCol=feature)
                lr_model = lr.fit(feature_data)

                # Calculate VIF for the feature from the R-squared of that fit
                vif = 1 / (1 - lr_model.summary.r2)
                vif_values.append((feature, vif))

            # Find the feature with the highest VIF (compare on the VIF, not the name)
            max_vif_feature, max_vif_value = max(vif_values, key=lambda item: item[1])

            # Eliminate the feature if its VIF is above the threshold
            if max_vif_value > vif_threshold:
                remaining_features.remove(max_vif_feature)
            else:
                break

        vif_df = spark.createDataFrame(vif_values, ['Variable', 'VIF'])
        return remaining_features, vif_df


## 4. How to use the calculate_vif function?

Before calling the calculate_vif function, make sure you understand its parameters and their types: data is a PySpark DataFrame, features is a list of column names, and vif_threshold is a number (default 5).

The function returns two objects:
1. A Python list of the features that remain after eliminating variables based on the VIF threshold
2. A PySpark DataFrame of the remaining features and their VIF values

Let’s look at the example below to better understand how to run it:

    remaining_features, vif_values = calculate_vif(df, columns, vif_threshold=5)

    print("Remaining features after VIF elimination:", remaining_features)
    vif_values.show()

Output:

    Remaining features after VIF elimination: ['SepalLengthCm', 'SepalWidthCm', 'PetalWidthCm']
    +-------------+------------------+
    |     Variable|               VIF|
    +-------------+------------------+
    |SepalLengthCm| 3.414225339022868|
    | SepalWidthCm|1.2945066660800224|
    | PetalWidthCm|3.8646777121823304|
    +-------------+------------------+


## Conclusion

Variance Inflation Factor (VIF) is a vital tool for detecting multicollinearity in multiple regression models. By understanding VIF and how to calculate it, you can build more interpretable and robust regression models, making your data analysis more reliable and insightful.
