PySpark Ridge Regression – Building, Tuning, and Evaluating Ridge Regression with PySpark MLlib

Explore how to build, tune, and evaluate a Ridge Regression model using PySpark MLlib, a powerful library for machine learning and data processing in Apache Spark.

Written by Jagdeesh | 4 min read

Ridge Regression is an extension of linear regression that adds an L2 regularization term to the loss function. The penalty shrinks the magnitude of the model’s coefficients, which helps prevent overfitting.

We will cover the following topics in this post:

Setting up the environment
Loading and preprocessing the data
Creating a Ridge Regression model
Hyperparameter tuning
Evaluating the model
Example code

1. Import required libraries and initialize SparkSession

First, let’s import the necessary libraries and create a SparkSession, the entry point to use PySpark.

python

import findspark
findspark.init()

from pyspark import SparkFiles
from pyspark.sql import SparkSession
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

spark = SparkSession.builder \
    .appName("Ridge Regression with PySpark MLlib") \
    .getOrCreate()

2. Load the dataset

For this example, we will use the “Boston Housing” dataset. The code below fetches the CSV file from a public URL with addFile, then reads it into a PySpark DataFrame from the local path returned by SparkFiles.get.

python

url = "https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv"
spark.sparkContext.addFile(url)

data = spark.read.csv(SparkFiles.get("BostonHousing.csv"), header=True, inferSchema=True)
data.show(5)

Output:

+-------+----+-----+----+-----+-----+----+------+---+---+-------+------+-----+----+
|   crim|  zn|indus|chas|  nox|   rm| age|   dis|rad|tax|ptratio|     b|lstat|medv|
+-------+----+-----+----+-----+-----+----+------+---+---+-------+------+-----+----+
|0.00632|18.0| 2.31|   0|0.538|6.575|65.2|  4.09|  1|296|   15.3| 396.9| 4.98|24.0|
|0.02731| 0.0| 7.07|   0|0.469|6.421|78.9|4.9671|  2|242|   17.8| 396.9| 9.14|21.6|
|0.02729| 0.0| 7.07|   0|0.469|7.185|61.1|4.9671|  2|242|   17.8|392.83| 4.03|34.7|
|0.03237| 0.0| 2.18|   0|0.458|6.998|45.8|6.0622|  3|222|   18.7|394.63| 2.94|33.4|
|0.06905| 0.0| 2.18|   0|0.458|7.147|54.2|6.0622|  3|222|   18.7| 396.9| 5.33|36.2|
+-------+----+-----+----+-----+-----+----+------+---+---+-------+------+-----+----+
only showing top 5 rows

3. Prepare the data

Before building the model, we need to assemble the input features into a single feature vector using the VectorAssembler class. Then, we will split the dataset into a training set (80%) and a testing set (20%).

python

# Define the feature and label columns & Assemble the feature vector
assembler = VectorAssembler(
    inputCols=["crim", "zn", "indus", "chas", "nox", "rm", "age", "dis", "rad", "tax", "ptratio", "b", "lstat"],
    outputCol="features")

data = assembler.transform(data)
final_data = data.select("features", "medv")

# Split the data into training and test sets
train_data, test_data = final_data.randomSplit([0.8, 0.2], seed=42)
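
Note that the 80/20 weights describe sampling proportions, so the two sets are only approximately that size. The sketch below imitates the idea in plain Python by assigning each row with an independent random draw. It is an illustration under that assumption, not Spark’s actual implementation, and `approximate_split` is a hypothetical helper name.

```python
import random

def approximate_split(rows, train_weight=0.8, seed=42):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        # Each row lands in the training set with probability train_weight
        (train if rng.random() < train_weight else test).append(row)
    return train, test

rows = list(range(506))  # the Boston Housing data has 506 rows
train, test = approximate_split(rows)

# The sizes always sum to 506 but are only roughly 80% and 20%
print(len(train), len(test))
```

Fixing the seed makes the split reproducible, which is why the original code passes seed=42.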

4. Creating a Ridge Regression model

Now, let’s create a Ridge Regression model using the LinearRegression class from PySpark MLlib. We’ll set elasticNetParam to 0, which makes the penalty pure L2 and the model equivalent to Ridge Regression. The strength of the penalty is controlled separately by regParam, which defaults to 0 (no regularization); we tune it in the next step.

Ridge Regression : elasticNetParam set to 0, the model uses only L2 regularization (Ridge)

Lasso Regression : elasticNetParam set to 1, the model uses only L1 regularization (Lasso)

ElasticNet : elasticNetParam set to a value between 0 and 1 (for example 0.5), the model uses a combination of L1 and L2 regularization
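
For reference, the Spark MLlib documentation describes the objective minimized by LinearRegression in roughly the following form. Here λ corresponds to regParam and α to elasticNetParam; the symbols n, w, x_i and y_i are notation chosen for this post, and the intercept is omitted for brevity.

```latex
\min_{w} \; \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - w^{\top} x_i \right)^2
  + \lambda \left( \frac{1 - \alpha}{2} \lVert w \rVert_2^2 + \alpha \lVert w \rVert_1 \right)
```

With α = 0 only the squared L2 norm remains (Ridge), and with α = 1 only the L1 norm remains (Lasso).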

python

ridge_regression = LinearRegression(featuresCol="features", labelCol="medv", elasticNetParam=0)
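
To see what the L2 penalty does to a coefficient, here is a minimal pure-Python sketch for a single feature with no intercept, where the ridge solution has a simple closed form. The helper `ridge_slope` and the toy data are assumptions made for illustration; the sketch also ignores the feature standardization that Spark applies by default.

```python
def ridge_slope(xs, ys, reg_param):
    """Closed-form ridge slope for one feature, no intercept.

    Minimizes (1 / 2n) * sum((y - w * x)^2) + (reg_param / 2) * w^2.
    """
    n = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + n * reg_param)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # toy data that follows y = 2x exactly

reg_params = [0.0, 0.1, 1.0, 10.0]
slopes = [ridge_slope(xs, ys, lam) for lam in reg_params]

# The slope starts at 2.0 with no penalty and shrinks toward 0 as reg_param grows
for lam, w in zip(reg_params, slopes):
    print(f"reg_param={lam}: slope={w:.4f}")
```

A larger penalty trades some fit on the training data for smaller coefficients, which is the mechanism behind the reduced overfitting.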

5. Hyperparameter tuning

To find the best regParam value for our Ridge Regression model, we’ll perform a grid search using cross-validation. We’ll use the ParamGridBuilder and CrossValidator classes from PySpark MLlib, scoring each candidate by its RMSE across 5 folds.

python

# Define the hyperparameter grid
param_grid = ParamGridBuilder() \
    .addGrid(ridge_regression.regParam, [0.001, 0.01, 0.1, 1.0]) \
    .build()

# Create the cross-validator
evaluator = RegressionEvaluator(predictionCol="prediction", labelCol="medv", metricName="rmse")
cross_validator = CrossValidator(estimator=ridge_regression,
                                 estimatorParamMaps=param_grid,
                                 evaluator=evaluator,
                                 numFolds=5)

# Fit every candidate with 5-fold cross-validation and keep the best model
cv_model = cross_validator.fit(train_data)
ridge_model = cv_model.bestModel
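
The grid search above can be pictured with a small pure-Python sketch: for each candidate value, fit on all folds but one, score on the held-out fold, average the scores, and keep the candidate with the lowest average RMSE. This is a simplified illustration rather than CrossValidator’s implementation. The helper names, the one-feature model without intercept, the toy data, and the deterministic fold assignment are all assumptions.

```python
import math

def fit_ridge(xs, ys, reg_param):
    """One-feature ridge fit without intercept (closed form)."""
    n = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + n * reg_param)

def rmse_for_slope(xs, ys, slope):
    """Root mean squared error of the predictions slope * x."""
    return math.sqrt(sum((y - slope * x) ** 2 for x, y in zip(xs, ys)) / len(xs))

def cross_validate(xs, ys, grid, num_folds=5):
    """Return the best candidate and the average RMSE of every candidate."""
    avg_rmse = {}
    pairs = list(zip(xs, ys))
    for reg_param in grid:
        fold_scores = []
        for fold in range(num_folds):
            # Row i is held out when i % num_folds == fold
            train = [p for i, p in enumerate(pairs) if i % num_folds != fold]
            held = [p for i, p in enumerate(pairs) if i % num_folds == fold]
            slope = fit_ridge([x for x, _ in train], [y for _, y in train], reg_param)
            fold_scores.append(rmse_for_slope([x for x, _ in held], [y for _, y in held], slope))
        avg_rmse[reg_param] = sum(fold_scores) / num_folds
    best = min(avg_rmse, key=avg_rmse.get)
    return best, avg_rmse

xs = [float(i) for i in range(1, 21)]
ys = [3.0 * x for x in xs]  # noiseless toy data, so the weakest penalty wins

best, scores = cross_validate(xs, ys, [0.001, 0.01, 0.1, 1.0])
print("best reg_param:", best)
```

On real, noisy data the weakest penalty does not always win, which is the reason to search over several values.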

6. Inspect the model coefficients and intercept

To better understand the Ridge regression model, you can examine its coefficients and intercept. These values represent the weights assigned to each feature and the bias term, respectively.

python

coefficients = ridge_model.coefficients
intercept = ridge_model.intercept

print("Coefficients: ", coefficients)
print("Intercept: {:.3f}".format(intercept))

Output:

Coefficients:  [-0.10761281681385497,0.04551051167424685,0.0037888509429820205,2.8774619142675006,-17.08737967764298,3.576754283556864,0.0039031613315017055,-1.3505701200658626,0.2886036433948176,-0.011510286387703154,-0.9329729140177865,0.008572548049794496,-0.5098718789059755]
Intercept: 36.636
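
A linear model turns these numbers into a prediction by multiplying each feature value by its coefficient, summing the products, and adding the intercept. The sketch below shows that arithmetic in plain Python; `predict` is a hypothetical helper and the two-feature values are made up, not taken from the fitted model.

```python
def predict(features, coefficients, intercept):
    """Linear prediction: intercept plus the sum of coefficient * feature value."""
    return intercept + sum(c * x for c, x in zip(coefficients, features))

# Hypothetical two-feature model, not the fitted Boston Housing model
example = predict([1.0, 2.0], [0.5, -1.0], 3.0)
print(example)  # 3.0 + 0.5 * 1.0 + (-1.0) * 2.0 = 1.5
```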

7. Analyze feature importance

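To get a rough sense of which features contribute most to the model’s predictions, you can compare the absolute values of the coefficients. Keep in mind that the coefficients are reported on each feature’s original scale, so a feature with a narrow range of values (such as nox) can have a large coefficient without necessarily having a large effect on the target. The ranking is directly comparable only across features with similar scales.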

python

# Pair each assembled input column with the absolute value of its coefficient
feature_importance = sorted(zip(assembler.getInputCols(), map(abs, coefficients)), key=lambda x: x[1], reverse=True)

print("Feature Importance:")
for feature, importance in feature_importance:
    print("  {}: {:.3f}".format(feature, importance))

Output:

Feature Importance:
  nox: 17.087
  rm: 3.577
  chas: 2.877
  dis: 1.351
  ptratio: 0.933
  lstat: 0.510
  rad: 0.289
  crim: 0.108
  zn: 0.046
  tax: 0.012
  b: 0.009
  age: 0.004
  indus: 0.004
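
Because raw coefficients depend on the units of each feature, one common adjustment is to multiply each absolute coefficient by the standard deviation of its feature before ranking. The sketch below shows the idea in plain Python. The helper `scaled_importance`, the feature names, and all numbers are made up for illustration and do not come from the model above.

```python
from statistics import pstdev

def scaled_importance(columns, coefficients):
    """Rank features by |coefficient| * standard deviation of the feature.

    columns maps a feature name to its observed values.
    coefficients maps a feature name to its coefficient on the original scale.
    """
    scores = {
        name: abs(coefficients[name]) * pstdev(values)
        for name, values in columns.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Made-up values: one feature with a narrow range, one with a wide range
columns = {
    "small_range": [0.4, 0.5, 0.6, 0.5],
    "large_range": [200.0, 400.0, 600.0, 400.0],
}
coefficients = {"small_range": -17.0, "large_range": -0.01}

ranking = scaled_importance(columns, coefficients)
for name, score in ranking:
    print(f"  {name}: {score:.3f}")
```

In this toy case the feature with the tiny raw coefficient ranks first once scale is taken into account, which shows how a ranking based on raw coefficients alone can mislead.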

8. Evaluating the model

To evaluate the performance of our Ridge Regression model, we’ll use the RegressionEvaluator class from PySpark MLlib.

python

# Make predictions on the test data
predictions = ridge_model.transform(test_data)

# Evaluate the model
rmse = evaluator.evaluate(predictions)
r2 = RegressionEvaluator(predictionCol="prediction", labelCol="medv", metricName="r2").evaluate(predictions)

print("Root Mean Squared Error (RMSE):", rmse)
print("Coefficient of Determination (R2):", r2)

Output:

Root Mean Squared Error (RMSE): 4.67222227683228
Coefficient of Determination (R2): 0.7931154341682369
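
For intuition about the two metrics, the sketch below computes them by hand in plain Python using their standard definitions. The helper names and the four-row example are assumptions for illustration; RegressionEvaluator performs the equivalent calculation on the prediction and label columns.

```python
import math

def compute_rmse(labels, predictions):
    """Square root of the mean squared difference between labels and predictions."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(labels))

def compute_r2(labels, predictions):
    """1 minus the ratio of residual variation to total variation around the mean."""
    mean_label = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    return 1.0 - ss_res / ss_tot

# Made-up labels and predictions: three exact hits and one miss
labels = [1.0, 2.0, 3.0, 4.0]
predictions = [1.0, 2.0, 3.0, 2.0]

rmse_value = compute_rmse(labels, predictions)
r2_value = compute_r2(labels, predictions)
print(rmse_value, r2_value)
```

RMSE is expressed in the units of the label, so lower is better, while R2 measures the share of the label’s variance that the model explains, so values closer to 1 are better.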

9. Save and load the model (optional)

If you want to reuse the model in the future, you can save it to disk and load it back when needed.

python

# Save the model (overwrite() avoids an error if the path already exists)
ridge_model.write().overwrite().save("ridge_model")

# Load the model
from pyspark.ml.regression import LinearRegressionModel
loaded_model = LinearRegressionModel.load("ridge_model")

Conclusion

In this blog post, we have explored how to build, tune, and evaluate a Ridge Regression model using PySpark MLlib. We have covered setting up the environment, loading and preprocessing the data, creating the model, tuning hyperparameters with cross-validation and grid search, and evaluating the model’s performance.

Ridge Regression is an effective technique for handling multicollinearity and preventing overfitting in linear regression models. By leveraging PySpark MLlib, you can easily scale your machine learning tasks and apply them to big data scenarios.
