
Building and Evaluating a Gradient Boosting Model using PySpark MLlib: A Step-by-Step Guide

Let's discuss how to build and evaluate a Gradient Boosting model using PySpark MLlib, covering key aspects such as hyperparameter tuning and variable selection, with example code to help you along the way.

Gradient Boosting is a powerful machine learning technique that combines multiple weak learners into a strong predictor. PySpark MLlib is a popular tool for building machine learning models in the Spark framework.

In this blog post, we will explore how to build and evaluate a Gradient Boosting model using PySpark MLlib, including hyperparameter tuning and variable selection.

What is Gradient Boosting?

Gradient Boosting is a machine learning technique that builds an ensemble of weak learners, typically decision trees, to create a strong predictor. The algorithm starts by fitting a simple model to the data, such as a decision tree with one or two levels.

The residuals from this model are then used to train a second model, which is added to the ensemble. This process is repeated many times, with each new model trained on the residuals of the previous models. The final predictor is the sum of all the models in the ensemble.
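Concretely, if F_0 is the initial model, h_m is the tree fit to the residuals at round m, and eta is the learning rate (stepSize in MLlib), the final prediction after M rounds is F_M(x) = F_0(x) + eta*h_1(x) + eta*h_2(x) + ... + eta*h_M(x).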

We will cover the following steps in this post:

  1. Import required libraries and initialize SparkSession

  2. Load the dataset

  3. Prepare the data

  4. Build the Gradient Boosting model

  5. Evaluate the model

  6. Hyperparameter tuning and model selection

  7. Analyze feature importance

  8. Save and load the model

1. Import required libraries and initialize SparkSession

First, let’s import the necessary libraries and create a SparkSession, the entry point to use PySpark.

import findspark
findspark.init()
from pyspark import SparkFiles
from pyspark.sql import SparkSession

from pyspark.ml.regression import GBTRegressor
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

spark = SparkSession.builder.appName("GradientBoostingExample").getOrCreate()

2. Load the dataset

For this example, we will use the Boston Housing dataset. The following code downloads the CSV file and loads it into a PySpark DataFrame.

url = "https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv"
spark.sparkContext.addFile(url)

df = spark.read.csv(SparkFiles.get("BostonHousing.csv"), header=True, inferSchema=True)
df.show(5)
+-------+----+-----+----+-----+-----+----+------+---+---+-------+------+-----+----+
|   crim|  zn|indus|chas|  nox|   rm| age|   dis|rad|tax|ptratio|     b|lstat|medv|
+-------+----+-----+----+-----+-----+----+------+---+---+-------+------+-----+----+
|0.00632|18.0| 2.31|   0|0.538|6.575|65.2|  4.09|  1|296|   15.3| 396.9| 4.98|24.0|
|0.02731| 0.0| 7.07|   0|0.469|6.421|78.9|4.9671|  2|242|   17.8| 396.9| 9.14|21.6|
|0.02729| 0.0| 7.07|   0|0.469|7.185|61.1|4.9671|  2|242|   17.8|392.83| 4.03|34.7|
|0.03237| 0.0| 2.18|   0|0.458|6.998|45.8|6.0622|  3|222|   18.7|394.63| 2.94|33.4|
|0.06905| 0.0| 2.18|   0|0.458|7.147|54.2|6.0622|  3|222|   18.7| 396.9| 5.33|36.2|
+-------+----+-----+----+-----+-----+----+------+---+---+-------+------+-----+----+
only showing top 5 rows
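
Since we loaded the file with inferSchema=True, it is worth confirming the inferred column types:

df.printSchema()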

3. Prepare the data

Before building the model, we need to assemble the input features into a single feature vector using the VectorAssembler class. Then, we will split the dataset into a training set (70%) and a testing set (30%).

# Define the feature and label columns & Assemble the feature vector
assembler = VectorAssembler(inputCols=df.columns[:-1], outputCol='features')
data = assembler.transform(df)

# Split the data into training and test sets
train_data, test_data = data.randomSplit([0.7, 0.3], seed=42)
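
As an optional sanity check, you can peek at a few rows of the assembled feature vector:

# Inspect the assembled feature vector alongside the label
data.select("features", "medv").show(3, truncate=False)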

4. Build the Gradient Boosting model

Now, we can create a Gradient Boosting model using the GBTRegressor class.

# Create a Gradient Boosting model
gbt = GBTRegressor(featuresCol='features', labelCol='medv', seed=42)

# Train the model
model = gbt.fit(train_data)
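
For reference, GBTRegressor's main knobs default to 20 trees (maxIter=20), a tree depth of 5 (maxDepth=5), and a step size of 0.1 (stepSize=0.1). A minimal sketch with these defaults spelled out explicitly:

# Equivalent model with the default hyperparameters made explicit
gbt_explicit = GBTRegressor(featuresCol='features', labelCol='medv',
                            maxIter=20, maxDepth=5, stepSize=0.1, seed=42)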

5. Model Evaluation

Use the model to make predictions on the testing data.

# Make predictions on the testing data
predictions = model.transform(test_data)

# Evaluate the model
rmse_eval = RegressionEvaluator(labelCol='medv', metricName='rmse')
mae_eval = RegressionEvaluator(labelCol='medv', metricName='mae')

rmse = rmse_eval.evaluate(predictions)
mae = mae_eval.evaluate(predictions)


print("RMSE: {:.2f}".format(rmse))
print("MAE: {:.2f}".format(mae))
RMSE: 3.89
MAE: 2.41
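
RegressionEvaluator supports other metrics as well; for example, metricName='r2' reports the coefficient of determination:

# Optional: also report R^2 on the test set
r2_eval = RegressionEvaluator(labelCol='medv', metricName='r2')
print("R2: {:.2f}".format(r2_eval.evaluate(predictions)))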

6. Hyperparameter Tuning and Model Selection

Hyperparameter tuning is the process of selecting the best values for the parameters of a machine learning model.

In Gradient Boosting, the main hyperparameters are the number of trees, the learning rate, and the maximum depth of each tree. We can use cross-validation and grid search to find the best hyperparameters.

# Define the hyperparameter grid
param_grid = ParamGridBuilder() \
    .addGrid(gbt.maxDepth, [2, 4, 6]) \
    .addGrid(gbt.maxIter, [10, 50, 100]) \
    .addGrid(gbt.stepSize, [0.1, 0.01]) \
    .build()

# Define the evaluator
evaluator = RegressionEvaluator(labelCol='medv', metricName='rmse')

# Define the cross-validator
crossval = CrossValidator(estimator=gbt,
                          estimatorParamMaps=param_grid,
                          evaluator=evaluator,
                          numFolds=5, seed=42)

# Train the model using cross-validation
cv_model = crossval.fit(train_data)

# Make predictions on the testing data
cv_predictions = cv_model.transform(test_data)

cv_rmse = evaluator.evaluate(cv_predictions)
print("CV RMSE: {:.2f}".format(cv_rmse))
CV RMSE: 3.06

In this example, we used a ParamGridBuilder to define a grid of hyperparameters to search over.

We then used a CrossValidator to perform a 5-fold cross-validation and select the best hyperparameters. The numFolds parameter specifies the number of folds to use for cross-validation.
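
You can also inspect which hyperparameter combination won. In recent Spark versions (3.x), the best model exposes getters for its params; a quick sketch, assuming Spark 3.x:

# Inspect the winning hyperparameters (assumes Spark 3.x model getters)
best = cv_model.bestModel
print("maxDepth:", best.getMaxDepth())
print("maxIter:", best.getMaxIter())
print("stepSize:", best.getStepSize())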

7. Analyze feature importance

Variable selection is the process of selecting the most important features for a machine learning model. In Gradient Boosting, we can use feature importance scores to determine which features are the most important.

# Get feature importance scores
importances = model.featureImportances

# Create a list of feature names (the assembler's input columns)
features = df.columns[:-1]

# Print the feature importance scores
for feature, importance in zip(features, importances):
    print(feature, "{:.4f}".format(importance))
crim 0.0559
zn 0.0069
indus 0.0260
chas 0.0016
nox 0.0203
rm 0.2385
age 0.0538
dis 0.0550
rad 0.0204
tax 0.0405
ptratio 0.0718
b 0.0440
lstat 0.3652

The featureImportances attribute returns a vector of feature importance scores. We can then map these scores back to the feature names to determine which features are the most important.
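
To rank features rather than print them in column order, sort the (name, score) pairs; for example:

# Rank features by importance, highest first
ranked = sorted(zip(features, importances.toArray()), key=lambda kv: kv[1], reverse=True)
for name, score in ranked:
    print("{}: {:.4f}".format(name, score))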

8. Save and load the model (optional)

If you want to reuse the model in the future, you can save it to disk and load it back when needed.

# Save the model
model.save("GBT_model")

# Load the model
from pyspark.ml.regression import GBTRegressionModel
loaded_model = GBTRegressionModel.load("GBT_model")
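
Note that save() raises an error if the target path already exists. To overwrite a previously saved model, use the writer API:

# Overwrite an existing saved model
model.write().overwrite().save("GBT_model")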

Conclusion

In this blog post, we explored how to build and evaluate a Gradient Boosting model using PySpark MLlib, including hyperparameter tuning and variable selection.

We used the Boston Housing dataset as an example, but these techniques can be applied to any dataset. Gradient Boosting is a powerful machine learning technique that can be used for a wide range of predictive modeling tasks, and PySpark MLlib provides a flexible and scalable platform for building and deploying these models.
