
PySpark Pivot – A Detailed Guide to Harnessing the Power of PySpark's Pivot Function

Pivoting is a data transformation technique that involves converting rows into columns. PySpark’s ability to pivot DataFrames enables you to reshape data for more convenient analysis.

What is Pivoting?

Pivoting converts the distinct values of one column into multiple output columns, effectively turning rows into columns. This operation is valuable when reorganizing data for enhanced readability, aggregation, or analysis.

The PySpark Pivot Function

The pivot function in PySpark is a method available on GroupedData objects, allowing you to execute a pivot operation on a DataFrame. Its general syntax is:

GroupedData.pivot(pivot_col, values=None)

Arguments:

pivot_col: The column you wish to pivot.

values: An optional list of values to include in the pivoted DataFrame. If not specified, Spark first computes all unique values of the pivot column (which costs an extra pass over the data) and uses them as the output columns.

To utilize the pivot function, you must first group your DataFrame using the groupBy function. Next, you can call the pivot function on the GroupedData object, followed by the aggregation function.
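As a minimal sketch of this pattern (group_col, pivot_col, and value_col are placeholder names, not columns from the dataset below):

# group -> pivot -> aggregate
pivoted = df.groupBy("group_col").pivot("pivot_col").sum("value_col")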

Importing necessary libraries and creating a sample DataFrame

# findspark locates the local Spark installation and adds it to sys.path
import findspark
findspark.init()

from pyspark.sql import SparkSession

# Initialize SparkSession
spark = SparkSession.builder.appName("PySpark Pivot Example").getOrCreate()

# Sample data
data = [
    (2021, 1, "US", 5000),
    (2021, 1, "EU", 4000),
    (2021, 2, "US", 5500),
    (2021, 2, "EU", 4500),
    (2021, 3, "US", 6000),
    (2021, 3, "EU", 5000),
    (2021, 4, "US", 7000),
    (2021, 4, "EU", 6000),
]

# Create DataFrame
columns = ["year", "quarter", "region", "revenue"]
df = spark.createDataFrame(data, columns)
df.show()
+----+-------+------+-------+
|year|quarter|region|revenue|
+----+-------+------+-------+
|2021|      1|    US|   5000|
|2021|      1|    EU|   4000|
|2021|      2|    US|   5500|
|2021|      2|    EU|   4500|
|2021|      3|    US|   6000|
|2021|      3|    EU|   5000|
|2021|      4|    US|   7000|
|2021|      4|    EU|   6000|
+----+-------+------+-------+

1. How to Pivot a PySpark DataFrame?

Pivoting a DataFrame (i.e., converting it from long format to wide format) turns rows into columns. Here, each distinct value of the region column becomes its own revenue column:

# Execute the pivot operation

pivot_df = df.groupBy("year", "quarter").pivot("region").sum("revenue")

pivot_df.show()
+----+-------+----+----+
|year|quarter|  EU|  US|
+----+-------+----+----+
|2021|      2|4500|5500|
|2021|      1|4000|5000|
|2021|      3|5000|6000|
|2021|      4|6000|7000|
+----+-------+----+----+

As shown above, we have successfully pivoted the data by region, displaying the revenue for each quarter in separate columns for the US and EU regions.
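If you already know which pivot values you need, pass them explicitly as the second argument. This fixes the column order and is more efficient, because Spark skips the internal pass that computes the distinct values of the pivot column:

# Pivot with an explicit values list (US before EU, no distinct-values scan)
pivot_df2 = df.groupBy("year", "quarter").pivot("region", ["US", "EU"]).sum("revenue")
pivot_df2.show()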

2. How to Unpivot a PySpark DataFrame?

Unpivot is the reverse operation of pivot.

To unpivot a DataFrame (i.e., convert it from wide format back to long format), you can use the SQL stack function in combination with select and expr:

# Method 1: Unpivot PySpark DataFrame with the SQL stack() function

from pyspark.sql.functions import expr

# stack(2, ...) emits two rows per input row: ('EU', EU) and ('US', US)
unpivotExpr = "stack(2, 'EU', EU, 'US', US) as (region, revenue)"

unPivotDF = pivot_df.select("year", "quarter", expr(unpivotExpr)).where("revenue is not null")

unPivotDF.show()
+----+-------+------+-------+
|year|quarter|region|revenue|
+----+-------+------+-------+
|2021|      2|    EU|   4500|
|2021|      2|    US|   5500|
|2021|      1|    EU|   4000|
|2021|      1|    US|   5000|
|2021|      3|    EU|   5000|
|2021|      3|    US|   6000|
|2021|      4|    EU|   6000|
|2021|      4|    US|   7000|
+----+-------+------+-------+

# Method 2: Unpivot PySpark DataFrame with an explode()-based melt helper

from pyspark.sql.functions import array, col, explode, lit, struct
from pyspark.sql import DataFrame
from typing import Iterable

def melt_df(
        df: DataFrame,
        id_vars: Iterable[str], value_vars: Iterable[str],
        var_name: str = "variable", value_name: str = "value") -> DataFrame:
    """Convert a DataFrame from wide to long format, like pandas melt."""
    # Build an array of (variable, value) structs, one struct per value column
    _vars_and_vals = array(*(
        struct(lit(c).alias(var_name), col(c).alias(value_name))
        for c in value_vars))

    # explode() turns each struct in the array into its own row
    _tmp = df.withColumn("_vars_and_vals", explode(_vars_and_vals))

    # Keep the id columns and unpack the struct fields into named columns
    cols = list(id_vars) + [
        col("_vars_and_vals")[x].alias(x) for x in [var_name, value_name]]
    return _tmp.select(*cols)

df_melt = melt_df(pivot_df, id_vars=['year', 'quarter'], value_vars=['EU', 'US'],
                  var_name='region', value_name='revenue')
df_melt.show()

+----+-------+------+-------+
|year|quarter|region|revenue|
+----+-------+------+-------+
|2021|      2|    EU|   4500|
|2021|      2|    US|   5500|
|2021|      1|    EU|   4000|
|2021|      1|    US|   5000|
|2021|      3|    EU|   5000|
|2021|      3|    US|   6000|
|2021|      4|    EU|   6000|
|2021|      4|    US|   7000|
+----+-------+------+-------+
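If you are on PySpark 3.4 or later, the DataFrame API also ships a built-in unpivot method (with melt as an alias), which replaces both hand-rolled approaches above:

# Method 3: built-in unpivot (requires PySpark 3.4+)
unpivot_builtin = pivot_df.unpivot(
    ids=["year", "quarter"],     # identifier columns kept as-is
    values=["EU", "US"],         # wide columns collapsed into rows
    variableColumnName="region",
    valueColumnName="revenue",
)
unpivot_builtin.show()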

Limitations and Best Practices

Although pivoting is a powerful data transformation technique, it has some limitations:

  1. Pivoting may result in a wide DataFrame with numerous columns, which can be challenging to analyze or visualize.

  2. If the pivot column has a high number of unique values, the resulting DataFrame may become extremely large, potentially exceeding available memory and causing performance issues.

To address these limitations, consider the following best practices:

  1. Filter data before pivoting to include only pertinent values.

  2. Set the optional values parameter to limit the number of pivoted columns.

  3. Whenever possible, pair the pivot function with an aggregation function to reduce the DataFrame's size and enhance processing efficiency (all three practices are combined in the sketch below).
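As an illustrative sketch combining all three practices (the year filter is just an example restriction, not required by the API):

# 1. Pre-filter the data, 2. pass an explicit values list, 3. aggregate while pivoting
best_practice_df = (
    df.filter("year = 2021")
      .groupBy("year", "quarter")
      .pivot("region", ["US", "EU"])
      .sum("revenue")
)
best_practice_df.show()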

Conclusion

In this blog post, we covered the concept of pivoting and the PySpark pivot function, walking through an example that reshapes quarterly revenue data, and looked at several ways to unpivot the result. We also discussed the limitations of pivoting and shared some best practices for optimal usage.

With this newfound knowledge, you are now better equipped to leverage the power of PySpark’s pivot function in your big data projects.
