
PySpark Pivot – A Detailed Guide to Harnessing the Power of PySpark's Pivot Function

Pivoting is a data transformation technique that involves converting rows into columns. PySpark’s ability to pivot DataFrames enables you to reshape data for more convenient analysis.

What is Pivoting?

Pivoting converts the distinct values of one column into multiple output columns, effectively turning rows into columns. This operation is valuable when reorganizing data for enhanced readability, aggregation, or analysis.

The PySpark Pivot Function

The pivot function in PySpark is a method available on GroupedData objects, allowing you to execute a pivot operation on a DataFrame. Its general syntax is:

GroupedData.pivot(pivot_col, values=None)

Arguments:

pivot_col: The column you wish to pivot.

values: An optional list of values to include in the pivoted DataFrame. If not specified, Spark first computes all unique values of the pivot column (which costs an extra pass over the data) and uses them as the output columns.

To utilize the pivot function, you must first group your DataFrame using the groupBy function. Next, you can call the pivot function on the GroupedData object, followed by the aggregation function.
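As a minimal sketch of this pattern (group_col, pivot_col, and value_col are placeholder names, not columns from the dataset below):

# group -> pivot -> aggregate
pivoted = df.groupBy("group_col").pivot("pivot_col").sum("value_col")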

Importing necessary libraries and creating a sample DataFrame

# findspark locates the local Spark installation and adds it to sys.path
import findspark
findspark.init()

from pyspark.sql import SparkSession

# Initialize SparkSession
spark = SparkSession.builder.appName("PySpark Pivot Example").getOrCreate()

# Sample data
data = [
    (2021, 1, "US", 5000),
    (2021, 1, "EU", 4000),
    (2021, 2, "US", 5500),
    (2021, 2, "EU", 4500),
    (2021, 3, "US", 6000),
    (2021, 3, "EU", 5000),
    (2021, 4, "US", 7000),
    (2021, 4, "EU", 6000),
]

# Create DataFrame
columns = ["year", "quarter", "region", "revenue"]
df = spark.createDataFrame(data, columns)
df.show()
+----+-------+------+-------+
|year|quarter|region|revenue|
+----+-------+------+-------+
|2021|      1|    US|   5000|
|2021|      1|    EU|   4000|
|2021|      2|    US|   5500|
|2021|      2|    EU|   4500|
|2021|      3|    US|   6000|
|2021|      3|    EU|   5000|
|2021|      4|    US|   7000|
|2021|      4|    EU|   6000|
+----+-------+------+-------+

1. How to Pivot a PySpark DataFrame?

Pivoting a DataFrame (i.e., converting it from long format to wide format) turns rows into columns. Here, each distinct value of the region column becomes its own revenue column:

# Execute the pivot operation

pivot_df = df.groupBy("year", "quarter").pivot("region").sum("revenue")

pivot_df.show()
+----+-------+----+----+
|year|quarter|  EU|  US|
+----+-------+----+----+
|2021|      2|4500|5500|
|2021|      1|4000|5000|
|2021|      3|5000|6000|
|2021|      4|6000|7000|
+----+-------+----+----+

As shown above, we have successfully pivoted the data by region, displaying the revenue for each quarter in separate columns for the US and EU regions.
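If you already know which pivot values you need, pass them explicitly as the second argument. This fixes the column order and is more efficient, because Spark skips the internal pass that computes the distinct values of the pivot column:

# Pivot with an explicit values list (US before EU, no distinct-values scan)
pivot_df2 = df.groupBy("year", "quarter").pivot("region", ["US", "EU"]).sum("revenue")
pivot_df2.show()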

2. How to Unpivot a PySpark DataFrame?

Unpivot is the reverse operation of pivot.

To unpivot a DataFrame (i.e., convert it from wide format back to long format), you can use the SQL stack function in combination with select and expr:

# Method 1: Unpivot PySpark DataFrame with the SQL stack() function

from pyspark.sql.functions import expr

# stack(2, ...) emits two rows per input row: ('EU', EU) and ('US', US)
unpivotExpr = "stack(2, 'EU', EU, 'US', US) as (region, revenue)"

unPivotDF = pivot_df.select("year", "quarter", expr(unpivotExpr)).where("revenue is not null")

unPivotDF.show()
+----+-------+------+-------+
|year|quarter|region|revenue|
+----+-------+------+-------+
|2021|      2|    EU|   4500|
|2021|      2|    US|   5500|
|2021|      1|    EU|   4000|
|2021|      1|    US|   5000|
|2021|      3|    EU|   5000|
|2021|      3|    US|   6000|
|2021|      4|    EU|   6000|
|2021|      4|    US|   7000|
+----+-------+------+-------+

# Method 2: Unpivot PySpark DataFrame with an explode()-based melt helper

from pyspark.sql.functions import array, col, explode, lit, struct
from pyspark.sql import DataFrame
from typing import Iterable

def melt_df(
        df: DataFrame,
        id_vars: Iterable[str], value_vars: Iterable[str],
        var_name: str = "variable", value_name: str = "value") -> DataFrame:
    """Convert a DataFrame from wide to long format, like pandas melt."""
    # Build an array of (variable, value) structs, one struct per value column
    _vars_and_vals = array(*(
        struct(lit(c).alias(var_name), col(c).alias(value_name))
        for c in value_vars))

    # explode() turns each struct in the array into its own row
    _tmp = df.withColumn("_vars_and_vals", explode(_vars_and_vals))

    # Keep the id columns and unpack the struct fields into named columns
    cols = list(id_vars) + [
        col("_vars_and_vals")[x].alias(x) for x in [var_name, value_name]]
    return _tmp.select(*cols)

df_melt = melt_df(pivot_df, id_vars=['year', 'quarter'], value_vars=['EU', 'US'],
                  var_name='region', value_name='revenue')
df_melt.show()

+----+-------+------+-------+
|year|quarter|region|revenue|
+----+-------+------+-------+
|2021|      2|    EU|   4500|
|2021|      2|    US|   5500|
|2021|      1|    EU|   4000|
|2021|      1|    US|   5000|
|2021|      3|    EU|   5000|
|2021|      3|    US|   6000|
|2021|      4|    EU|   6000|
|2021|      4|    US|   7000|
+----+-------+------+-------+
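If you are on PySpark 3.4 or later, the DataFrame API also ships a built-in unpivot method (with melt as an alias), which replaces both hand-rolled approaches above:

# Method 3: built-in unpivot (requires PySpark 3.4+)
unpivot_builtin = pivot_df.unpivot(
    ids=["year", "quarter"],     # identifier columns kept as-is
    values=["EU", "US"],         # wide columns collapsed into rows
    variableColumnName="region",
    valueColumnName="revenue",
)
unpivot_builtin.show()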

Limitations and Best Practices

Although pivoting is a powerful data transformation technique, it has some limitations:

  1. Pivoting may result in a wide DataFrame with numerous columns, which can be challenging to analyze or visualize.

  2. If the pivot column has a high number of unique values, the resulting DataFrame may become extremely large, potentially exceeding available memory and causing performance issues.

To address these limitations, consider the following best practices:

  1. Filter data before pivoting to include only pertinent values.

  2. Set the optional values parameter to limit the number of pivoted columns.

  3. Whenever possible, pair the pivot function with an aggregation function to reduce the DataFrame's size and enhance processing efficiency (all three practices are combined in the sketch below).
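As an illustrative sketch combining all three practices (the year filter is just an example restriction, not required by the API):

# 1. Pre-filter the data, 2. pass an explicit values list, 3. aggregate while pivoting
best_practice_df = (
    df.filter("year = 2021")
      .groupBy("year", "quarter")
      .pivot("region", ["US", "EU"])
      .sum("revenue")
)
best_practice_df.show()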

Conclusion

In this blog post, we covered the concept of pivoting and the PySpark pivot function, walking through an example that reshapes quarterly revenue data, and looked at several ways to unpivot the result. We also discussed the limitations of pivoting and shared some best practices for optimal usage.

With this newfound knowledge, you are now better equipped to leverage the power of PySpark’s pivot function in your big data projects.
