PySpark Union – A Detailed Guide Harnessing the Power of PySpark Union

The PySpark union operation combines DataFrames that share a schema, which makes it a convenient way to merge data from different sources into a single DataFrame for further transformation.

What is PySpark Union?

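PySpark Union is a DataFrame operation, union(), that appends the rows of one DataFrame to another DataFrame with the same schema, returning a single DataFrame that contains all rows from both inputs. Each call combines two DataFrames, so you combine more than two by chaining calls. Columns are matched by position, not by name, so both DataFrames should list their columns in the same order.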

Note that union() does not eliminate duplicate rows (it behaves like UNION ALL in SQL), so call distinct() afterward if you want to remove duplicates.

Importing necessary libraries and creating sample DataFrames

import findspark
findspark.init()

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Create a Spark session
spark = SparkSession.builder.appName("PySpark Union Example").getOrCreate()

# Define the schema
schema = StructType([
    StructField("product", StringType(), True),
    StructField("price", IntegerType(), True),
    StructField("quantity", IntegerType(), True)
])

# Create DataFrame for region A
data_A = [("apple", 3, 5), ("banana", 1, 10), ("orange", 2, 8)]
df_A = spark.createDataFrame(data_A, schema=schema)

# Create DataFrame for region B
data_B = [("apple", 3, 5), ("banana", 1, 15), ("grape", 4, 6)]
df_B = spark.createDataFrame(data_B, schema=schema)

# Create DataFrame for region C
data_C = [("apple", 3, 10), ("banana", 1, 20), ("grape", 4, 10), ("orange", 2, 7)]
df_C = spark.createDataFrame(data_C, schema=schema)

1. Union Two DataFrames

Let’s dive into some example code to see how PySpark Union can be used in practice. We will use two DataFrames with the same schema, representing sales data from two different regions.

# Perform the Union operation on two DataFrames
df_union = df_A.union(df_B)

# Show the results
df_union.show()

+-------+-----+--------+
|product|price|quantity|
+-------+-----+--------+
|  apple|    3|       5|
| banana|    1|      10|
| orange|    2|       8|
|  apple|    3|       5|
| banana|    1|      15|
|  grape|    4|       6|
+-------+-----+--------+

2. Union without Duplicates

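Because union() keeps every row, the row ("apple", 3, 5), which appears in both df_A and df_B, shows up twice in the result above. Chain distinct() after the union to drop exact duplicate rows. Keep in mind that distinct() shuffles the data, so the row order of your output may differ from the order shown here.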

# Perform the Union operation on two DataFrames, then remove duplicate rows
df_union_dist = df_A.union(df_B).distinct()

# Show the results
df_union_dist.show()

+-------+-----+--------+
|product|price|quantity|
+-------+-----+--------+
|  apple|    3|       5|
| banana|    1|      10|
| orange|    2|       8|
| banana|    1|      15|
|  grape|    4|       6|
+-------+-----+--------+

3. Union Multiple DataFrames

In the previous example, we demonstrated how to perform a union operation on two DataFrames. Now, let’s take it a step further and see how we can use PySpark Union to merge multiple DataFrames.

# Perform the Union operation on multiple DataFrames
df_union_all = df_A.union(df_B).union(df_C)

# Show the results
df_union_all.show()

+-------+-----+--------+
|product|price|quantity|
+-------+-----+--------+
|  apple|    3|       5|
| banana|    1|      10|
| orange|    2|       8|
|  apple|    3|       5|
| banana|    1|      15|
|  grape|    4|       6|
|  apple|    3|      10|
| banana|    1|      20|
|  grape|    4|      10|
| orange|    2|       7|
+-------+-----+--------+

Conclusion

The PySpark Union operation is a powerful and versatile tool for combining DataFrames with the same schema. It enables you to merge data from different sources and concatenate DataFrames.
