Menu

PySpark Joins – A Comprehensive Guide on PySpark Joins with Example Code

Welcome to our blog post on PySpark join types. As an expert in the field, I am excited to share my knowledge with you. PySpark, the Apache Spark library for Python, provides a powerful and flexible framework for big data processing.

One of the most essential operations in data processing is joining datasets, which enables you to combine data from different sources based on a common key.

In this blog post, we will discuss the various join types supported by PySpark, explain their use cases, and provide example code for each type. So let’s dive in!

Type of Joins

  1. Inner Join

  2. Outer (Full) Join

  3. Left Join

  4. Right Join

  5. Left Semi Join

  6. Left Anti Join

  7. Cross Join

Importing necessary libraries and creating a sample DataFrames

import findspark
findspark.init()

from pyspark.sql import SparkSession

# Initialize Spark Session
spark = SparkSession.builder.master("local").appName("PySpark Join Types").getOrCreate()

# Create sample dataframes
df1 = spark.createDataFrame([(1, "A"), (2, "B"), (3, "C")], ["id", "value1"])
df2 = spark.createDataFrame([(1, "X"), (2, "Y"), (4, "Z")], ["id", "value2"])

1. Inner Join

An inner join returns rows from both dataframes that have matching keys. In other words, it returns only the rows that have common keys in both dataframes. This is the default join type in PySpark.

# Perform inner join
result = df1.join(df2, on="id", how="inner")

# Show result
result.show()
+---+------+------+
| id|value1|value2|
+---+------+------+
|  1|     A|     X|
|  2|     B|     Y|
+---+------+------+

2. Outer (Full) Join

An outer join, also known as a full join, returns all rows from both dataframes. If a key is present in one dataframe but not in the other, the missing values are filled with nulls.

# Perform outer join
result = df1.join(df2, on="id", how="outer")

# Show result
result.show()
+---+------+------+
| id|value1|value2|
+---+------+------+
|  1|     A|     X|
|  2|     B|     Y|
|  3|     C|  null|
|  4|  null|     Z|
+---+------+------+

3. Left Join

A left join returns all rows from the left dataframe and the matched rows from the right dataframe. If no match is found for a key in the right dataframe, the result will contain null values.

# Perform left join
result = df1.join(df2, on="id", how="left")

# Show result
result.show()
+---+------+------+
| id|value1|value2|
+---+------+------+
|  1|     A|     X|
|  3|     C|  null|
|  2|     B|     Y|
+---+------+------+

4. Right Join

A right join returns all rows from the right dataframe and the matched rows from the left dataframe. If no match is found for a key in the left dataframe, the result will contain null values.

# Perform right join
result = df1.join(df2, on="id", how="right")

# Show result
result.show()
+---+------+------+
| id|value1|value2|
+---+------+------+
|  1|     A|     X|
|  2|     B|     Y|
|  4|  null|     Z|
+---+------+------+

5. Left Semi Join

A left semi join returns only the columns from the left dataframe for the rows with matching keys in both dataframes. It is similar to an inner join but only returns the columns from the left dataframe.

# Perform left semi join
result = df1.join(df2, on="id", how="left_semi")

# Show result
result.show()
+---+------+
| id|value1|
+---+------+
|  1|     A|
|  2|     B|
+---+------+

6. Left Anti Join

A left anti join returns the rows from the left dataframe that do not have matching keys in the right dataframe. It is the opposite of a left semi join.

# Perform left anti join
result = df1.join(df2, on="id", how="left_anti")

# Show result
result.show()
+---+------+
| id|value1|
+---+------+
|  3|     C|
+---+------+

7. Cross Join

A cross join, also known as a cartesian join, returns the cartesian product of both dataframes. It combines each row from the left dataframe with each row from the right dataframe.

# Perform cross join
result = df1.crossJoin(df2)

# Show result
result.show()
+---+------+---+------+
| id|value1| id|value2|
+---+------+---+------+
|  1|     A|  1|     X|
|  1|     A|  2|     Y|
|  1|     A|  4|     Z|
|  2|     B|  1|     X|
|  2|     B|  2|     Y|
|  2|     B|  4|     Z|
|  3|     C|  1|     X|
|  3|     C|  2|     Y|
|  3|     C|  4|     Z|
+---+------+---+------+

Recommended

In this blog post, we have explored the various join types available in PySpark, including inner, outer, left, right, left semi, left anti, and cross joins. Each join type has its own unique use case, and understanding how to use them effectively can help you manipulate and analyze large datasets with ease.

With the help of the example code provided, you should now have a solid understanding of how to perform these join operations in PySpark. Happy data processing!

Course Preview

Machine Learning A-Z™: Hands-On Python & R In Data Science

Free Sample Videos:

Machine Learning A-Z™: Hands-On Python & R In Data Science

Machine Learning A-Z™: Hands-On Python & R In Data Science

Machine Learning A-Z™: Hands-On Python & R In Data Science

Machine Learning A-Z™: Hands-On Python & R In Data Science

Machine Learning A-Z™: Hands-On Python & R In Data Science