PySpark is the Python API for Apache Spark, a popular open-source distributed data processing engine. Spark itself also offers high-level APIs in Scala, Java, and R, but PySpark is the one that lets you handle large-scale data processing tasks from Python.
One of the most common tasks when working with PySpark DataFrames is filtering rows based on certain conditions. In this blog post, we’ll discuss different ways to filter rows in PySpark DataFrames, along with code examples for each method.
Different ways to filter rows in PySpark DataFrames
1. Filtering Rows Using ‘filter’ Function
2. Filtering Rows Using ‘where’ Function
3. Filtering Rows Using SQL Queries
4. Combining Multiple Filter Conditions
Before we dive into filtering rows, let’s quickly review some basics of PySpark DataFrames. To work with PySpark DataFrames, we first need to import the necessary modules and create a SparkSession:
# findspark is only needed if Spark isn't already on your Python path
import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Filtering Rows in PySpark DataFrames") \
    .getOrCreate()
Next, let’s create a simple DataFrame to use in our examples:
from pyspark.sql import Row

data = [
    Row(id=1, name="Alice", age=30),
    Row(id=2, name="Bob", age=25),
    Row(id=3, name="Charlie", age=35),
    Row(id=4, name="David", age=28),
]

columns = ["id", "name", "age"]
df = spark.createDataFrame(data, columns)
df.show()
+---+-------+---+
| id| name|age|
+---+-------+---+
| 1| Alice| 30|
| 2| Bob| 25|
| 3|Charlie| 35|
| 4| David| 28|
+---+-------+---+
1. Filtering Rows Using ‘filter’ Function
The filter function is one of the most straightforward ways to filter rows in a PySpark DataFrame. It takes a boolean expression as an argument and returns a new DataFrame containing only the rows that satisfy the condition.
Example: Filter rows with age greater than 29:
filtered_df = df.filter(df.age > 29)
filtered_df.show()
+---+-------+---+
| id| name|age|
+---+-------+---+
| 1| Alice| 30|
| 3|Charlie| 35|
+---+-------+---+
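As a side note, the same condition can also be expressed with the col helper from pyspark.sql.functions, which references the column by name rather than through the df variable. A minimal equivalent sketch:
from pyspark.sql.functions import col

# Equivalent to df.filter(df.age > 29); the column is referenced by name
filtered_df = df.filter(col("age") > 29)
filtered_df.show()
This form is convenient when the DataFrame variable isn’t in scope, for example inside a chain of transformations.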
2. Filtering Rows Using ‘where’ Function
The where function is an alias for the ‘filter’ function and can be used interchangeably. It also takes a boolean expression as an argument and returns a new DataFrame containing only the rows that satisfy the condition.
Example: Filter rows whose name is either “Alice” or “Charlie”:
filtered_df = df.where(df.name.isin(["Alice", "Charlie"]))
filtered_df.show()
+---+-------+---+
| id| name|age|
+---+-------+---+
| 1| Alice| 30|
| 3|Charlie| 35|
+---+-------+---+
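Both filter and where also accept the condition as a SQL expression string instead of a Column object. As a small sketch, the isin filter above can be rewritten as:
# The condition can also be passed as a SQL expression string
filtered_df = df.where("name IN ('Alice', 'Charlie')")
filtered_df.show()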
3. Filtering Rows Using SQL Queries
PySpark also supports executing SQL queries to filter rows in a DataFrame. First, register your DataFrame as a temporary view using the ‘createOrReplaceTempView’ method. Then, you can run SQL queries against that view with the ‘sql’ function.
Example: Filter rows with age less than or equal to 25
df.createOrReplaceTempView("people")
filtered_df = spark.sql("SELECT * FROM people WHERE age <= 25")
filtered_df.show()
+---+----+---+
| id|name|age|
+---+----+---+
| 2| Bob| 25|
+---+----+---+
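Because the WHERE clause is ordinary Spark SQL, any standard SQL predicate works here. For instance, a range check with BETWEEN (a sketch; the output shown is what the sample data above would produce):
filtered_df = spark.sql("SELECT * FROM people WHERE age BETWEEN 26 AND 34")
filtered_df.show()
+---+-----+---+
| id| name|age|
+---+-----+---+
|  1|Alice| 30|
|  4|David| 28|
+---+-----+---+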
4. Combining Multiple Filter Conditions
You can combine multiple filter conditions using the ‘&’ (and), ‘|’ (or), and ‘~’ (not) operators. Make sure to wrap each condition in parentheses: in Python, these bitwise operators bind more tightly than comparison operators, so a condition like df.age > 25 & df.name != "David" would be parsed incorrectly without them.
Example: Filter rows with age greater than 25 and name not equal to “David”
filtered_df = df.filter((df.age > 25) & (df.name != "David"))
filtered_df.show()
+---+-------+---+
| id| name|age|
+---+-------+---+
| 1| Alice| 30|
| 3|Charlie| 35|
+---+-------+---+
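The ‘~’ (not) operator negates a condition in the same way. As a quick sketch, inverting the isin filter from earlier keeps the remaining rows (output shown for the sample data):
# Keep rows whose name is NOT in the given list
filtered_df = df.filter(~df.name.isin(["Alice", "Charlie"]))
filtered_df.show()
+---+-----+---+
| id| name|age|
+---+-----+---+
|  2|  Bob| 25|
|  4|David| 28|
+---+-----+---+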
In this post, we covered different ways to filter rows in PySpark DataFrames: the ‘filter’ and ‘where’ functions, SQL queries, and combining multiple filter conditions.