PySpark Drop Columns – Eliminate Unwanted Columns in PySpark DataFrame with Ease

Welcome to this detailed blog post on using PySpark’s drop() function to remove columns from a DataFrame. Let’s delve into the mechanics of drop() and explore several use cases to see how it fits into everyday data manipulation.

This post is a perfect starting point for those looking to expand their understanding of PySpark and improve their data wrangling skills.

Creating a DataFrame

Before we dive into the drop() function, let’s create a DataFrame to work with. In this example, we will create a simple DataFrame with four columns: “name”, “age”, “city”, and “gender”.

import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("PySpark Drop Column Example") \
    .getOrCreate()

data = [("Alice", 30, "New York", "F"),
        ("Bob", 28, "San Francisco", "M"),
        ("Cathy", 29, "Los Angeles", "F"),
        ("David", 32, "Chicago", "M")]

columns = ["name", "age", "city", "gender"]

df = spark.createDataFrame(data, columns)
df.show()
+-----+---+-------------+------+
| name|age|         city|gender|
+-----+---+-------------+------+
|Alice| 30|     New York|     F|
|  Bob| 28|San Francisco|     M|
|Cathy| 29|  Los Angeles|     F|
|David| 32|      Chicago|     M|
+-----+---+-------------+------+

Different ways to drop columns in PySpark DataFrame

  1. Dropping a Single Column
  2. Dropping Multiple Columns
  3. Dropping Columns Conditionally
  4. Dropping Columns Using Regex Pattern

1. Dropping a Single Column

The drop() function removes a single column when you pass it the column’s name. It returns a new DataFrame and leaves the original untouched, so the result is assigned to a new variable here:

df_no_gender = df.drop("gender")

df_no_gender.show()
+-----+---+-------------+
| name|age|         city|
+-----+---+-------------+
|Alice| 30|     New York|
|  Bob| 28|San Francisco|
|Cathy| 29|  Los Angeles|
|David| 32|      Chicago|
+-----+---+-------------+

2. Dropping Multiple Columns

You can also use drop() to remove several columns at once. Pass each column name as a separate argument; drop() takes the names individually, not as a single list.

For example, let’s remove both the “age” and “gender” columns:

df_name_city = df.drop("age", "gender")

df_name_city.show()
+-----+-------------+
| name|         city|
+-----+-------------+
|Alice|     New York|
|  Bob|San Francisco|
|Cathy|  Los Angeles|
|David|      Chicago|
+-----+-------------+

Alternatively, keep the column names in a list and unpack it with the * operator:

columns_to_drop = ["age", "gender"]

df_name_city = df.drop(*columns_to_drop)

df_name_city.show()
+-----+-------------+
| name|         city|
+-----+-------------+
|Alice|     New York|
|  Bob|San Francisco|
|Cathy|  Los Angeles|
|David|      Chicago|
+-----+-------------+
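Sometimes it is easier to state which columns to keep than which to drop. The following is a minimal sketch of that idea, not part of the PySpark API: columns_except is a hypothetical helper, and a plain Python list stands in for df.columns (which PySpark also returns as a list of strings).

```python
def columns_except(all_columns, keep):
    """Return the names in all_columns that are not in keep, preserving order."""
    keep_set = set(keep)
    return [c for c in all_columns if c not in keep_set]

# Stand-in for df.columns from the example DataFrame
all_columns = ["name", "age", "city", "gender"]

columns_to_drop = columns_except(all_columns, keep=["name", "city"])
print(columns_to_drop)  # ['age', 'gender']
```

With a live DataFrame, the resulting list could be unpacked into drop() in the same way as above, for example df.drop(*columns_to_drop).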

3. Dropping Columns Conditionally

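You might want to drop a column only when a condition holds, for example only when the column is actually present. You can combine drop() with an “if” statement that checks df.columns. When given a column name that does not exist, drop() does nothing rather than raising an error, so the check matters mainly when other logic depends on the column being present.

# Rebuild the four-column DataFrame so this example does not depend on earlier ones
df = spark.createDataFrame(data, columns)

if "gender" in df.columns:
    df = df.drop("gender")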

df.show()
+-----+---+-------------+
| name|age|         city|
+-----+---+-------------+
|Alice| 30|     New York|
|  Bob| 28|San Francisco|
|Cathy| 29|  Los Angeles|
|David| 32|      Chicago|
+-----+---+-------------+
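The condition does not have to be about a column’s name. As a sketch under stated assumptions, the snippet below picks columns by data type. columns_of_type is a hypothetical helper, and the dtypes list is a hand-written stand-in for what df.dtypes would return for the example DataFrame (a list of (name, type) pairs; the type strings shown are assumed).

```python
def columns_of_type(dtypes, type_name):
    """Return the column names whose type string equals type_name."""
    return [name for name, dtype in dtypes if dtype == type_name]

# Hand-written stand-in for df.dtypes on the example DataFrame (assumed values)
dtypes = [
    ("name", "string"),
    ("age", "bigint"),
    ("city", "string"),
    ("gender", "string"),
]

numeric_columns = columns_of_type(dtypes, "bigint")
print(numeric_columns)  # ['age']
```

On a real DataFrame, this would be used as df.drop(*columns_of_type(df.dtypes, "bigint")).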

4. Dropping Columns Using Regex Pattern

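The drop() function does not accept a regular expression (regex) directly, but you can use Python’s re module to pick the matching names from df.columns and pass them to drop(). Note that re.match anchors the pattern only at the start of each name, so this pattern would also match a column such as “age_group”.

import re

regex_pattern = "gender|age"
# Collect every column whose name starts with a match for the pattern
columns_to_drop = [c for c in df.columns if re.match(regex_pattern, c)]
df = df.drop(*columns_to_drop)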

df.show()
+-----+-------------+
| name|         city|
+-----+-------------+
|Alice|     New York|
|  Bob|San Francisco|
|Cathy|  Los Angeles|
|David|      Chicago|
+-----+-------------+
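Because re.match only anchors at the start of the string, a pattern like "gender|age" can catch more columns than intended. The sketch below illustrates the difference between re.match and re.fullmatch on a plain list of names. columns_matching is a hypothetical helper, and the “age_group” column is invented for illustration.

```python
import re

def columns_matching(all_columns, pattern, full=True):
    """Return names matching pattern; full=True requires the whole name to match."""
    matcher = re.fullmatch if full else re.match
    return [c for c in all_columns if matcher(pattern, c)]

# Hypothetical column list with a name that merely starts with "age"
all_columns = ["name", "age", "age_group", "city", "gender"]

prefix_matches = columns_matching(all_columns, "gender|age", full=False)
exact_matches = columns_matching(all_columns, "gender|age")

print(prefix_matches)  # ['age', 'age_group', 'gender']
print(exact_matches)   # ['age', 'gender']
```

Either list could then be unpacked into drop(), for example df.drop(*exact_matches). Which matching mode is appropriate depends on how the columns are named.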

Conclusion

In this blog post, we learned about the PySpark drop() function and its various use cases. We explored how to remove single and multiple columns, drop columns conditionally, and remove columns whose names match a regex pattern.

With a solid understanding of drop(), you can now trim a DataFrame down to exactly the columns you need.
