PySpark Drop Columns – Eliminate Unwanted Columns in PySpark DataFrame with Ease

Welcome to this detailed blog post on using PySpark’s drop() function to remove columns from a DataFrame. Let’s delve into the mechanics of drop() and explore several use cases to see how it fits into everyday data manipulation.

This post is a perfect starting point for those looking to expand their understanding of PySpark and improve their data wrangling skills.

Creating a DataFrame

Before we dive into the drop() function, let’s create a DataFrame to work with. In this example, we will create a simple DataFrame with four columns: “name”, “age”, “city”, and “gender”.

import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("PySpark Drop Column Example") \
    .getOrCreate()

data = [("Alice", 30, "New York", "F"),
        ("Bob", 28, "San Francisco", "M"),
        ("Cathy", 29, "Los Angeles", "F"),
        ("David", 32, "Chicago", "M")]

columns = ["name", "age", "city", "gender"]

df = spark.createDataFrame(data, columns)
df.show()

+-----+---+-------------+------+
| name|age|         city|gender|
+-----+---+-------------+------+
|Alice| 30|     New York|     F|
|  Bob| 28|San Francisco|     M|
|Cathy| 29|  Los Angeles|     F|
|David| 32|      Chicago|     M|
+-----+---+-------------+------+

Different ways to drop columns in PySpark DataFrame

Dropping a Single Column
Dropping Multiple Columns
Dropping Columns Conditionally
Dropping Columns Using Regex Pattern

1. Dropping a Single Column

The drop() function removes a single column when you pass it the column name. It returns a new DataFrame and leaves the original unchanged, so the result is stored in a separate variable here:

df_single = df.drop("gender")

df_single.show()

+-----+---+-------------+
| name|age|         city|
+-----+---+-------------+
|Alice| 30|     New York|
|  Bob| 28|San Francisco|
|Cathy| 29|  Los Angeles|
|David| 32|      Chicago|
+-----+---+-------------+

2. Dropping Multiple Columns

You can also use the drop() function to remove multiple columns from a DataFrame. Pass each column name as a separate argument (not as a single list).

For example, let’s remove both the “age” and “gender” columns:

df_multi = df.drop("age", "gender")

df_multi.show()

+-----+-------------+
| name|         city|
+-----+-------------+
|Alice|     New York|
|  Bob|San Francisco|
|Cathy|  Los Angeles|
|David|      Chicago|
+-----+-------------+

Alternatively, you can keep the column names in a list and unpack it with the * operator:

columns_to_drop = ["age", "gender"]

df_multi = df.drop(*columns_to_drop)

df_multi.show()

+-----+-------------+
| name|         city|
+-----+-------------+
|Alice|     New York|
|  Bob|San Francisco|
|Cathy|  Los Angeles|
|David|      Chicago|
+-----+-------------+
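One thing to keep in mind with longer lists: drop() ignores names that are not in the DataFrame’s schema, so a misspelled column name fails silently. Below is a minimal sketch of a pre-check. The helper name unknown_columns is hypothetical (it is not part of PySpark), and it works on plain lists of names such as df.columns, so it can be tried without a Spark session.

```python
def unknown_columns(existing, to_drop):
    """Return the requested names that are not present in `existing`, in order."""
    existing_set = set(existing)
    return [name for name in to_drop if name not in existing_set]


# Column names of the example DataFrame (what df.columns would return)
columns = ["name", "age", "city", "gender"]

# "gendr" is a deliberate typo for illustration
columns_to_drop = ["age", "gendr"]

typos = unknown_columns(columns, columns_to_drop)
print(typos)  # ['gendr']
```

If the list comes back empty, df.drop(*columns_to_drop) will remove every name you asked for.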

3. Dropping Columns Conditionally

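You might want to drop a column only when a condition holds, for example only if the column is present in the DataFrame. A plain Python “if” statement around drop() handles this. Note that drop() already ignores column names that are not in the schema, so the explicit check is mainly useful when other logic depends on whether the column exists.

df_conditional = df
if "gender" in df.columns:
    df_conditional = df.drop("gender")

df_conditional.show()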

+-----+---+-------------+
| name|age|         city|
+-----+---+-------------+
|Alice| 30|     New York|
|  Bob| 28|San Francisco|
|Cathy| 29|  Los Angeles|
|David| 32|      Chicago|
+-----+---+-------------+
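The condition does not have to be about a column’s presence. As a sketch, columns could also be picked by data type, using the (name, type) pairs that df.dtypes returns. The helper columns_of_type is a hypothetical name, and the dtypes list below is written out by hand to mirror the example DataFrame (Python integers are inferred as bigint), so the snippet runs without Spark.

```python
def columns_of_type(dtypes, type_name):
    """Return the names of columns whose type string equals `type_name`."""
    return [name for name, dtype in dtypes if dtype == type_name]


# Hand-written stand-in for df.dtypes on the example DataFrame
dtypes = [("name", "string"), ("age", "bigint"),
          ("city", "string"), ("gender", "string")]

numeric_columns = columns_of_type(dtypes, "bigint")
print(numeric_columns)  # ['age']
```

With a live DataFrame, df.drop(*columns_of_type(df.dtypes, "bigint")) would then remove those columns.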

4. Dropping Columns Using Regex Pattern

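The drop() function does not accept patterns itself, but you can match the column names against a regular expression (regex) in Python and keep only the columns that do not match. The example below does this with select(). Note that re.match() is anchored only at the start of the name, so this pattern also matches any column whose name begins with “gender” or “age”.

from pyspark.sql.functions import col
import re

regex_pattern = "gender|age"
# Keep every column whose name does not match the pattern
df_regex = df.select([col(c) for c in df.columns if not re.match(regex_pattern, c)])

df_regex.show()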

+-----+-------------+
| name|         city|
+-----+-------------+
|Alice|     New York|
|  Bob|San Francisco|
|Cathy|  Los Angeles|
|David|      Chicago|
+-----+-------------+
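The choice of matching function matters here. The sketch below, in plain Python, compares re.match() (anchored at the start only) with re.fullmatch() (the whole name must match). The extra column name age_group is made up for illustration, and names_matching is a hypothetical helper, not a PySpark function.

```python
import re


def names_matching(columns, pattern, full=True):
    """Return the column names that match `pattern`.

    full=True uses re.fullmatch (the whole name must match);
    full=False uses re.match (the match only has to start at position 0).
    """
    matcher = re.fullmatch if full else re.match
    return [c for c in columns if matcher(pattern, c)]


# "age_group" is an invented column, added to show the difference
columns = ["name", "age", "age_group", "city", "gender"]
pattern = "gender|age"

prefix_hits = names_matching(columns, pattern, full=False)
exact_hits = names_matching(columns, pattern, full=True)

print(prefix_hits)  # ['age', 'age_group', 'gender']
print(exact_hits)   # ['age', 'gender']
```

The resulting list can be passed to drop() with unpacking, for example df.drop(*names_matching(df.columns, pattern)).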

Conclusion

In this blog post, we learned about the PySpark drop() function and its various use cases. We explored how to remove single and multiple columns, drop columns conditionally, and remove columns whose names match a regex pattern.

With a solid understanding of the PySpark drop() function, you can now effectively manipulate your data to suit your needs.
