PySpark Drop Columns – Eliminate Unwanted Columns in PySpark DataFrame with Ease
Welcome to this detailed blog post on using PySpark’s drop() function to remove columns from a DataFrame. Let’s delve into the mechanics of drop() and explore several use cases to see how it fits into everyday data manipulation.
This post is a perfect starting point for those looking to expand their understanding of PySpark and improve their data wrangling skills.
Creating a DataFrame
Before we dive into the drop() function, let’s create a DataFrame to work with. In this example, we will create a simple DataFrame with four columns: “name”, “age”, “city”, and “gender”.
python
import findspark
findspark.init()  # locate the local Spark installation so pyspark can be imported

from pyspark.sql import SparkSession

# Start (or reuse) a Spark session
spark = SparkSession.builder \
    .appName("PySpark Drop Column Example") \
    .getOrCreate()

# Sample rows and the matching column names
data = [("Alice", 30, "New York", "F"),
        ("Bob", 28, "San Francisco", "M"),
        ("Cathy", 29, "Los Angeles", "F"),
        ("David", 32, "Chicago", "M")]
columns = ["name", "age", "city", "gender"]

df = spark.createDataFrame(data, columns)
df.show()
python
+-----+---+-------------+------+
| name|age|         city|gender|
+-----+---+-------------+------+
|Alice| 30|     New York|     F|
|  Bob| 28|San Francisco|     M|
|Cathy| 29|  Los Angeles|     F|
|David| 32|      Chicago|     M|
+-----+---+-------------+------+
Different ways to drop columns in a PySpark DataFrame
- Dropping a Single Column
- Dropping Multiple Columns
- Dropping Columns Conditionally
- Dropping Columns Using Regex Pattern
1. Dropping a Single Column
The drop() function removes a single column from a DataFrame. For example, to drop the “gender” column:
python
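# Keep the result under a new name so df still has all four columns for the later examples
df_without_gender = df.drop("gender")
df_without_gender.show()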
python
+-----+---+-------------+
| name|age|         city|
+-----+---+-------------+
|Alice| 30|     New York|
|  Bob| 28|San Francisco|
|Cathy| 29|  Los Angeles|
|David| 32|      Chicago|
+-----+---+-------------+
2. Dropping Multiple Columns
You can also use the drop() function to remove multiple columns from a DataFrame. Simply pass the column names as separate arguments.
For example, let’s remove both the “age” and “gender” columns:
python
# Each column name is passed as its own argument
df_name_city = df.drop("age", "gender")
df_name_city.show()
python
+-----+-------------+
| name|         city|
+-----+-------------+
|Alice|     New York|
|  Bob|San Francisco|
|Cathy|  Los Angeles|
|David|      Chicago|
+-----+-------------+
Alternatively, you can store the column names in a list and unpack it with the * operator:
python
columns_to_drop = ["age", "gender"]
# The * operator unpacks the list into separate arguments
df_name_city = df.drop(*columns_to_drop)
df_name_city.show()
python
+-----+-------------+
| name|         city|
+-----+-------------+
|Alice|     New York|
|  Bob|San Francisco|
|Cathy|  Los Angeles|
|David|      Chicago|
+-----+-------------+
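One thing to keep in mind when dropping several columns: drop() silently ignores names that are not in the DataFrame, so a typo goes unnoticed. The following is a small sketch of a stricter wrapper; `missing_columns` and `strict_drop` are hypothetical helper names, not part of PySpark, and “gendr” is a deliberate typo.

```python
def missing_columns(existing, requested):
    """Return the requested names that are not in `existing`, in order."""
    present = set(existing)
    return [name for name in requested if name not in present]


def strict_drop(df, *names):
    """Drop `names` from `df`, raising if any of them does not exist.

    `df` is expected to be a PySpark DataFrame (anything that offers
    `.columns` and `.drop(*names)` works).
    """
    missing = missing_columns(df.columns, names)
    if missing:
        raise ValueError(f"Columns not found: {missing}")
    return df.drop(*names)


# The name check on its own, using the columns of the example DataFrame
print(missing_columns(["name", "age", "city", "gender"], ["age", "gendr"]))
```

With the DataFrame from this post, strict_drop(df, "age", "gender") would behave like df.drop("age", "gender"), while strict_drop(df, "gendr") would raise a ValueError instead of returning the DataFrame unchanged.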
3. Dropping Columns Conditionally
You might want to drop a column only when a specific condition holds, for example only if the column exists in the DataFrame. You can combine drop() with an “if” statement to achieve this:
python
# Drop "gender" only if the DataFrame actually has that column
if "gender" in df.columns:
    df = df.drop("gender")
df.show()
python
+-----+---+-------------+
| name|age|         city|
+-----+---+-------------+
|Alice| 30|     New York|
|  Bob| 28|San Francisco|
|Cathy| 29|  Los Angeles|
|David| 32|      Chicago|
+-----+---+-------------+
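The condition does not have to be about a column’s name. As a sketch, the helpers below pick columns by data type, using the (name, type) pairs that df.dtypes returns; `columns_with_type` and `drop_by_type` are hypothetical names introduced here, not PySpark functions.

```python
def columns_with_type(dtypes, type_name):
    """Return the column names whose type string equals `type_name`.

    `dtypes` is a list of (column_name, type_string) pairs, which is the
    shape of a PySpark DataFrame's `dtypes` attribute.
    """
    return [name for name, dtype in dtypes if dtype == type_name]


def drop_by_type(df, type_name):
    """Drop every column of the given type from a PySpark DataFrame."""
    to_drop = columns_with_type(df.dtypes, type_name)
    # Nothing to remove: hand back the DataFrame as it is
    return df.drop(*to_drop) if to_drop else df


# Pairs as df.dtypes would report them for the original four-column
# DataFrame (Python ints are inferred as "bigint")
example_dtypes = [("name", "string"), ("age", "bigint"),
                  ("city", "string"), ("gender", "string")]
print(columns_with_type(example_dtypes, "string"))
```

For instance, drop_by_type(df, "bigint") would remove “age” from the example DataFrame and keep the string columns.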
4. Dropping Columns Using Regex Pattern
The drop() function itself does not accept patterns. To remove every column whose name matches a regular expression (regex), filter df.columns with Python’s re module and select the columns that remain:
python
from pyspark.sql.functions import col
import re
regex_pattern = "gender|age"
df = df.select([col(c) for c in df.columns if not re.match(regex_pattern, c)])
df.show()
python
+-----+-------------+
| name|         city|
+-----+-------------+
|Alice|     New York|
|  Bob|San Francisco|
|Cathy|  Los Angeles|
|David|      Chicago|
+-----+-------------+
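Note that re.match only anchors the pattern at the start of the name, so “gender|age” would also match a column such as “age_group”. If whole-name matching is what you want, re.fullmatch is one option. A small sketch, where `columns_matching` and `drop_matching` are hypothetical helpers and “age_group” is a made-up column:

```python
import re


def columns_matching(columns, pattern):
    """Return the names in `columns` that match `pattern` in full."""
    compiled = re.compile(pattern)
    return [name for name in columns if compiled.fullmatch(name)]


def drop_matching(df, pattern):
    """Drop the columns of a PySpark DataFrame whose full name matches `pattern`."""
    to_drop = columns_matching(df.columns, pattern)
    # Nothing matched: hand back the DataFrame as it is
    return df.drop(*to_drop) if to_drop else df


names = ["name", "age", "age_group", "city", "gender"]

# Whole-name matching leaves "age_group" alone
print(columns_matching(names, "gender|age"))

# Prefix matching with re.match also catches "age_group"
print([n for n in names if re.match("gender|age", n)])
```

Here drop_matching(df, "gender|age") would remove only the columns named exactly “gender” or “age”.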
Conclusion
In this blog post, we learned about the PySpark drop() function and its various use cases. We explored how to remove single and multiple columns, drop columns conditionally, and remove columns whose names match a regex pattern.
With a solid understanding of the PySpark drop() function, you can now effectively manipulate your data to suit your needs.