In this blog post, we will focus on one of the common data wrangling tasks in PySpark – renaming columns. We will explore different ways to rename columns in a PySpark DataFrame and illustrate the process with example code.
Different ways to rename columns in a PySpark DataFrame
- Renaming Columns Using ‘withColumnRenamed’
- Renaming Columns Using ‘select’ and ‘alias’
- Renaming Columns Using ‘toDF’
- Renaming Multiple Columns
Let's start by importing the necessary libraries, initializing a PySpark session, and creating a sample DataFrame to work with:
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("PySpark Rename Columns").getOrCreate()
from pyspark.sql import Row
data = [Row(name="Alice", age=25, city="New York"),
        Row(name="Bob", age=30, city="San Francisco"),
        Row(name="Cathy", age=35, city="Los Angeles")]
sample_df = spark.createDataFrame(data)
sample_df.show()
+-----+---+-------------+
| name|age| city|
+-----+---+-------------+
|Alice| 25| New York|
| Bob| 30|San Francisco|
|Cathy| 35| Los Angeles|
+-----+---+-------------+
1. Renaming Columns Using ‘withColumnRenamed’
The ‘withColumnRenamed’ method is the simplest way to rename a single column in a DataFrame; it takes the existing column name and the new name:
renamed_df = sample_df.withColumnRenamed("age", "user_age")
renamed_df.show()
+-----+--------+-------------+
| name|user_age| city|
+-----+--------+-------------+
|Alice| 25| New York|
| Bob| 30|San Francisco|
|Cathy| 35| Los Angeles|
+-----+--------+-------------+
2. Renaming Columns Using ‘select’ and ‘alias’
You can also use the ‘select’ and ‘alias’ methods to rename columns:
from pyspark.sql.functions import col
renamed_df = sample_df.select(col("name"), col("age").alias("user_age"), col("city"))
renamed_df.show()
+-----+--------+-------------+
| name|user_age| city|
+-----+--------+-------------+
|Alice| 25| New York|
| Bob| 30|San Francisco|
|Cathy| 35| Los Angeles|
+-----+--------+-------------+
3. Renaming Columns Using ‘toDF’
Another approach is to use the ‘toDF’ method to rename columns by passing a list of new column names:
renamed_df = sample_df.toDF("user_name", "user_age", "user_city")
renamed_df.show()
+---------+--------+-------------+
|user_name|user_age| user_city|
+---------+--------+-------------+
| Alice| 25| New York|
| Bob| 30|San Francisco|
| Cathy| 35| Los Angeles|
+---------+--------+-------------+
4. Renaming Multiple Columns
If you need to rename multiple columns at once, you can chain ‘withColumnRenamed’ calls:
renamed_df = sample_df.withColumnRenamed("name", "user_name") \
                      .withColumnRenamed("age", "user_age") \
                      .withColumnRenamed("city", "user_city")
renamed_df.show()
+---------+--------+-------------+
|user_name|user_age| user_city|
+---------+--------+-------------+
| Alice| 25| New York|
| Bob| 30|San Francisco|
| Cathy| 35| Los Angeles|
+---------+--------+-------------+
Alternatively, you can use a loop with ‘withColumnRenamed’ to rename multiple columns:
columns_to_rename = {"name": "user_name", "age": "user_age", "city": "user_city"}
renamed_df = sample_df
for old_name, new_name in columns_to_rename.items():
    renamed_df = renamed_df.withColumnRenamed(old_name, new_name)
renamed_df.show()
+---------+--------+-------------+
|user_name|user_age| user_city|
+---------+--------+-------------+
| Alice| 25| New York|
| Bob| 30|San Francisco|
| Cathy| 35| Los Angeles|
+---------+--------+-------------+
spark.stop()
In this post, we explored different ways to rename columns in a PySpark DataFrame. We covered the ‘withColumnRenamed’, ‘select’ with ‘alias’, and ‘toDF’ methods, as well as techniques for renaming multiple columns at once.
With this knowledge, you should be well-equipped to handle various column renaming scenarios in your PySpark projects.