
Select Columns in a PySpark DataFrame – A Comprehensive Guide to Selecting Columns in Different Ways

PySpark, the Python API for Apache Spark, is a powerful framework that allows you to process large volumes of data using the Python programming language. Its DataFrame API is a versatile tool for data manipulation and analysis.

One of the most common tasks when working with DataFrames is selecting specific columns. In this blog post, we will explore different ways to select columns in PySpark DataFrames, accompanied by example code for better understanding.

1. Selecting Columns using column names

The ‘select’ function is the most straightforward way to select columns from a DataFrame. You can specify the columns by their names as arguments, or by using the ‘col’ function from the ‘pyspark.sql.functions’ module.

import findspark
findspark.init()

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.master("local").appName("SelectColumns").getOrCreate()

data = [("Alice", 34, "Female"),
        ("Bob", 45, "Male"),
        ("Charlie", 28, "Male"),
        ("Diana", 39, "Female")]

columns = ["Name", "Age", "Gender"]

df = spark.createDataFrame(data, columns)

# Select columns using column names
selected_df1 = df.select("Name", "Age")

# Select columns using the 'col' function
selected_df2 = df.select(col("Name"), col("Age"))

df.show()

selected_df1.show()

selected_df2.show()
+-------+---+------+
|   Name|Age|Gender|
+-------+---+------+
|  Alice| 34|Female|
|    Bob| 45|  Male|
|Charlie| 28|  Male|
|  Diana| 39|Female|
+-------+---+------+

+-------+---+
|   Name|Age|
+-------+---+
|  Alice| 34|
|    Bob| 45|
|Charlie| 28|
|  Diana| 39|
+-------+---+

+-------+---+
|   Name|Age|
+-------+---+
|  Alice| 34|
|    Bob| 45|
|Charlie| 28|
|  Diana| 39|
+-------+---+

2. Selecting Columns using the ‘[ ]’ Operator

You can also use the ‘[ ]’ operator to select specific columns from a DataFrame, similar to the pandas library.

# Note: df["Name"] returns a Column object, not a DataFrame
name_col = df["Name"]

# Select multiple columns using the '[]' operator
selected_df3 = df.select(df["Name"], df["Age"])

selected_df3.show()
+-------+---+
|   Name|Age|
+-------+---+
|  Alice| 34|
|    Bob| 45|
|Charlie| 28|
|  Diana| 39|
+-------+---+

3. Selecting Columns using Index

In PySpark, you can’t directly select columns from a DataFrame using column indices. However, you can achieve this by first extracting the column names based on their indices and then selecting those columns.

# Define the column indices you want to select
column_indices = [0, 2]

# Extract column names based on indices
selected_columns = [df.columns[i] for i in column_indices]

# Select columns using extracted column names
selected_df4 = df.select(selected_columns)

# Show the result DataFrame
selected_df4.show()
+-------+------+
|   Name|Gender|
+-------+------+
|  Alice|Female|
|    Bob|  Male|
|Charlie|  Male|
|  Diana|Female|
+-------+------+

4. Selecting Columns using the ‘withColumn’ and ‘drop’ Functions

If you want to select specific columns while adding or removing columns, you can use the ‘withColumn’ function to add a new column and the ‘drop’ function to remove a column.

# Add a new column 'IsAdult' and remove the 'Gender' column
selected_df5 = df.withColumn("IsAdult", col("Age") >= 18).drop("Gender")

selected_df5.show()
+-------+---+-------+
|   Name|Age|IsAdult|
+-------+---+-------+
|  Alice| 34|   true|
|    Bob| 45|   true|
|Charlie| 28|   true|
|  Diana| 39|   true|
+-------+---+-------+

5. Selecting Columns using SQL Expressions

You can also use SQL-like expressions to select columns using the ‘selectExpr’ function. This is useful when you want to perform operations on columns while selecting them.

# Select columns with an SQL expression
selected_df6 = df.selectExpr("Name", "Age", "Age >= 18 as IsAdult")

selected_df6.show()
+-------+---+-------+
|   Name|Age|IsAdult|
+-------+---+-------+
|  Alice| 34|   true|
|    Bob| 45|   true|
|Charlie| 28|   true|
|  Diana| 39|   true|
+-------+---+-------+

Conclusion

In this post, we have explored different ways to select columns in PySpark DataFrames: the ‘select’ function, the ‘[ ]’ operator, column indices, the ‘withColumn’ and ‘drop’ functions, and SQL expressions via ‘selectExpr’.

Knowing how to use these techniques effectively will make your data manipulation tasks more efficient and help you unlock the full potential of PySpark.
