
PySpark orderBy() and sort() – How to Sort PySpark DataFrame

Apache Spark is a widely-used open-source distributed computing system that provides a fast and efficient platform for large-scale data processing. PySpark, the Python library for Spark, allows you to harness the power of Spark using Python’s simplicity and versatility.

In this blog post, we’ll dive into PySpark’s orderBy() and sort() functions, understand their differences, and see how they can be used to sort data in DataFrames.

Let's discuss the following topics:

The orderBy() Function

The sort() Function

Difference between orderBy() and sort()

Example Code: Sorting a DataFrame using orderBy() and sort()

PySpark DataFrames

In PySpark, DataFrames are the primary abstraction for working with structured data. A DataFrame is a distributed collection of data organized into named columns, similar to a table in a relational database. DataFrames can be created from various data sources, including structured data files, Hive, and more.

import findspark
findspark.init()

from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder \
    .appName("PySpark orderBy() and sort() Example") \
    .getOrCreate()

# Sample data
data = [
    ("Alice", 30, "New York"),
    ("Bob", 28, "San Francisco"),
    ("Charlie", 34, "Los Angeles"),
    ("Diana", 29, "Chicago")
]

# Create a DataFrame
columns = ["Name", "Age", "City"]
df = spark.createDataFrame(data, columns)

df.show()
+-------+---+-------------+
|   Name|Age|         City|
+-------+---+-------------+
|  Alice| 30|     New York|
|    Bob| 28|San Francisco|
|Charlie| 34|  Los Angeles|
|  Diana| 29|      Chicago|
+-------+---+-------------+

The orderBy() Function

The orderBy() function in PySpark is used to sort a DataFrame based on one or more columns. It takes one or more columns as arguments and returns a new DataFrame sorted by the specified columns.

Syntax:

DataFrame.orderBy(*cols, ascending=True)

Parameters:

*cols: Column names or Column expressions to sort by.
ascending (optional): A boolean, or a list of booleans with one entry per column. Defaults to True (ascending order).

The sort() Function

The sort() function is an alias of orderBy() and has the same functionality. The syntax and parameters are identical to orderBy().

Syntax:

DataFrame.sort(*cols, ascending=True)

Difference between orderBy() and sort()

There is no functional difference between orderBy() and sort() in PySpark. The sort() function is simply an alias for orderBy(). You can use either function based on your preference.

Example Code 1: Sorting a DataFrame using orderBy()

# Sort the DataFrame using orderBy()

sorted_by_age = df.orderBy("Age")
sorted_by_age.show()
+-------+---+-------------+
|   Name|Age|         City|
+-------+---+-------------+
|    Bob| 28|San Francisco|
|  Diana| 29|      Chicago|
|  Alice| 30|     New York|
|Charlie| 34|  Los Angeles|
+-------+---+-------------+
# Sort by multiple columns using orderBy(): Age ascending, then City descending
# (ages are unique here, so the City direction does not change the result)

sorted_by_age_and_city = df.orderBy(["Age", "City"], ascending=[True, False])
sorted_by_age_and_city.show()
+-------+---+-------------+
|   Name|Age|         City|
+-------+---+-------------+
|    Bob| 28|San Francisco|
|  Diana| 29|      Chicago|
|  Alice| 30|     New York|
|Charlie| 34|  Los Angeles|
+-------+---+-------------+

Example Code 2: Sorting a DataFrame using sort()

# Sort the DataFrame using sort()

sorted_by_age = df.sort("Age")
sorted_by_age.show()
+-------+---+-------------+
|   Name|Age|         City|
+-------+---+-------------+
|    Bob| 28|San Francisco|
|  Diana| 29|      Chicago|
|  Alice| 30|     New York|
|Charlie| 34|  Los Angeles|
+-------+---+-------------+
# Sort by multiple columns using sort(): Age ascending, then City descending
# (ages are unique here, so the City direction does not change the result)

sorted_by_age_and_city = df.sort(["Age", "City"], ascending=[True, False])
sorted_by_age_and_city.show()
+-------+---+-------------+
|   Name|Age|         City|
+-------+---+-------------+
|    Bob| 28|San Francisco|
|  Diana| 29|      Chicago|
|  Alice| 30|     New York|
|Charlie| 34|  Los Angeles|
+-------+---+-------------+

Let’s explore more code examples for orderBy() and sort() functions in PySpark

Example Code 3: Sorting a DataFrame using column expressions

from pyspark.sql.functions import desc, asc

# Sort the DataFrame by age in descending order using column expressions
sorted_by_age_desc_expr = df.orderBy(desc("Age"))
sorted_by_age_desc_expr.show()
+-------+---+-------------+
|   Name|Age|         City|
+-------+---+-------------+
|Charlie| 34|  Los Angeles|
|  Alice| 30|     New York|
|  Diana| 29|      Chicago|
|    Bob| 28|San Francisco|
+-------+---+-------------+
# Sort the DataFrame by city in ascending order using column expressions
sorted_by_city_asc_expr = df.sort(asc("City"))
sorted_by_city_asc_expr.show()
+-------+---+-------------+
|   Name|Age|         City|
+-------+---+-------------+
|  Diana| 29|      Chicago|
|Charlie| 34|  Los Angeles|
|  Alice| 30|     New York|
|    Bob| 28|San Francisco|
+-------+---+-------------+

Example Code 4: Sorting a DataFrame with NULL values

By default, Spark places NULLs first in ascending order and last in descending order. Note that orderBy() and sort() have no nulls_last or nulls_first arguments; NULL placement is controlled with the Column methods asc_nulls_last(), asc_nulls_first(), desc_nulls_last(), and desc_nulls_first().

from pyspark.sql.functions import col

data_with_nulls = [
    ("Alice", None, "New York"),
    ("Bob", 28, None),
    ("Charlie", 34, "Los Angeles"),
    ("Diana", 29, "Chicago")
]

# Create a DataFrame with NULL values
df_with_nulls = spark.createDataFrame(data_with_nulls, columns)

# Sort so that NULLs in the Age column appear last
sorted_with_nulls = df_with_nulls.orderBy(col("Age").asc_nulls_last())
sorted_with_nulls.show()
+-------+----+-----------+
|   Name| Age|       City|
+-------+----+-----------+
|    Bob|  28|       null|
|  Diana|  29|    Chicago|
|Charlie|  34|Los Angeles|
|  Alice|null|   New York|
+-------+----+-----------+
# Sort so that NULLs in the City column appear first
sorted_with_nulls_alt = df_with_nulls.sort(col("City").asc_nulls_first())
sorted_with_nulls_alt.show()
+-------+----+-----------+
|   Name| Age|       City|
+-------+----+-----------+
|    Bob|  28|       null|
|  Diana|  29|    Chicago|
|Charlie|  34|Los Angeles|
|  Alice|null|   New York|
+-------+----+-----------+

Example Code 5: Sorting a DataFrame using a custom sorting order

from pyspark.sql.functions import col, when

# Define a custom sorting order for cities
city_order = ["New York", "Los Angeles", "Chicago", "San Francisco"]

# Create a custom sorting column
custom_sort_col = when(col("City") == city_order[0], 0) \
    .when(col("City") == city_order[1], 1) \
    .when(col("City") == city_order[2], 2) \
    .when(col("City") == city_order[3], 3) \
    .otherwise(4)

# Sort the DataFrame using the custom sorting order
sorted_by_custom_order = df.orderBy(custom_sort_col)
sorted_by_custom_order.show()
+-------+---+-------------+
|   Name|Age|         City|
+-------+---+-------------+
|  Alice| 30|     New York|
|Charlie| 34|  Los Angeles|
|  Diana| 29|      Chicago|
|    Bob| 28|San Francisco|
+-------+---+-------------+

Conclusion

In this blog post, we covered the basics of PySpark’s orderBy() and sort() functions, their similarities, and how to use them to sort DataFrames. With this knowledge, you can now efficiently sort and manipulate large-scale data in PySpark.
