PySpark Statistics Median – Calculating the Median in PySpark: A Comprehensive Guide for Everyone

Let’s explore different ways of calculating the median using PySpark, helping you become an expert.

As data continues to grow exponentially, efficient data processing becomes critical for extracting meaningful insights. PySpark, the Python API for Apache Spark, enables large-scale data processing in Python.

How to Calculate the Median?

The median is a measure of central tendency that represents the middle value in a dataset when the values are sorted in ascending or descending order. If the dataset has an odd number of values, the median is the middle value. If the dataset has an even number of values, the median is the average of the two middle values.

1) Arrange the data in ascending or descending order.

2) Determine the position of the median using the formula:

Median Position (P) = (n + 1) / 2

where 'n' is the total number of values in the dataset.

3) If the dataset has an odd number of values, the median is the value at the median position.

4) If the dataset has an even number of values, the median is the average of the values at positions (P – 0.5) and (P + 0.5).
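
To make the arithmetic concrete, here is a minimal pure-Python sketch of the steps above (the helper name compute_median is ours, not part of PySpark):

def compute_median(values):
    # Step 1: arrange the data in ascending order
    ordered = sorted(values)
    n = len(ordered)
    # Step 2: median position P = (n + 1) / 2, 1-based
    p = (n + 1) / 2
    if n % 2 == 1:
        # Step 3: odd count - the value at position P
        return ordered[int(p) - 1]
    # Step 4: even count - average of the values at positions P - 0.5 and P + 0.5
    return (ordered[int(p - 0.5) - 1] + ordered[int(p + 0.5) - 1]) / 2

print(compute_median([3, 5, 7, 9, 11]))  # 7
print(compute_median([3, 5, 7, 9]))      # (5 + 7) / 2 = 6.0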

How to Use the Median in Statistics and Machine Learning

1. Descriptive Statistics: The median is used to describe the central tendency of a dataset, offering a more accurate representation of the data’s center than the mean in cases of skewed distributions or the presence of outliers.

2. Exploratory Data Analysis (EDA): The median is often used during EDA to identify trends, patterns, or potential anomalies in the data.

3. Machine Learning Algorithms: In machine learning, the median is used for data preprocessing tasks such as filling in missing values or normalizing data (see the sketch after this list). It can also serve as the basis of robust loss functions in regression tasks, since the median minimizes the mean absolute error.

4. Non-parametric Methods: Non-parametric methods make fewer assumptions about the underlying data distribution, and the median appears frequently in them. In k-nearest neighbors regression, for example, the target can be predicted as the median of the k nearest neighbors’ values.
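
To illustrate the preprocessing use in point 3, here is a minimal sketch of median imputation with pyspark.ml.feature.Imputer (it assumes the SparkSession named spark created in the next section; the data and column names are illustrative):

from pyspark.ml.feature import Imputer

# A toy DataFrame with one missing (null) age
df_missing = spark.createDataFrame(
    [(1, 25.0), (2, None), (3, 31.0), (4, 29.0)],
    ["id", "age"],
)

# Replace nulls in "age" with the column median (29.0 here)
imputer = Imputer(inputCols=["age"], outputCols=["age_imputed"], strategy="median")
imputer.fit(df_missing).transform(df_missing).show()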

1. Import required libraries and initialize SparkSession

First, let’s import the necessary libraries and create a SparkSession, the entry point to use PySpark.

import findspark
findspark.init()  # Locate the local Spark installation

from pyspark.sql import SparkSession

# Create the SparkSession, the entry point for all DataFrame operations
spark = SparkSession.builder \
    .appName("Calculating Median with PySpark") \
    .getOrCreate()

2. How to calculate the Median of a list

A. How to calculate the Median of a list using RDD (Resilient Distributed Dataset)

sc = spark.sparkContext

data = [1, 3, 5, 7, 9, 11, 13, 15, 17, 19]
rdd = sc.parallelize(data)

# Sort the values and count them
sorted_rdd = rdd.sortBy(lambda x: x)
n = sorted_rdd.count()

if n % 2 == 0:
    # Even count: average the two middle values (positions n/2 and n/2 + 1)
    middle = sorted_rdd.take(n // 2 + 1)
    median = (middle[-2] + middle[-1]) / 2
else:
    # Odd count: the middle value is the last of the first n//2 + 1 elements
    median = sorted_rdd.take(n // 2 + 1)[-1]

print(f"Median: {median}")
Median: 10.0
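
Note that take() pulls the leading elements back to the driver, which is fine for a small demo but wasteful at scale. A more scalable sketch of the same idea, using zipWithIndex() and lookup() on the sorted_rdd and n from above:

# Key each sorted value by its 0-based position
indexed_rdd = sorted_rdd.zipWithIndex().map(lambda pair: (pair[1], pair[0]))

if n % 2 == 0:
    # Even count: average the values at the two middle indices
    median = (indexed_rdd.lookup(n // 2 - 1)[0] + indexed_rdd.lookup(n // 2)[0]) / 2
else:
    # Odd count: the value at the single middle index
    median = indexed_rdd.lookup(n // 2)[0]

print(f"Median: {median}")  # Median: 10.0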

B. How to calculate the Median of a list using the PySpark approxQuantile() function

To use approxQuantile(), first convert the list into a DataFrame. The function’s third argument is the relative error: 0.0 computes the exact quantile at higher cost, while a small value such as 0.01 trades a little accuracy for speed.

# Create a list
data_list = [1, 9, 3, 4, 5, 7, 11, 8, 2, 10, 6]

# Convert the list into a DataFrame
data_df = spark.createDataFrame([(value,) for value in data_list], ["values"])

# Compute the 0.5 quantile (the median) with a relative error of 0.01
median = data_df.approxQuantile("values", [0.5], 0.01)

# Print the result
print(f"Median: {median[0]}")
Median: 6.0

3. Preparing the Sample Data

To demonstrate the different methods of calculating the median, we’ll use a sample dataset containing three columns: id, age, and income. First, let’s load the data into a DataFrame:

data = [("1", 25, 40000),
        ("2", 30, 60000),
        ("3", 28, 50000),
        ("4", 35, 70000),
        ("5", 32, 55000)]

columns = ["id", "age", "income"]

df = spark.createDataFrame(data, columns)
df.show()
+---+---+------+
| id|age|income|
+---+---+------+
|  1| 25| 40000|
|  2| 30| 60000|
|  3| 28| 50000|
|  4| 35| 70000|
|  5| 32| 55000|
+---+---+------+

4. How to calculate the Median of a PySpark DataFrame column

There are several ways to calculate the Median of a DataFrame column in PySpark. We’ll explore three popular methods here.

A. How to calculate the Median of PySpark DataFrame columns using the approxQuantile() function

# Calculate the median for multiple columns
def calculate_median(dataframe, columns):
    medians = {}
    for column in columns:
        # approxQuantile with relativeError=0.0 returns the exact 0.5 quantile (the median)
        median = dataframe.approxQuantile(column, [0.5], 0.0)[0]
        medians[column] = median
    return medians

columns = ["age", "income"]
median_dict = calculate_median(df, columns)

# Print the result
print("Median:")
for column, median in median_dict.items():
    print(f"{column}: {median}")
Median:
age: 30.0
income: 55000.0

B. How to calculate the Median of a PySpark DataFrame column using a PySpark Window function

Another method is to use the percent_rank() window function, which assigns each row a relative rank between 0 and 1 within its window. This avoids the approximation of approxQuantile() when its relative error is nonzero, but it can be slower for large datasets, since an ordered window with no partition moves every row into a single partition. Note also that a row lands exactly at rank 0.5 only when the row count is odd; with an even count the filter below would match nothing.

from pyspark.sql.window import Window
from pyspark.sql.functions import percent_rank, col

# Rank rows by age; without partitionBy, all rows go to one partition
window_spec = Window.orderBy("age")
df = df.withColumn("percent_rank", percent_rank().over(window_spec))

# With an odd row count, the middle row has percent_rank exactly 0.5
median = df.filter(col("percent_rank") == 0.5).collect()[0]["age"]

print("Median:", median)

Median: 30
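
C. How to calculate the Median of a PySpark DataFrame column using the percentile_approx() function

On Spark 3.1 and later, the built-in percentile_approx() function computes quantiles directly inside a select(), avoiding both the driver-side approxQuantile() call and the single-partition window above (Spark 3.4+ also offers a dedicated median() function). A minimal sketch on the sample DataFrame:

from pyspark.sql import functions as F

# Compute the 0.5 quantile of each column in one job
median_df = df.select(
    F.percentile_approx("age", 0.5).alias("median_age"),
    F.percentile_approx("income", 0.5).alias("median_income"),
)
median_df.show()  # Expected: median_age = 30, median_income = 55000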

Conclusion

We’ve explored different methods for calculating the median in PySpark. Depending on your use case and the size of your dataset, you can choose the method that best suits your needs. As you continue your journey with PySpark, understanding these techniques will undoubtedly serve you well.
