PySpark Statistics Mean – Calculating the Mean Using PySpark: A Comprehensive Guide for Everyone

Let's explore different ways of calculating the mean using PySpark, from a plain Python list to DataFrame columns.

As data continues to grow exponentially, efficient data processing becomes critical for extracting meaningful insights. PySpark, the Python API for Apache Spark, enables large-scale data processing in Python.

Concept of Mean:

The mean, also known as the average, is a measure of central tendency: the sum of a set of values divided by the number of values in that set. For example, for five values:

Mean (µ) = (x1 + x2 + x3 + x4 + x5) / 5

In general:

Mean (µ) = Σ (xi) / N

Where:

µ represents the mean

Σ (xi) denotes the sum of all values (xi) in the dataset

N stands for the number of values in the dataset
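As a quick illustration of the formula, here is a minimal plain-Python sketch that computes the mean directly from its definition. The variable names (values, total, n, mean_value) are chosen for this example only.

```python
# Compute the arithmetic mean from its definition: sum of values / number of values.
values = [1, 2, 3, 4, 5]

total = sum(values)      # Σ (xi): the sum of all values
n = len(values)          # N: the number of values
mean_value = total / n   # µ = Σ (xi) / N

print(f"Sum = {total}, N = {n}, Mean = {mean_value}")
```

For these five values the sum is 15 and N is 5, so the mean is 3.0.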

Let’s dive into the different ways of calculating the mean using PySpark.

How is the Mean used in Statistics and Machine Learning?

In both statistics and machine learning, the mean is a fundamental concept used for various purposes. Here’s how the mean is used in each discipline:

Statistics:

In statistics, the mean is a measure of central tendency that helps to summarize a dataset with a single value. It is calculated by summing all the values in the dataset and dividing by the total number of values. The mean has several applications in statistics, including:

a. Descriptive Statistics: The mean provides a basic summary of the data, giving a sense of the overall central location of the values within the dataset.

b. Inferential Statistics: The mean is used in hypothesis testing, confidence intervals, and linear regression to make inferences about the population from which the sample is drawn.

c. Probability Distributions: The mean is a parameter of, or follows directly from the parameters of, many probability distributions, such as the normal, binomial, and Poisson distributions. It characterizes the center of the distribution, while the variance or standard deviation describes its spread.

Machine Learning:

In machine learning, the mean plays a crucial role in various tasks, including:

a. Data Preprocessing: The mean is often used to impute missing values, normalize data, or center the data by subtracting the mean from each value.

b. Feature Engineering: The mean can be used as a feature in machine learning models. For example, the mean value of a variable within groups or over time can provide valuable information for the model.

c. Model Evaluation: In regression tasks, the mean squared error (MSE) and the mean absolute error (MAE) are common metrics for evaluating a model. Each is the mean of a per-sample error: the squared error for MSE and the absolute error for MAE.

d. Algorithm Development: The mean is used as a key component in various machine learning algorithms, such as k-means clustering, principal component analysis (PCA), and linear regression.
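To make two of these uses concrete, the following plain-Python sketch centers a column by subtracting its mean, then computes MSE and MAE as means of per-sample errors. The ages are the ones used in this guide's sample data; y_true and y_pred are hypothetical values invented purely for illustration.

```python
# Mean-centering: subtract the mean from each value so the result averages to zero.
ages = [25, 30, 28, 35, 32]
mean_age = sum(ages) / len(ages)
centered_ages = [age - mean_age for age in ages]

# Model evaluation: MSE and MAE are both means of per-sample errors.
# y_true and y_pred are hypothetical targets and predictions.
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]
errors = [t - p for t, p in zip(y_true, y_pred)]

mse = sum(e ** 2 for e in errors) / len(errors)   # mean of squared errors
mae = sum(abs(e) for e in errors) / len(errors)   # mean of absolute errors

print("Centered ages:", centered_ages)
print("MSE:", mse, "MAE:", mae)
```

On large datasets the same arithmetic would be expressed with DataFrame operations rather than Python lists, but the role of the mean is the same.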

1. Import required libraries and initialize SparkSession

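First, let’s import the necessary libraries and create a SparkSession, the entry point to PySpark.

# findspark locates a local Spark installation; it is only needed
# when pyspark cannot already be imported in your environment
import findspark
findspark.init()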

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Calculating Mean with PySpark") \
    .getOrCreate()

2. How to calculate the Mean of a list?

To calculate the mean of a Python list, you can first convert the list into an RDD (Resilient Distributed Dataset) and then call the RDD's mean() method.

# Your list of numbers
data = [1, 2, 3, 4, 5]

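# Convert the list to an RDD and compute its mean
# (named list_mean so it does not clash with the mean() function imported later)
sc = spark.sparkContext
list_mean = sc.parallelize(data).mean()

print("Mean of the list is:", list_mean)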
Mean of the list is: 3.0

3. Preparing the Sample Data

To demonstrate the different methods of calculating the mean, we’ll use a sample dataset containing three columns: id, age, and income. First, let’s load the data into a DataFrame:

data = [("1", 25, 40000),
        ("2", 30, 60000),
        ("3", 28, 50000),
        ("4", 35, 70000),
        ("5", 32, 55000)]

columns = ["id", "age", "income"]

df = spark.createDataFrame(data, columns)
df.show()
+---+---+------+
| id|age|income|
+---+---+------+
|  1| 25| 40000|
|  2| 30| 60000|
|  3| 28| 50000|
|  4| 35| 70000|
|  5| 32| 55000|
+---+---+------+

4. How to calculate the Mean of a PySpark DataFrame column?

There are several ways to calculate the mean of a DataFrame column in PySpark. We’ll explore two popular approaches here: agg() with mean(), and describe().

A. Using the agg() Function with mean()

from pyspark.sql.functions import mean

# Calculating the mean of a single column
mean_age = df.agg(mean("age"))

mean_age.show()
+--------+
|avg(age)|
+--------+
|    30.0|
+--------+
# Calculating the mean of multiple columns
result = df.agg(mean("age").alias("avg_age"), mean("income").alias("avg_income"))

# Show results
result.show()
+-------+----------+
|avg_age|avg_income|
+-------+----------+
|   30.0|   55000.0|
+-------+----------+
# Calculating the mean using the agg function and a dictionary
agg_dict = {"age": "mean", "income": "mean"}

result = df.agg(agg_dict)

# Show results
result.show()

+-----------+--------+
|avg(income)|avg(age)|
+-----------+--------+
|    55000.0|    30.0|
+-----------+--------+

B. Using the describe() Function

# describe() returns its statistics as strings, so cast the mean to float
mean_row = df.describe("age").filter("summary = 'mean'").first()
mean_age = float(mean_row["age"])

print(f"Mean Age: {mean_age}")
Mean Age: 30.0

Conclusion

We’ve explored three different methods for calculating the mean in PySpark: the RDD mean() method, agg() with mean(), and describe(). Depending on your use case and the size of your dataset, you can choose the method that best suits your needs. These techniques will serve you well as you continue working with PySpark.
