# PySpark Statistics – A Deep Dive into Variance with PySpark

Let’s dive into the concept of variance, the formulas used to calculate it, and how to compute it in PySpark, a powerful open-source data processing engine.

When analyzing data, it’s essential to understand the underlying concepts of variability and dispersion. Two key measures for this are variance and standard deviation; this post focuses on variance.

## What is Variance?

Variance is a measure of dispersion in a dataset. It quantifies how far individual data points in a distribution are from the mean. In other words, it tells us how spread out the data points are. A high variance indicates that the data points are far from the mean, while a low variance signifies that the data points are close to the mean.

## The Variance Formulas

Population Variance: σ^2 = Σ (xi – μ)^2 / N

Sample Variance: s^2 = Σ (xi – x̄)^2 / (n – 1)

Where:

σ^2 is the variance of the population
s^2 is the variance of the sample
xi represents each data point in the dataset
μ is the mean (average) of the population
x̄ is the mean (average) of the sample
N is the number of data points in the population dataset
n is the number of data points in the sample dataset
Σ denotes the sum of the squared differences between each data point and the mean
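To make the two formulas concrete, here is a short plain-Python sketch that applies both to the five values used throughout this post; the arithmetic matches the PySpark results shown later.

```python
# Worked example: apply both variance formulas to the values 10, 20, 30, 40, 50
data = [10, 20, 30, 40, 50]

mean = sum(data) / len(data)                     # x̄ = μ = 30.0
squared_diffs = [(x - mean) ** 2 for x in data]  # 400, 100, 0, 100, 400

population_variance = sum(squared_diffs) / len(data)    # divide by N
sample_variance = sum(squared_diffs) / (len(data) - 1)  # divide by n - 1

print("Population Variance:", population_variance)  # 200.0
print("Sample Variance:", sample_variance)          # 250.0
```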


## Importance of Variance in Statistics and Machine Learning

A. Data analysis: Variance helps in identifying how much the data points deviate from the mean, providing insights into the data distribution and helping analysts make informed decisions.

B. Model performance: In machine learning, variance is used to measure the error due to the sensitivity of a model to small fluctuations in the training set. High variance indicates overfitting, while underfitting is typically associated with high bias (and often low variance).

C. Feature selection: Variance can be used as a criterion for feature selection in machine learning. Features with low variance may not contribute much to the model’s predictive power, and removing them can help in reducing the complexity of the model.
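As a sketch of point C, the helper below (a hypothetical `low_variance_features` function, with an illustrative threshold value) flags columns whose sample variance is too small to carry predictive signal:

```python
# Minimal sketch of variance-based feature filtering.
# The threshold value (1e-3) is illustrative, not a recommendation.
def low_variance_features(columns, threshold=1e-3):
    """Return the names of columns whose sample variance falls below threshold."""
    drop = []
    for name, values in columns.items():
        mean = sum(values) / len(values)
        var = sum((x - mean) ** 2 for x in values) / (len(values) - 1)
        if var < threshold:
            drop.append(name)
    return drop

features = {
    "constant_flag": [1, 1, 1, 1, 1],  # zero variance: carries no signal
    "score": [10, 20, 30, 40, 50],     # high variance: worth keeping
}
print(low_variance_features(features))  # ['constant_flag']
```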

## 1. Import required libraries and initialize SparkSession

First, let’s import the necessary libraries and create a SparkSession, the entry point to use PySpark.

```python
import findspark
findspark.init()

from pyspark.sql import SparkSession
from pyspark.sql.functions import mean, stddev, col

spark = SparkSession.builder \
    .appName("Variance") \
    .getOrCreate()
```


## 2. Preparing the Sample Data

To demonstrate the different methods of calculating the Variance, we’ll use a sample dataset containing three columns. First, let’s load the data into a DataFrame:

```python
# Create a sample DataFrame
data = [("A", 10, 15), ("B", 20, 22), ("C", 30, 11), ("D", 40, 8), ("E", 50, 33)]
columns = ["Name", "Score_1", "Score_2"]
df = spark.createDataFrame(data, columns)

df.show()
```

```
+----+-------+-------+
|Name|Score_1|Score_2|
+----+-------+-------+
|   A|     10|     15|
|   B|     20|     22|
|   C|     30|     11|
|   D|     40|      8|
|   E|     50|     33|
+----+-------+-------+
```


## 3. How to calculate the Variance of a list using PySpark RDD’s variance() function

```python
data = [10, 20, 30, 40, 50]
rdd = spark.sparkContext.parallelize(data)

# RDD.variance() divides by N, i.e. it returns the population variance
variance = rdd.variance()
print("Population Variance:", variance)
```

```
Population Variance: 200.0
```


Manually calculating sample variance using RDD’s map() and reduce() functions

```python
data = [10, 20, 30, 40, 50]
rdd = spark.sparkContext.parallelize(data)

mean = rdd.mean()
n = rdd.count()

# Sum the squared deviations from the mean, then divide by n - 1
variance = rdd.map(lambda x: (x - mean) ** 2).reduce(lambda x, y: x + y) / (n - 1)
print("Sample Variance:", variance)
```

```
Sample Variance: 250.0
```


## 4. How to calculate Variance of PySpark DataFrame columns

Beyond RDDs, PySpark DataFrames offer several ways to calculate variance:

## A. Using DataFrame’s agg() function with the built-in var_pop() and var_samp() functions

```python
from pyspark.sql.functions import var_pop, var_samp

variance_pop = df.agg(var_pop("Score_1").alias("Population Variance"))
variance_samp = df.agg(var_samp("Score_1").alias("Sample Variance"))

variance_pop.show()
variance_samp.show()
```

```
+-------------------+
|Population Variance|
+-------------------+
|              200.0|
+-------------------+

+---------------+
|Sample Variance|
+---------------+
|          250.0|
+---------------+
```


## B. Using DataFrame’s describe() function and manually calculating variance

```python
summary_stats = df.describe()

# Extract mean and count from the summary statistics
mean = float(summary_stats.filter(col("summary") == "mean").select("Score_1").collect()[0][0])
count = int(summary_stats.filter(col("summary") == "count").select("Score_1").collect()[0][0])

# Dividing the summed squared differences by count (N) gives the population variance
variance = df.select(((col("Score_1") - mean) ** 2).alias("squared_difference")) \
    .agg({"squared_difference": "sum"}) \
    .collect()[0][0] / count

print("Population Variance:", variance)
```

```
Population Variance: 200.0
```


## C. Using the selectExpr() function with SQL expressions

```python
# Calculate Variance (population)
variance_pop = df.selectExpr("var_pop(Score_1)").collect()[0][0]

# Calculate Variance (sample)
variance_samp = df.selectExpr("var_samp(Score_1)").collect()[0][0]

# Print the results
print("Population Variance:", variance_pop)
print("Sample Variance:", variance_samp)
```

```
Population Variance: 200.0
Sample Variance: 250.0
```


## Conclusion

Understanding variance is crucial for interpreting the variability and dispersion of data. PySpark offers a robust and scalable solution to compute these measures for large datasets. By following the steps outlined in this blog post, you can effectively analyze your data and draw meaningful insights from it.
