PySpark Chi-Square Test – Understanding the Chi-Square Test: A Deep Dive with PySpark

Let’s explore the uses of Chi-Square in statistics and machine learning, and then demonstrate how to calculate the Chi-Square statistic in PySpark in different ways.

The Chi-Square Test is an essential tool for many data-driven applications and is widely used to test whether two categorical variables are related.

1. What is the Chi-Square Test?

The Chi-Square Test is a statistical hypothesis test used to determine if there is a significant association between two categorical variables in a sample. It is based on the difference between the observed frequencies in each category and the frequencies that we would expect to see under the assumption of independence (i.e., no relationship between the variables).

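When the null hypothesis of independence is true, the resulting test statistic approximately follows a Chi-Square distribution with (r – 1) × (c – 1) degrees of freedom, where r and c are the numbers of rows and columns in the contingency table.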

2. Chi-Square formula

The chi-square formula is used to calculate the chi-square statistic (χ²) in order to assess the goodness of fit or the independence of two categorical variables. The formula for the chi-square statistic is as follows:

χ² = Σ [(Observed frequency – Expected frequency)² / Expected frequency]

χ² = Σ [(O_ij – E_ij)² / E_ij]

where:

Σ denotes the sum across all categories or cells in the contingency table.

O_ij represents the observed frequency in a specific cell or category (i-th row and j-th column).

E_ij is the expected frequency in the same cell or category, calculated as (Row Total_i * Column Total_j) / Grand Total.

i and j index the rows and columns, respectively, in the contingency table.
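The formula can be checked by hand on a small table. The following is a minimal sketch in plain Python, not the method used later in this post; the helper name chi_square_statistic and the 2 × 3 table of counts are made up for illustration.

```python
def chi_square_statistic(observed):
    """Return (statistic, expected, dof) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    # E_ij = (Row Total_i * Column Total_j) / Grand Total
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

    # chi2 = sum over all cells of (O_ij - E_ij)^2 / E_ij
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )

    # Degrees of freedom for an independence test: (rows - 1) * (columns - 1)
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, expected, dof


# Hypothetical counts: 2 rows, 3 columns
observed = [[10, 20, 30], [20, 20, 20]]
statistic, expected, dof = chi_square_statistic(observed)

print(expected)   # [[15.0, 20.0, 25.0], [15.0, 20.0, 25.0]]
print(statistic)  # 16/3, about 5.33
print(dof)        # 2
```

Each cell contributes its squared deviation scaled by the expected count, so cells with small expected frequencies can dominate the statistic.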

3. How to interpret the Chi-Square Test Output?

Before interpreting the Chi-Square Test, it’s essential to understand the basics of hypothesis testing and what a p-value is.

Key steps in hypothesis testing

Formulate the null hypothesis (H0) and the alternative hypothesis (H1)
Choose a significance level (α)
Perform the appropriate statistical analysis to compute the test statistic and p-value
Make a decision (reject or fail to reject the null hypothesis)

If you are not familiar with these concepts, I recommend reading this blog post: what-is-p-value

How to interpret the output and test the hypothesis using the Chi-Square test

State the null and alternative hypotheses:
Null Hypothesis (H0): There is no association between the two categorical variables.
Alternative Hypothesis (H1): There is an association between the two categorical variables.

Choose a significance level (α):
Usually set at 0.05.

Calculate the Chi-Square test statistic and the p-value:
Example: Chi-Square statistic = 14.02, df = 3, p-value ≈ 0.0029

Make a decision:
If p-value < α, reject the null hypothesis (H0). This means there is a statistically significant association between the two categorical variables.
If p-value ≥ α, fail to reject the null hypothesis (H0). This means the data do not provide sufficient evidence of an association; it does not prove that the variables are independent.

Example: In the example above, p-value < α (0.0029 < 0.05), so we reject H0 in favor of H1.

This means there is a statistically significant association between the two categorical variables.
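The decision step can be sketched in plain Python. This is an illustrative sketch only: chi2_sf and decide are hypothetical helper names, and the statistics passed in are made-up values. The p-value is the upper-tail probability of the Chi-Square distribution, computed here from a standard closed form for integer degrees of freedom; in practice a library routine such as scipy.stats.chi2.sf would normally be used.

```python
import math


def chi2_sf(statistic, dof):
    """Upper-tail probability P(X >= statistic) for a Chi-Square variable with integer dof >= 1."""
    x = statistic / 2.0
    if dof % 2 == 0:
        a, q = 1.0, math.exp(-x)              # dof = 2 starts from exp(-x)
    else:
        a, q = 0.5, math.erfc(math.sqrt(x))   # dof = 1 starts from erfc(sqrt(x))
    # Each step raises the degrees of freedom by 2 and adds one more term
    while a < dof / 2.0:
        q += x ** a * math.exp(-x) / math.gamma(a + 1.0)
        a += 1.0
    return q


def decide(p_value, alpha=0.05):
    """Apply the decision rule: reject H0 only when p-value < alpha."""
    return "reject H0" if p_value < alpha else "fail to reject H0"


# Hypothetical test results, both with 2 degrees of freedom
for statistic in (6.0, 4.0):
    p_value = chi2_sf(statistic, dof=2)
    print(statistic, round(p_value, 4), decide(p_value))
```

With 2 degrees of freedom the tail probability reduces to exp(-statistic / 2), which makes the two hypothetical cases easy to verify by hand: the first falls just under 0.05 and the second does not.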

4. Uses of Chi-Square in Statistics and Machine Learning

a. Independence Testing: One of the primary uses of the Chi-Square Test is to examine the independence between two categorical variables. In this context, it can help researchers identify relationships between variables, which is particularly useful for feature selection in machine learning.

b. Goodness-of-Fit Testing: The Chi-Square Test can also be used to assess how well an observed frequency distribution fits a theoretical distribution. This is valuable for determining the suitability of a particular probability distribution in modeling data.

c. Feature Selection: In machine learning, feature selection is a crucial step to enhance model performance and reduce complexity. The Chi-Square Test can be employed to select the most relevant features for a given classification problem by identifying significant relationships between predictor variables and the target variable.
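Goodness-of-fit testing uses the same statistic, but the expected frequencies come from a theoretical distribution rather than from row and column totals. A minimal sketch in plain Python, assuming a made-up experiment of 120 die rolls tested against a fair die; the function name and the counts are hypothetical.

```python
def goodness_of_fit_statistic(observed, expected_probs):
    """Return (statistic, dof) comparing observed counts with theoretical probabilities."""
    total = sum(observed)
    expected = [total * p for p in expected_probs]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    dof = len(observed) - 1  # number of categories minus one
    return statistic, dof


# Hypothetical counts for faces 1..6 over 120 rolls
observed_rolls = [18, 22, 16, 24, 20, 20]
fair_die = [1 / 6] * 6

statistic, dof = goodness_of_fit_statistic(observed_rolls, fair_die)

# Tabulated Chi-Square critical value for dof = 5 at alpha = 0.05
CRITICAL_VALUE = 11.07

print(statistic, dof)               # about 2.0 with 5 degrees of freedom
print(statistic > CRITICAL_VALUE)   # False: no evidence against a fair die
```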

5. Import required libraries and initialize SparkSession

First, let’s import the necessary libraries and create a SparkSession, the entry point to use PySpark.

import findspark
findspark.init()

from pyspark import SparkFiles
from pyspark.sql import SparkSession
import pandas as pd
from scipy.stats import chi2_contingency

spark = SparkSession.builder.appName("ChiSquareTest").getOrCreate()

6. Preparing the Sample Data

Let’s use a real dataset from the UCI Machine Learning Repository called the “Adult” dataset. This dataset contains demographic and income data, and we’ll use it to examine the relationship between gender and income level.

url = "https://raw.githubusercontent.com/selva86/datasets/master/adultTrain.csv"
spark.sparkContext.addFile(url)

df = spark.read.csv(SparkFiles.get("adultTrain.csv"), header=True, inferSchema=True)

# Select only required variables
df = df.select("sex", "class")
df.show(5)

+------+-----+
|   sex|class|
+------+-----+
|  Male|<=50K|
|  Male|<=50K|
|  Male|<=50K|
|  Male|<=50K|
|Female|<=50K|
+------+-----+
only showing top 5 rows

7. Variables Description:

sex: Categorical variable for gender, with the values “Male” and “Female”.

class: Categorical variable for income level, with two categories, “<=50K” and “>50K”.

# Build the contingency table (Spark names the first column "sex_class")
contingency_table = df.stat.crosstab("sex", "class")
contingency_table.show()

# Convert the Spark contingency table to a pandas DataFrame indexed by sex
contingency_table_df = contingency_table.toPandas()
contingency_table_df = contingency_table_df.set_index('sex_class')

+---------+-----+----+
|sex_class|<=50K|>50K|
+---------+-----+----+
|     Male|13984|6396|
|   Female| 8670|1112|
+---------+-----+----+

8. Perform the chi-square test

# Perform the chi-square test
chi2, p_value, degrees_of_freedom, expected_frequencies = chi2_contingency(contingency_table_df)

# Print the results
print("Chi-Square Statistic:", chi2)
print("P-Value:", p_value)
print("Degrees of Freedom:", degrees_of_freedom)

print(" ")
print("Contingency Table:")
print(contingency_table_df)

print(" ")
print("Expected Frequencies:")
print(pd.DataFrame(expected_frequencies, index=contingency_table_df.index, columns=contingency_table_df.columns))

Chi-Square Statistic: 1415.2864042410245
P-Value: 1.00155254124934e-309
Degrees of Freedom: 1

Contingency Table:
           <=50K  >50K
sex_class             
Male       13984  6396
Female      8670  1112

Expected Frequencies:
                  <=50K         >50K
sex_class                           
Male       15306.959751  5073.040249
Female      7347.040249  2434.959751
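The statistic printed above is slightly lower than what the formula in section 2 gives for the same table. This is because scipy’s chi2_contingency applies Yates’ continuity correction by default when the table has one degree of freedom (a 2 × 2 table). The following plain-Python sketch reproduces both values from the counts shown above; the helper name pearson_chi2 is hypothetical and is not part of scipy.

```python
def pearson_chi2(observed, yates=False):
    """Chi-Square statistic for a contingency table, optionally with Yates' correction."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand_total
            diff = abs(o - e)
            if yates:
                # Yates' correction shrinks each |O - E| by 0.5 (never below zero)
                diff = max(diff - 0.5, 0.0)
            statistic += diff ** 2 / e
    return statistic


# Counts from the contingency table above: rows Male, Female; columns <=50K, >50K
table = [[13984, 6396], [8670, 1112]]

uncorrected = pearson_chi2(table)
corrected = pearson_chi2(table, yates=True)

print(uncorrected)  # about 1416.4, the plain formula
print(corrected)    # about 1415.29, matching the scipy output
```

Passing correction=False to chi2_contingency should give the uncorrected value. With counts this large the correction changes nothing about the conclusion.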

9. Results interpretation

The p-value (about 1.0e-309) is far below 0.05, so we reject the null hypothesis (H0).

This means there is a statistically significant association between gender and income level in this dataset.

Conclusion

The Chi-Square Test is an essential tool in statistics and machine learning. Its ability to assess the association between categorical variables and to perform goodness-of-fit testing makes it a valuable technique for data analysis.

PySpark provides the necessary tools to perform the Chi-Square Test, allowing for efficient and scalable computation. Whether you choose to use the ‘ChiSquareTest’ class or compute the test statistic manually using a contingency table, the Chi-Square Test will prove to be a powerful method for uncovering hidden relationships within your data.
