Let’s explore the uses of Chi-Square in statistics and machine learning, and then demonstrate how to calculate the Chi-Square statistic in PySpark in different ways.
The Chi-Square Test is a core tool in statistics and machine learning. It is widely used in data-driven applications to test whether two categorical variables are related.
1. What is the Chi-Square Test?
The Chi-Square Test is a statistical hypothesis test used to determine if there is a significant association between two categorical variables in a sample. It is based on the difference between the observed frequencies in each category and the frequencies that we would expect to see under the assumption of independence (i.e., no relationship between the variables).
When the null hypothesis of independence is true, the resulting test statistic approximately follows a Chi-Square distribution with (rows − 1) × (columns − 1) degrees of freedom.
2. Chi-Square formula
The chi-square formula is used to calculate the chi-square statistic (χ²) in order to assess the goodness of fit or the independence of two categorical variables. The formula for the chi-square statistic is as follows:
χ² = Σ [(Observed frequency – Expected frequency)² / Expected frequency]
Written per cell of the contingency table:
χ² = Σ [(O_ij – E_ij)² / E_ij]
where:
Σ denotes the sum across all categories or cells in the contingency table.
O_ij represents the observed frequency in a specific cell or category (i-th row and j-th column).
E_ij is the expected frequency in the same cell or category, calculated as (Row Total_i * Column Total_j) / Grand Total.
i and j index the rows and columns, respectively, in the contingency table.
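As a rough illustration of the formula, the sketch below computes the expected frequencies and the χ² statistic for a small 2×2 table in plain Python. The table values and the function names (`expected_frequencies`, `chi_square_statistic`) are made up for this example and do not come from any library.

```python
def expected_frequencies(observed):
    """Expected counts under independence: (row total * column total) / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square_statistic(observed):
    """Sum of (O_ij - E_ij)^2 / E_ij over every cell of the table."""
    expected = expected_frequencies(observed)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )


# Hypothetical 2x2 table of observed counts (illustrative values only).
observed = [[10, 20], [30, 40]]

print(expected_frequencies(observed))            # [[12.0, 18.0], [28.0, 42.0]]
print(round(chi_square_statistic(observed), 4))  # 0.7937
```

Each expected count uses only the row total, column total, and grand total, so the function works for a table of any size, not just 2×2.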
3. How to interpret the Chi-Square Test output?
Before interpreting the Chi-Square Test, it’s essential to understand the basics of hypothesis testing and what a p-value is.
Key steps in hypothesis testing:
- Formulate the null hypothesis (H0) and the alternative hypothesis (H1)
- Choose a significance level (α)
- Perform the appropriate statistical analysis to compute the test statistic and p-value
- Make a decision (reject or fail to reject the null hypothesis)
If you are not familiar with these concepts, I recommend reading this blog post: what-is-p-value
How to interpret the output and test the hypothesis using the Chi-Square Test:
- State the null and alternative hypotheses:
Null Hypothesis (H0): There is no association between the two categorical variables.
Alternative Hypothesis (H1): There is an association between the two categorical variables.
- Choose a significance level (α), usually set at 0.05.
- Calculate the Chi-Square test statistic and the p-value.
Example: Chi-Square statistic = 14.02, df = 3, p-value ≈ 0.0029
- Make a decision:
If p-value < α, reject the null hypothesis (H0). This means there is a statistically significant association between the two categorical variables.
If p-value ≥ α, fail to reject the null hypothesis (H0). This means the data do not provide enough evidence of an association; it does not prove that the variables are independent.
Example: In the example above, p-value < α (0.0029 < 0.05), so we reject H0 in favor of H1 and conclude that there is a significant association between the two categorical variables.
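The decision rule can be sketched in a few lines of Python. The helper names `chi2_sf` and `decide` are hypothetical, and the input values are illustrative. In practice a library routine such as `scipy.stats.chi2.sf` would normally supply the p-value. The version below avoids that dependency by using the closed-form upper-tail probability for integer degrees of freedom.

```python
import math


def chi2_sf(statistic, df):
    """Upper-tail probability P(X >= statistic) for a chi-square variable with integer df."""
    if df < 1:
        raise ValueError("df must be a positive integer")
    if statistic <= 0:
        return 1.0
    half = statistic / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum of (x/2)^k / k! for k = 0 .. df/2 - 1
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) plus a finite sum over half-integer powers of x/2
    total = math.erfc(math.sqrt(half))
    for k in range(1, (df - 1) // 2 + 1):
        total += math.exp(-half) * half ** (k - 0.5) / math.gamma(k + 0.5)
    return total


def decide(p_value, alpha=0.05):
    """Apply the decision rule: reject H0 only when the p-value is below alpha."""
    return "reject H0" if p_value < alpha else "fail to reject H0"


# Illustrative inputs: a statistic of 10.0 with 2 degrees of freedom.
p_value = chi2_sf(10.0, 2)
print(round(p_value, 5))  # 0.00674
print(decide(p_value))    # reject H0
```

The function returns "fail to reject H0" rather than "accept H0" because a large p-value only signals insufficient evidence against independence.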
4. Uses of Chi-Square in Statistics and Machine Learning
a. Independence Testing: One of the primary uses of the Chi-Square Test is to examine the independence between two categorical variables. In this context, it can help researchers identify relationships between variables, which is particularly useful for feature selection in machine learning.
b. Goodness-of-Fit Testing: The Chi-Square Test can also be used to assess how well an observed frequency distribution fits a theoretical distribution. This is valuable for determining the suitability of a particular probability distribution in modeling data.
c. Feature Selection: In machine learning, feature selection is a crucial step to enhance model performance and reduce complexity. The Chi-Square Test can be employed to select the most relevant features for a given classification problem by identifying significant relationships between predictor variables and the target variable.
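To make the feature-selection use case concrete, here is a minimal sketch that scores each categorical feature by its χ² statistic against the target and ranks the features from most to least associated. The feature names, the toy data, and the helpers `chi_square_score` and `rank_features` are assumptions made up for illustration. A real pipeline would more likely use a library selector and far more rows, since the Chi-Square approximation is unreliable when expected counts are very small.

```python
from collections import Counter


def chi_square_score(feature_values, labels):
    """Pearson chi-square statistic between one categorical feature and the labels."""
    n = len(labels)
    cell_counts = Counter(zip(feature_values, labels))
    feature_counts = Counter(feature_values)
    label_counts = Counter(labels)
    score = 0.0
    for value, value_count in feature_counts.items():
        for label, label_count in label_counts.items():
            expected = value_count * label_count / n
            observed = cell_counts.get((value, label), 0)
            score += (observed - expected) ** 2 / expected
    return score


def rank_features(features, labels):
    """Return (feature name, score) pairs sorted from highest to lowest score."""
    scores = {name: chi_square_score(values, labels) for name, values in features.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: "plan" tracks the label exactly, "region" is unrelated to it.
labels = ["yes", "yes", "yes", "yes", "no", "no", "no", "no"]
features = {
    "region": ["north", "south", "north", "south", "north", "south", "north", "south"],
    "plan": ["pro", "pro", "pro", "pro", "free", "free", "free", "free"],
}

print(rank_features(features, labels))  # [('plan', 8.0), ('region', 0.0)]
```

A higher score means a stronger departure from independence, so features at the top of the ranking are the better candidates to keep.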
5. Import required libraries and initialize SparkSession
First, let’s import the necessary libraries and create a SparkSession, the entry point to use PySpark.
import findspark
findspark.init()

from pyspark import SparkFiles
from pyspark.sql import SparkSession
import pandas as pd
from scipy.stats import chi2_contingency

# Create (or reuse) the SparkSession
spark = SparkSession.builder.appName("ChiSquareTest").getOrCreate()
6. Preparing the Sample Data
Let’s use a real dataset from the UCI Machine Learning Repository called the “Adult” dataset. It contains demographic and income data, and we’ll use it to examine the relationship between gender and income level.
url = "https://raw.githubusercontent.com/selva86/datasets/master/adultTrain.csv"
spark.sparkContext.addFile(url)

df = spark.read.csv(SparkFiles.get("adultTrain.csv"), header=True, inferSchema=True)

# Select only required variables
df = df.select("sex", "class")
df.show(5)
+------+-----+
|   sex|class|
+------+-----+
|  Male|<=50K|
|  Male|<=50K|
|  Male|<=50K|
|  Male|<=50K|
|Female|<=50K|
+------+-----+
only showing top 5 rows
7. Variables Description:
sex : Categorical variable describing gender, with two categories: “Male” and “Female”
class : Categorical variable describing income level, with two categories: “<=50K” and “>50K”
# Build the contingency table
contingency_table = df.stat.crosstab("sex", "class")
contingency_table.show()

# Convert the Spark contingency table to a pandas DataFrame
contingency_table_df = contingency_table.toPandas()
contingency_table_df = contingency_table_df.set_index('sex_class')
+---------+-----+----+
|sex_class|<=50K|>50K|
+---------+-----+----+
|     Male|13984|6396|
|   Female| 8670|1112|
+---------+-----+----+
8. Perform the chi-square test
# Perform the chi-square test
chi2, p_value, degrees_of_freedom, expected_frequencies = chi2_contingency(contingency_table_df)

# Print the results
print("Chi-Square Statistic:", chi2)
print("P-Value:", p_value)
print("Degrees of Freedom:", degrees_of_freedom)
print(" ")
print("Contingency Table:")
print(contingency_table_df)
print(" ")
print("Expected Frequencies:")
print(pd.DataFrame(expected_frequencies, index=contingency_table_df.index, columns=contingency_table_df.columns))
Chi-Square Statistic: 1415.2864042410245
P-Value: 1.00155254124934e-309
Degrees of Freedom: 1

Contingency Table:
           <=50K  >50K
sex_class
Male       13984  6396
Female      8670  1112

Expected Frequencies:
                  <=50K         >50K
sex_class
Male       15306.959751  5073.040249
Female      7347.040249  2434.959751
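As a cross-check, the statistic can be recomputed by hand from the contingency table. One detail matters here: when the table has one degree of freedom, `chi2_contingency` applies Yates' continuity correction by default (`correction=True`). The reported value is therefore slightly lower than what the plain formula gives. The sketch below uses a hypothetical helper, `chi_square_2x2`, to compute both versions in plain Python.

```python
import math

# Observed counts from the contingency table: rows Male / Female, columns <=50K / >50K.
observed = [[13984, 6396], [8670, 1112]]


def chi_square_2x2(observed, yates=True):
    """Chi-square statistic for a 2x2 table, optionally with Yates' continuity correction."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row_total in enumerate(row_totals):
        for j, col_total in enumerate(col_totals):
            expected = row_total * col_total / grand_total
            diff = abs(observed[i][j] - expected)
            if yates:
                # Move each observed count 0.5 toward its expected count, never past it.
                diff = max(diff - 0.5, 0.0)
            statistic += diff ** 2 / expected
    return statistic


corrected = chi_square_2x2(observed, yates=True)
uncorrected = chi_square_2x2(observed, yates=False)

# With 1 degree of freedom the upper-tail probability is erfc(sqrt(statistic / 2)).
p_value = math.erfc(math.sqrt(corrected / 2))

print(round(corrected, 4))    # 1415.2864, matching the SciPy output
print(round(uncorrected, 2))  # 1416.36, the plain formula without the correction
print(p_value < 0.05)         # True
```

With counts this large the correction barely changes the result. A tool that reports the uncorrected Pearson statistic would be expected to show a value near 1416.36 for the same table.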
9. Results interpretation
The p-value (about 1.0e-309) is far below 0.05, so we reject the null hypothesis (H0).
This means there is a statistically significant association between gender and income level in this dataset.
The Chi-Square Test is an essential tool in statistics and machine learning. Its ability to assess the association between categorical variables and to perform goodness-of-fit testing makes it a valuable technique for data analysis.
PySpark provides the tools needed to run the Chi-Square Test efficiently and at scale. You can use the ‘ChiSquareTest’ class in pyspark.ml.stat, or build a contingency table with crosstab and compute the test statistic from it. Either way, the Chi-Square Test is a practical method for uncovering relationships between categorical variables in your data.