
Install PySpark on Windows – A Step-by-Step Guide with Code Examples

Introduction

Apache Spark is an open-source, distributed computing system that provides a fast and general-purpose cluster-computing framework for big data processing.

PySpark is the Python library for Spark, and it enables you to use Spark with the Python programming language.

This blog post will guide you through the process of installing PySpark on your Windows operating system and provide code examples to help you get started.

Prerequisites

1. Python 3.6 or later: Download and install Python from the official website (https://www.python.org/downloads/). Make sure to add Python to your PATH during installation.

2. Java 8: Download and install the Java Development Kit (JDK) 8 from Oracle’s website (https://www.oracle.com/java/technologies/javase/javase-jdk8-downloads.html). Set the JAVA_HOME environment variable to the installation directory.
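Before moving on, it can help to confirm both prerequisites from a script. The following is a minimal diagnostic sketch (the function name is illustrative, not part of any library):

```python
import os
import sys


def check_prerequisites():
    """Return a list of human-readable problems with the local setup."""
    problems = []
    # PySpark requires Python 3.6 or later.
    if sys.version_info < (3, 6):
        problems.append(f"Python 3.6+ required, found {sys.version.split()[0]}")
    # Spark needs a JDK; JAVA_HOME should point at its installation directory.
    java_home = os.environ.get("JAVA_HOME")
    if not java_home:
        problems.append("JAVA_HOME is not set")
    elif not os.path.isdir(java_home):
        problems.append(f"JAVA_HOME points to a missing directory: {java_home}")
    return problems


if __name__ == "__main__":
    for p in check_prerequisites():
        print("WARNING:", p)
```

An empty list means both prerequisites look good; each warning names the setting to fix before continuing.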

1. Install Apache Spark

  1. Download the latest version of Apache Spark from the official website (https://spark.apache.org/downloads.html). Select the package type as “Pre-built for Apache Hadoop”.

  2. Extract the downloaded .tgz file to a directory, e.g., C:\spark.

  3. Set the SPARK_HOME environment variable to the extracted directory path, e.g., C:\spark.

2. Install Hadoop

  1. Download the latest version of Hadoop from the official website (https://hadoop.apache.org/releases.html).

  2. Extract the downloaded .tar.gz file to a directory, e.g., C:\hadoop.

  3. Set the HADOOP_HOME environment variable to the extracted directory path, e.g., C:\hadoop.
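Variables set through the System Properties dialog (step 5) only take effect in newly opened shells. For a quick experiment you can also set them for the current Python process; this sketch uses the example paths from the steps above:

```python
import os

# Process-local equivalents of the environment variables described above.
# Use the System Properties dialog (step 5) or `setx` to make them permanent.
os.environ["SPARK_HOME"] = r"C:\spark"
os.environ["HADOOP_HOME"] = r"C:\hadoop"

for name in ("SPARK_HOME", "HADOOP_HOME"):
    print(name, "=", os.environ[name])
```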

3. Install PySpark using pip

Open a Command Prompt with administrative privileges and execute the following command to install PySpark using the Python package manager pip:

pip install findspark
pip install pyspark
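Once pip finishes, you can confirm that both packages are importable without starting a Spark job. This sketch uses only the standard library (`importlib.util.find_spec` returns None when a package is absent):

```python
import importlib.util


def installed(package):
    """True if `package` can be imported in the current environment."""
    return importlib.util.find_spec(package) is not None


for pkg in ("findspark", "pyspark"):
    print(pkg, "installed:", installed(pkg))
```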

4. Install winutils.exe

Since Hadoop is not natively supported on Windows, we need to use a utility called ‘winutils.exe’ to run Spark.

Download the appropriate version of winutils.exe for your Hadoop version from the following repository: https://github.com/steveloughran/winutils.

If you extracted Hadoop to C:\hadoop in step 2, place the downloaded ‘winutils.exe’ file in its ‘bin’ subdirectory (C:\hadoop\bin), creating the ‘bin’ directory first if it does not exist.
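A missing or misplaced winutils.exe is a common source of cryptic Spark errors on Windows, so it is worth checking its location programmatically. A sketch (the function name is illustrative):

```python
import os
from pathlib import Path


def winutils_present(hadoop_home=None):
    """Check that winutils.exe sits in HADOOP_HOME\\bin, where Spark expects it."""
    root = Path(hadoop_home or os.environ.get("HADOOP_HOME", r"C:\hadoop"))
    return (root / "bin" / "winutils.exe").is_file()


if __name__ == "__main__":
    print("winutils present:", winutils_present())
```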

5. Set the Environment Variables

a) Open the System Properties dialog by right-clicking on ‘This PC’ or ‘Computer’, then selecting ‘Properties’.

b) Click on ‘Advanced system settings’ and then the ‘Environment Variables’ button.

c) Under ‘System variables’, click on the ‘New’ button and add the following environment variables:

    Variable Name: HADOOP_HOME

    Variable Value: C:\hadoop

    Variable Name: SPARK_HOME

    Variable Value: C:\spark (the directory from step 1)

    If you skipped step 1 and rely only on the pip-installed package, point SPARK_HOME at the bundled copy instead: %USERPROFILE%\AppData\Local\Programs\Python\Python{your_python_version}\Lib\site-packages\pyspark, replacing {your_python_version} with your installed Python version, e.g., Python39 for Python 3.9.

d) Edit the ‘Path’ variable under ‘System variables’ by adding the following entries:

    %HADOOP_HOME%\bin

    %SPARK_HOME%\bin

e) Click ‘OK’ to save the changes.
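To verify the settings from a fresh shell, a small diagnostic script can report each variable and whether the executables are now reachable on the Path. A sketch (run it in a newly opened shell so the updated environment is picked up; the function name is illustrative):

```python
import os
import shutil


def environment_report():
    """Summarize the Spark-related environment settings from step 5."""
    report = {name: os.environ.get(name) for name in ("HADOOP_HOME", "SPARK_HOME")}
    # shutil.which searches the directories on PATH, so pyspark should resolve
    # once %SPARK_HOME%\bin has been added.
    report["pyspark_on_path"] = shutil.which("pyspark") is not None
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```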

6. Test the PySpark Installation

To test the PySpark installation, open a new Command Prompt and enter the following command:

pyspark

If everything is set up correctly, you should see the PySpark shell starting up, and you can begin using PySpark for your big data processing tasks.

7. Example Code

Here’s a simple example of using PySpark to count the number of occurrences of each word in a text file:

# word_count.py
import findspark
findspark.init()  # locate Spark using the SPARK_HOME environment variable

from pyspark import SparkConf, SparkContext

# Configure Spark
conf = SparkConf().setAppName("WordCount")
sc = SparkContext(conf=conf)

# Read input file
text_file = sc.textFile("input.txt")

# Perform word count
word_counts = text_file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)

# Save results to a directory (it must not already exist, or Spark raises an error)
word_counts.saveAsTextFile("output")

# Stop Spark context
sc.stop()

Create an input file named input.txt with some text content, and save the script above as word_count.py in the same directory.

Run the script using the following command:

spark-submit word_count.py

After the script finishes executing, you should see an “output” folder containing the word count results.
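Note that saveAsTextFile writes one part-NNNNN file per partition (plus a _SUCCESS marker) rather than a single file, and each line is the repr of a (word, count) tuple. The results can therefore be merged back in plain Python; a sketch, assuming the “output” directory name used in the script above:

```python
import ast
from pathlib import Path


def load_word_counts(output_dir="output"):
    """Merge the (word, count) tuples from Spark's part-* files into one dict."""
    counts = {}
    # The part-* pattern skips Spark's _SUCCESS marker file.
    for part in sorted(Path(output_dir).glob("part-*")):
        for line in part.read_text().splitlines():
            if not line.strip():
                continue
            word, n = ast.literal_eval(line)  # each line looks like ('spark', 3)
            counts[word] = counts.get(word, 0) + n
    return counts


if __name__ == "__main__":
    print(load_word_counts())
```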

Conclusion

Congratulations! You have successfully installed PySpark on your Windows operating system and executed a simple word count example.

You can now start exploring the powerful features of PySpark to process large datasets and build scalable data processing pipelines.
