Introduction
PySpark is the Python API for Apache Spark, an open-source framework for large-scale data processing. It combines Spark's distributed compute engine with Python's simplicity, making it a popular choice among data scientists and engineers.
In this blog post, we will walk you through the installation process of PySpark on a Linux operating system and provide example code to get you started with your first PySpark project.
Prerequisites
Before installing PySpark, you will need the following on your Linux machine (steps 1 and 2 below cover the Java and Spark installs):
Python 3.6 or later
Java Development Kit (JDK) 8 or later
Apache Spark
1. Install Java Development Kit (JDK)
First, update the package index by running:
sudo apt update
Next, install the default JDK using the following command:
sudo apt install default-jdk
Verify the installation by checking the Java version:
java -version
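If you are not sure whether a JDK is already present, a quick check like the following avoids an unnecessary install (a minimal sketch; the exact version string varies by distribution):

```shell
# Check whether a JDK (not just a JRE) is already on the PATH.
if command -v javac >/dev/null 2>&1; then
    echo "JDK already installed: $(javac -version 2>&1)"
else
    echo "No JDK found; install one with: sudo apt install default-jdk"
fi
```

Checking for javac rather than java matters because some systems ship only a runtime (JRE), while Spark documentation assumes a full JDK.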
2. Install Apache Spark
Download the latest version of Apache Spark from the official website (https://spark.apache.org/downloads.html). At the time of writing, the latest version is Spark 3.2.0. Choose the package type as “Pre-built for Apache Hadoop 3.2 and later”.
Use the following commands to download and extract the Spark archive:
wget https://archive.apache.org/dist/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2.tgz
tar -xvzf spark-3.2.0-bin-hadoop3.2.tgz
Move the extracted folder to the /opt directory:
sudo mv spark-3.2.0-bin-hadoop3.2 /opt/spark
3. Set Up Environment Variables
Add the following lines to your ~/.bashrc file to set up the required environment variables:
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
Source the updated ~/.bashrc file to apply the changes:
source ~/.bashrc
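To confirm the variables resolved correctly, you can echo them in your current shell; the expected value below assumes Spark was moved to /opt/spark as in step 2:

```shell
# Re-create the variables in the current shell and confirm they resolve.
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
echo "SPARK_HOME is $SPARK_HOME"
# Once Spark is in place, spark-submit should also be found on the PATH:
# command -v spark-submit
```

Adding both bin and sbin to the PATH exposes the user-facing tools (spark-submit, spark-shell) as well as the cluster start/stop scripts.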
4. Install PySpark
Install PySpark using pip:
pip install pyspark
To keep the Python package in step with the standalone Spark installed above, you can pin the version instead: pip install pyspark==3.2.0.
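A quick way to confirm the package installed into the interpreter you expect (a minimal check; it does not start a Spark session):

```python
# Check that the pyspark package is importable and report where it lives.
import importlib.util

spec = importlib.util.find_spec("pyspark")
if spec is not None:
    print("pyspark found at:", spec.origin)
else:
    print("pyspark is not importable from this interpreter")
```

If the package is reported as not importable, make sure the pip you ran belongs to the same Python you will use to run your scripts (e.g. run python -m pip install pyspark).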
5. Verify PySpark Installation
Create a new Python file called pyspark_test.py and add the following code:
from pyspark.sql import SparkSession

# Start (or reuse) a Spark session.
spark = SparkSession.builder.appName("PySpark Test").getOrCreate()

# Build a small DataFrame from an in-memory list of tuples.
data = [("Alice", 34), ("Bob", 45), ("Cathy", 29)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)

# Print the DataFrame as a formatted table.
df.show()

# Release the session's resources.
spark.stop()
Run the script using:
python pyspark_test.py
On distributions where the python command is missing or points to Python 2, use python3 instead.
If everything is set up correctly, you should see the following output:
+-----+---+
| Name|Age|
+-----+---+
|Alice| 34|
|  Bob| 45|
|Cathy| 29|
+-----+---+
Conclusion
Congratulations! You have successfully installed PySpark on your Linux operating system and executed a simple PySpark script. You can now start building more complex data processing pipelines using PySpark.
Don’t forget to explore the official PySpark documentation (https://spark.apache.org/docs/latest/api/python/index.html) for more information and advanced use cases. Happy coding!