One of the essential functions provided by PySpark is the show() method, which displays the contents of a DataFrame in a tabular format. In this blog post, we will delve into the show() function, its usage, and its various options to help you make the most of this powerful tool.
1. Understanding DataFrames in PySpark
Before we discuss the show() function, it’s essential to understand DataFrames in PySpark. A DataFrame is a distributed collection of data organized into named columns.
It is conceptually equivalent to a table in a relational database or a data frame in R or Python, but optimized for large-scale processing. You can think of a DataFrame as a spreadsheet with rows and columns.
2. PySpark show() Function
The show() function is a method available for DataFrames in PySpark. It is used to display the contents of a DataFrame in a tabular format, making it easier to visualize and understand the data.
This function is particularly useful during the data exploration and debugging phases of a project.
Syntax
DataFrame.show(n=20, truncate=True, vertical=False)
Parameters:
n: The number of rows to display. The default value is 20.
truncate: If set to True, strings longer than 20 characters are truncated. If set to an integer greater than 1, strings are truncated to that length instead. The default value is True.
vertical: If set to True, each row is printed vertically, with one line per column value. The default value is False.
3. Usage of PySpark show()
Let’s look at some examples of how to use the show() function in PySpark:
a. Basic Usage
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("PySpark show() Example") \
    .getOrCreate()
data = [("Alice", 34), ("Bob", 45), ("Charlie", 29), ("David", 31)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)
df.show()
+-------+---+
|   Name|Age|
+-------+---+
|  Alice| 34|
|    Bob| 45|
|Charlie| 29|
|  David| 31|
+-------+---+
b. Display a Specific Number of Rows
df.show(2)
+-----+---+
| Name|Age|
+-----+---+
|Alice| 34|
|  Bob| 45|
+-----+---+
only showing top 2 rows
c. Display Contents Without Truncation
df.show(truncate=False)
+-------+---+
|Name   |Age|
+-------+---+
|Alice  |34 |
|Bob    |45 |
|Charlie|29 |
|David  |31 |
+-------+---+
d. Display Contents Vertically
df.show(vertical=True)
-RECORD 0-------
 Name | Alice
 Age  | 34
-RECORD 1-------
 Name | Bob
 Age  | 45
-RECORD 2-------
 Name | Charlie
 Age  | 29
-RECORD 3-------
 Name | David
 Age  | 31
Conclusion
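The PySpark show() function is a valuable tool for displaying DataFrame contents in a tabular format. By adjusting n, truncate, and vertical, you can control how many rows are shown, whether long values are cut off, and whether each row is printed vertically.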