PySpark show() – Display PySpark DataFrame Contents in Table

One of the essential functions provided by PySpark is the show() method, which displays the contents of a DataFrame in a tabular format. In this blog post, we will delve into the show() function, its usage, and its various options to help you make the most of this powerful tool.

1. Understanding DataFrames in PySpark

Before we discuss the show() function, it’s essential to understand DataFrames in PySpark. A DataFrame is a distributed collection of data organized into named columns.

It is conceptually equivalent to a table in a relational database or a data frame in R or pandas, but it is partitioned across a cluster and optimized for large-scale processing. You can think of a DataFrame as a spreadsheet with rows and columns.

2. PySpark show() Function

The show() function is a method available for DataFrames in PySpark. It is used to display the contents of a DataFrame in a tabular format, making it easier to visualize and understand the data.

This function is particularly useful during the data exploration and debugging phases of a project.

Syntax

DataFrame.show(n=20, truncate=True, vertical=False)

Parameters:

n: The number of rows to display. The default value is 20.

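truncate: If set to True, strings longer than 20 characters are truncated. If set to an integer greater than 1, strings are truncated to that length instead and cells are right-aligned. If set to False, the full content is shown. The default value is True.

vertical: If set to True, each row is printed vertically, with one line per column value. The default value is False.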
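The truncate rule is easier to see on a single cell. The following is a minimal pure-Python sketch of that rule, not Spark's actual implementation; the helper name truncate_cell is hypothetical and exists only for illustration. It assumes that a truncated value ends in "..." and that limits below 4 simply cut the text.

```python
def truncate_cell(value, truncate=True):
    """Approximate how show() shortens one cell (illustrative only)."""
    text = str(value)
    # truncate=True means a 20-character limit; False (or 0) disables truncation.
    if truncate is True:
        limit = 20
    elif truncate is False:
        limit = 0
    else:
        limit = int(truncate)
    if limit <= 0 or len(text) <= limit:
        return text
    # Very small limits leave no room for an ellipsis, so the text is just cut.
    if limit < 4:
        return text[:limit]
    # Otherwise keep limit - 3 characters and append "..." to mark the cut.
    return text[: limit - 3] + "..."


print(truncate_cell("a" * 25))                  # 17 a's followed by "..."
print(truncate_cell("Charlie", truncate=5))     # Ch...
print(truncate_cell("a" * 25, truncate=False))  # all 25 a's
```

Values of 20 characters or fewer pass through unchanged under the default, which is why the short names in the examples below are never shortened.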

3. Usage of PySpark show()

Let’s look at some examples of how to use the show() function in PySpark.

a. Basic Usage

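# findspark locates a local Spark installation and adds it to sys.path.
# It can be skipped when pyspark is installed with pip.
import findspark
findspark.init()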

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("PySpark show() Example") \
    .getOrCreate()

data = [("Alice", 34), ("Bob", 45), ("Charlie", 29), ("David", 31)]
columns = ["Name", "Age"]

df = spark.createDataFrame(data, columns)

df.show()
+-------+---+
|   Name|Age|
+-------+---+
|  Alice| 34|
|    Bob| 45|
|Charlie| 29|
|  David| 31|
+-------+---+

b. Display a Specific Number of Rows

df.show(2)
+-----+---+
| Name|Age|
+-----+---+
|Alice| 34|
|  Bob| 45|
+-----+---+
only showing top 2 rows

c. Display Contents Without Truncation

df.show(truncate=False)
+-------+---+
|Name   |Age|
+-------+---+
|Alice  |34 |
|Bob    |45 |
|Charlie|29 |
|David  |31 |
+-------+---+

d. Display Contents Vertically

df.show(vertical=True)
-RECORD 0-------
 Name | Alice   
 Age  | 34      
-RECORD 1-------
 Name | Bob     
 Age  | 45      
-RECORD 2-------
 Name | Charlie 
 Age  | 29      
-RECORD 3-------
 Name | David   
 Age  | 31      

Conclusion

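The PySpark show() function is a valuable tool for displaying DataFrame contents in a tabular format. Its n, truncate, and vertical parameters control how many rows are printed, how long values are shortened, and whether each row is laid out vertically, which makes it well suited to quick data exploration and debugging.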
