PySpark: Dataframe Preview (Part 2)

Pradeep


This tutorial explains how to fetch 'n' rows from a Spark dataframe into a Python list, which can then be used to preview the data. The following dataframe functions are explained with examples:

Show
Head
Tail
First
Take
Collect with Limit

Sample Data: The examples below use the department dataset (department.parquet).


```
df = spark.read.parquet("file:///path_to_file/department.parquet")

df.show(truncate=False)
+-------+----------------------+-----------+
|dept_no|department_name       |loc_name   |
+-------+----------------------+-----------+
|100    |ACCOUNTS              |JAIPUR     |
|200    |R & D                 |NEW DELHI  |
|300    |SALES                 |BENGALURU  |
|400    |INFORMATION TECHNOLOGY|BHUBANESWAR|
+-------+----------------------+-----------+
```

➠ Head: The head() function returns either the first row of a dataframe or the first 'n' rows as a list of rows. Use it only for a small number of records: everything returned by head() is held in the driver's memory, and the driver process can crash with an OutOfMemoryError if the data volume is very high.

Example 1: Calling head() without a parameter returns the first row of the dataframe.
```
df.head()

Output:
Row(dept_no=100, department_name='ACCOUNTS', loc_name='JAIPUR')
```

Example 2: Calling head() with a number 'n' returns the first 'n' rows as a list of rows.


```
df.head(2)

Output:
[Row(dept_no=100, department_name='ACCOUNTS', loc_name='JAIPUR'), Row(dept_no=200, department_name='R & D', loc_name='NEW DELHI')]
```

➠ Tail: The tail() function returns the last 'n' rows of a dataframe as a list of rows. Use it only for a small number of records: everything returned by tail() is held in the driver's memory, and the driver process can crash with an OutOfMemoryError if the data volume is very high.

Example 1: Calling tail() with 1 as the parameter returns the last row of the dataframe as a list of rows.
```
df.tail(1)

Output:
[Row(dept_no=400, department_name='INFORMATION TECHNOLOGY', loc_name='BHUBANESWAR')]
```

Example 2: Calling tail() with a number 'n' returns the last 'n' rows as a list of rows.


```
df.tail(2)

Output:
[Row(dept_no=300, department_name='SALES', loc_name='BENGALURU'), Row(dept_no=400, department_name='INFORMATION TECHNOLOGY', loc_name='BHUBANESWAR')]
```

➠ First: Like head() without a parameter, the first() function returns the first row of a dataframe.

Example 1: Calling first() on a dataframe returns its first row.
```
df.first()

Output:
Row(dept_no=100, department_name='ACCOUNTS', loc_name='JAIPUR')
```

➠ Take: Like head(), the take() function returns the first 'n' rows of a dataframe as a list of rows. Use it only for a small number of records: everything returned by take() is held in the driver's memory, and the driver process can crash with an OutOfMemoryError if the data volume is very high.

Example 1: Calling take() with 1 as the parameter returns the first row of the dataframe as a list of rows.
```
df.take(1)

Output:
[Row(dept_no=100, department_name='ACCOUNTS', loc_name='JAIPUR')]
```

Example 2: Calling take() with a number 'n' returns the first 'n' rows as a list of rows.


```
df.take(2)

Output:
[Row(dept_no=100, department_name='ACCOUNTS', loc_name='JAIPUR'), Row(dept_no=200, department_name='R & D', loc_name='NEW DELHI')]
```

➠ Collect: The collect() function returns all the records of a dataframe as a list of rows. Because it returns every row, combine it with the limit() function so that only a small number of records is returned: everything returned by collect() is held in the driver's memory, and the driver process can crash with an OutOfMemoryError if the data volume is very high.

Example 1: Calling collect() on a dataframe returns all of its rows as a list of rows.


```
df.collect()

Output:
[Row(dept_no=100, department_name='ACCOUNTS', loc_name='JAIPUR'), Row(dept_no=200, department_name='R & D', loc_name='NEW DELHI'), Row(dept_no=300, department_name='SALES', loc_name='BENGALURU'), Row(dept_no=400, department_name='INFORMATION TECHNOLOGY', loc_name='BHUBANESWAR')]
```

Example 2: The limit() function can be used to restrict the number of records before calling collect() on a dataframe. The example below limits the result to 1 record.
```
df.limit(1).collect()

Output:
[Row(dept_no=100, department_name='ACCOUNTS', loc_name='JAIPUR')]
```


dbmstutorials.com
