This tutorial is a continuation of Part 1 and Part 2, which explain how to partition a DataFrame randomly or based on specified column(s) and cover some of the partition-related operations.
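As a quick recap of those two approaches, the sketch below assumes an active SparkSession named spark and a DataFrame df like the one loaded next; passing only a number redistributes rows round-robin, while passing column(s) hash-partitions the data.
# Random (round-robin) repartitioning into a fixed number of partitions
df_random = df.repartition(8)
# Hash partitioning based on one or more specified columns
df_by_col = df.repartition("db_type")
df_by_cols = df.repartition(4, "db_type", "db_name")
The examples in this part use the small sample dataset below, which contains duplicate rows and null values.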
# Read the CSV file into a DataFrame, treating the first line as the header
df = spark.read.csv("file:///path_to_files/csv_with_duplicates_and_nulls.csv", header=True)
df.show()
+-----+---------+-------+
|db_id| db_name|db_type|
+-----+---------+-------+
| 12| Teradata| RDBMS|
| 14|Snowflake|CloudDB|
| 15| Vertica| RDBMS|
| 12| Teradata| RDBMS|
| 22| Mysql| null|
| 50|Snowflake| RDBMS|
| 51| null|CloudDB|
+-----+---------+-------+
# Randomly (round-robin) repartition the DataFrame into 4 partitions
df_update = df.repartition(4)
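A quick aside before changing any configuration: repartition() always performs a full shuffle, even when it only lowers the partition count. If the goal is simply to reduce partitions, coalesce() is usually cheaper because it merges existing partitions without a full shuffle; a minimal sketch (df_fewer is just an illustrative name):
# coalesce() merges existing partitions and avoids a full shuffle
df_fewer = df_update.coalesce(2)
df_fewer.rdd.getNumPartitions()   # returns 2 here, since the 4 partitions are merged down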
# Check the current number of shuffle partitions (the default is 200)
sqlContext.getConf("spark.sql.shuffle.partitions")
Output: 200
# Change the shuffle partition count to 100 for this session
spark.sql("set spark.sql.shuffle.partitions=100")
sqlContext.getConf("spark.sql.shuffle.partitions")
Output: 100
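This setting determines how many partitions result from wide (shuffling) transformations such as groupBy and join; it does not change df_update above, which was given an explicit count of 4. As a rough sketch (assuming adaptive query execution is disabled, since AQE may coalesce shuffle partitions into fewer), the equivalent spark.conf API and the effect on a shuffle look like this:
# Equivalent way to read and set the same property via the SparkSession
spark.conf.get("spark.sql.shuffle.partitions")
spark.conf.set("spark.sql.shuffle.partitions", "100")
# A wide transformation such as groupBy now shuffles into that many partitions
df.groupBy("db_type").count().rdd.getNumPartitions()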
from pyspark.sql.functions import spark_partition_id
# Count how many rows ended up in each of the 4 partitions
df_update.groupBy(spark_partition_id()).count().show()
+--------------------+-----+
|SPARK_PARTITION_ID()|count|
+--------------------+-----+
| 1| 2|
| 3| 2|
| 2| 1|
| 0| 2|
+--------------------+-----+
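The same spark_partition_id() function (already imported above) can also be added as a regular column to see exactly which partition each row landed in; the column name partition_id below is just an illustrative choice.
# Tag every row with the id of the partition it currently belongs to
df_update.withColumn("partition_id", spark_partition_id()).show()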
# Confirm the number of partitions of the repartitioned DataFrame
df_update.rdd.getNumPartitions()
Output: 4
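Finally, to inspect the actual rows inside each partition rather than just their counts, the underlying RDD's glom() can be used. This is only a sketch suitable for small sample data like the above, since glom().collect() brings every partition back to the driver:
# glom() gathers the rows of each partition into a separate list
for i, rows in enumerate(df_update.rdd.glom().collect()):
    print("partition", i, ":", len(rows), "rows")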