PySpark: Dataframe Partitions Part 3

This tutorial is a continuation of Part 1 and Part 2, which explain how to partition a dataframe randomly or based on specified column(s). It covers some of the partition-related operations.

Shuffle Partitions: Some dataframe operations, such as groupBy, sorting, dropDuplicates, and distinct, result in a shuffle. This means that the data in the current partitions is reshuffled into new partitions, and the number of partitions in the target dataframe equals the value set for the "spark.sql.shuffle.partitions" property. By default, this property is set to 200.
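A minimal sketch of this behavior, assuming a local SparkSession; the app name and the synthetic data built with spark.range are illustrative, and adaptive query execution is turned off so Spark does not coalesce the shuffle partitions at runtime:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")
         .appName("shuffle-partitions-demo")  # illustrative name
         .getOrCreate())

# Disable adaptive query execution so the shuffle partition count
# is not coalesced at runtime (Spark 3.x enables AQE by default).
spark.conf.set("spark.sql.adaptive.enabled", "false")

# Lower the shuffle partition count from the default of 200 to 8.
spark.conf.set("spark.sql.shuffle.partitions", "8")

df = spark.range(0, 1000)

# groupBy triggers a shuffle, so the resulting dataframe has 8 partitions.
grouped = df.groupBy((df.id % 10).alias("bucket")).count()
print(grouped.rdd.getNumPartitions())  # 8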


Rows in each Partition: The column function spark_partition_id can be used to get the number of rows in each partition, as in the sketch below.
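A minimal sketch, again assuming a local SparkSession; the repartition count of 4 is arbitrary. Each row is tagged with the id of the partition it lives in, then the rows are counted per partition id:

from pyspark.sql import SparkSession
from pyspark.sql.functions import spark_partition_id

spark = SparkSession.builder.master("local[*]").getOrCreate()

df = spark.range(0, 100).repartition(4)

# Tag each row with its partition id, then count rows per partition.
(df.withColumn("partition_id", spark_partition_id())
   .groupBy("partition_id")
   .count()
   .orderBy("partition_id")
   .show())
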
getNumPartitions: The RDD function getNumPartitions can be used to get the number of partitions in a dataframe, accessed through the dataframe's underlying RDD.
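A short sketch, reusing the same kind of local session and synthetic dataframe as above:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

df = spark.range(0, 100).repartition(4)

# getNumPartitions is defined on RDDs, so go through df.rdd.
print(df.rdd.getNumPartitions())  # 4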