
PySpark: Dataframe Partitions Part 2

This tutorial is a continuation of Part 1, which explains how to partition a dataframe randomly or based on specified column(s), and covers some partition-related operations.


getNumPartitions: The RDD method getNumPartitions() returns the number of partitions backing a dataframe. Since it is an RDD method, it is accessed through the dataframe's rdd attribute.
RepartitionByRange Function: The repartitionByRange() function can be used to increase or decrease the number of partitions. The resulting dataframe is range-partitioned on the specified column(s), so rows with nearby values land in the same partition.


Coalesce Function: The coalesce() function reduces the number of partitions in a dataframe without a full shuffle. If a number of partitions larger than the current count is requested, coalesce is a no-op and the dataframe keeps its current number of partitions.