PySpark: Dataframe Partitions Part 1

Pradeep

PySpark: Dataframe Partitions Part 1

This tutorial will explain with examples on how to partition a dataframe randomly or based on specified column(s) of a dataframe.

By default, Spark will create as many number of partitions in dataframe as there will be number of files in the read path.
Function getNumPartitions can be used to get the number of partition in a dataframe.
Partitions are useful for processing data in parallel, as many executors can work on the data in parallel as there will be partitions in a dataframe.
If the joins are between dataframes which are partitioned on the joined columns then join will have better performance.
Partitions can be created in a dataframe while reading data or after reading data from a data source.
Number of partitions can be increased or decreased in a dataframe. However if data volume is high, this might be a costlier operation with respect to resources in the cluster.
Dataframe functions such as sort, dropduplicates, distinct will result in shuffle partition. This mean that the data present in current partitions will be reshuffled into new partitions and the number of partitions in target dataframe will be equivalent to the value set for "spark.sql.shuffle.partitions" property, default value for this property is 200

Following topics will be covered on these pages related to partitions, click on item in the below list and it will take you to the respective section of the page:

getNumPartitions: Number of Partitions
spark_partition_id: Partition_id of a Row
repartition
repartitionByRange
coalesce

Sample Data: Dataset used in the below examples can be downloaded from here.


df = spark.read.csv("file:///path_to_files/csv_with_duplicates_and_nulls.csv",header=True)

df.show()
+-----+---------+-------+
|db_id|  db_name|db_type|
+-----+---------+-------+
|   12| Teradata|  RDBMS|
|   14|Snowflake|CloudDB|
|   15|  Vertica|  RDBMS|
|   12| Teradata|  RDBMS|
|   22|    Mysql|   null|
|   50|Snowflake|  RDBMS|
|   51|     null|CloudDB|
+-----+---------+-------+

➠ getNumPartitions: RDD function getNumPartitions can be used to get the number of partition in a dataframe.

Example 1: Dataframe "df" was converted to RDD using rdd attribute and then getNumPartitions function was applied on it to get number of partitions.
```
df.rdd.getNumPartitions()

Output: 3
```

➠ spark_partition_id: Column function spark_partition_id can be used to get the partition id to which each row belongs to in a dataframe.

Example 1: Using spark_partition_id in the select statement to get partition id to which each row belongs to in a dataframe.


from pyspark.sql.functions import spark_partition_id
df_update = df.repartition(4)

df_update.rdd.getNumPartitions()
Output: 4


df_update.select("db_id", "db_name", spark_partition_id().alias("partition#") ).show()
+-----+---------+----------+
|db_id|  db_name|partition#|
+-----+---------+----------+
|   51|     null|         0|
|   14|Snowflake|         0|
|   12| Teradata|         1|
|   22|    Mysql|         1|
|   12| Teradata|         2|
|   50|Snowflake|         3|
|   15|  Vertica|         3|
+-----+---------+----------+

Example 2: Adding partition Id column in the dataframe.


from pyspark.sql.functions import spark_partition_id
df_update = df.repartition(4)

df_update.rdd.getNumPartitions()
Output: 4


df_update = df_update.withColumn("partition#", spark_partition_id())

df_update.show()
+-----+---------+-------+----------+
|db_id|  db_name|db_type|partition#|
+-----+---------+-------+----------+
|   22|    Mysql|   null|         0|
|   12| Teradata|  RDBMS|         0|
|   51|     null|CloudDB|         1|
|   15|  Vertica|  RDBMS|         1|
|   14|Snowflake|CloudDB|         2|
|   50|Snowflake|  RDBMS|         3|
|   12| Teradata|  RDBMS|         3|
+-----+---------+-------+----------+

Example 3: Check data for a particular partition, Spark_partition_id column function is used in the below example to filter all other data except for 1st and 4th partition.


from pyspark.sql.functions import spark_partition_id
df_update = df.repartition(4)

df_update.rdd.getNumPartitions()
Output: 4


df_update = df_update.select("db_name",spark_partition_id()).filter(spark_partition_id().isin(4,1))

df_update.show()
+-------+--------------------+
|db_name|SPARK_PARTITION_ID()|
+-------+--------------------+
|  Mysql|                   1|
|Vertica|                   4|
+-------+--------------------+

➠ Repartition Function: repartition() function can be used to increase or decrease number of partitions. Target dataframe after applying repartition function is hash partitioned.

Syntax: This function can take upto 2 parameters, at-least one of the two parameters must be passed. If 1st parameter(i.e NumPartitions) is not passed then Spark will create number of partitions equivalent to the value of property "spark.sql.shuffle.partitions".
```
repartition((numPartitions, *cols)
```
- → 1st parameter can be used to specify integer value to create that many number of partitions in a dataframe.
- → 2nd parameter can be used to specify 4 type of values: single column as string, comma separated string for column names, single column expression or comma separated other column expressions.
Example 1: Increasing number of partitions (creating partitions) in a dataframe. Only 1st parameter was passed as input to repartition function.
```
df.rdd.getNumpartitins()

Output: 1


df_update = df.repartition(3)
df_update.rdd.getNumPartitions()

Output: 3
```

Example 2: Creating partitions based on single column, same value from this column will be part of same partition. This can be observed in the below example for values "Snowflake" and "Teradata" where same value belongs to the same partition in a dataframe. As NumPartitions parameter is not passed, the number of partitions in the target dataframe in this case will be equivalent to the value set for "spark.sql.shuffle.partitions" property.


df_update = df.repartition("db_name")

df_update.select("db_name",spark_partition_id()).show()
+---------+--------------------+
|  db_name|SPARK_PARTITION_ID()|
+---------+--------------------+
|     null|                  42|
|    Mysql|                  69|
|  Vertica|                 107|
|Snowflake|                 176|
|Snowflake|                 176|
| Teradata|                 191|
| Teradata|                 191|
+---------+--------------------+


df_update.rdd.getNumPartitions()
Output: 200

Example 3: Creating partitions based on the multiple columns, same value from the combination of columns will be part of same partition. This can be observed in the below example for db_name column value "Teradata" and db_id column value "12" where same combination value belongs to the same partition in a dataframe. As NumPartitions parameter is not passed, the number of partitions in the target dataframe in this case will be equivalent to the value set for "spark.sql.shuffle.partitions" property.


from pyspark.sql.functions import spark_partition_id
df_update = df.repartition("db_name", "db_id")

df_update.select("db_name","db_id",spark_partition_id()).show()
+---------+-----+--------------------+
|  db_name|db_id|SPARK_PARTITION_ID()|
+---------+-----+--------------------+
|     null|   51|                   3|
| Teradata|   12|                  51|
| Teradata|   12|                  51|
|Snowflake|   50|                  55|
|  Vertica|   15|                  77|
|    Mysql|   22|                 118|
|Snowflake|   14|                 124|
+---------+-----+--------------------+


df_update.rdd.getNumPartitions()
Output: 200

Example 4: Creating partitions based on the column name but with NumPartitions parameter to restrict number of partitions in the target dataframe. In this example, you can observe same value for column "db_name" are part of same partition.


from pyspark.sql.functions import spark_partition_id
df_update = df.repartition(2, "db_name")

df_update.select("db_name",spark_partition_id()).show()
+---------+--------------------+
|  db_name|SPARK_PARTITION_ID()|
+---------+--------------------+
|Snowflake|                   0|
|Snowflake|                   0|
|     null|                   0|
| Teradata|                   1|
|  Vertica|                   1|
| Teradata|                   1|
|    Mysql|                   1|
+---------+--------------------+


df_update.rdd.getNumPartitions()
Output: 2

Example 5: Creating partitions based on single column expression, same output value of this column expression will be part of same partition. In this example, substring was used on a column to create partitions based on first letter of the column value. As NumPartitions parameter is not passed, the number of partitions in the target dataframe in this case will be equivalent to the value set for "spark.sql.shuffle.partitions" property.


from pyspark.sql.functions import col, spark_partition_id
df_update = df.repartition(col("db_name").substr(1,1))

df_update.select("db_name",spark_partition_id()).show()
+---------+--------------------+
|  db_name|SPARK_PARTITION_ID()|
+---------+--------------------+
|     null|                  42|
| Teradata|                  44|
| Teradata|                  44|
|    Mysql|                  68|
|  Vertica|                  69|
|Snowflake|                 124|
|Snowflake|                 124|
+---------+--------------------+


df_update.rdd.getNumPartitions()
Output: 200

Example 6: Creating partitions based on the column expression but with NumPartitions parameter to restrict number of partitions in the target dataframe. In this example, you can observe that the same output for column expression "db_id" mod(remainder) 5 belongs to the same partition. "DB_ID" 12 and 22 belongs to same partiton(remainder is 2), similarly 15 and 50 belongs to same partition(remainder is 0).


from pyspark.sql.functions import col, spark_partition_id
df_update = df.repartition(3, col("db_id")%5)

df_update.select("db_id","db_name",spark_partition_id()).show()
+-----+---------+--------------------+
|db_id|  db_name|SPARK_PARTITION_ID()|
+-----+---------+--------------------+
|   12| Teradata|                   0|
|   12| Teradata|                   0|
|   22|    Mysql|                   0|
|   15|  Vertica|                   1|
|   50|Snowflake|                   1|
|   14|Snowflake|                   2|
|   51|     null|                   2|
+-----+---------+--------------------+


df_update.rdd.getNumPartitions()
Output: 3

This tutorial will explain with examples on how to partition a dataframe randomly or based on specified column(s) of a dataframe.

dbmstutorials.com

PySpark: Dataframe Partitions Part 1