This tutorial is a continuation of Part 1 and Part 2, which explain how to partition a DataFrame randomly or based on specified column(s) and cover some of the partition-related operations.
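As a quick recap of those two approaches, the sketch below assumes an active SparkSession named spark and a DataFrame df like the one loaded next; passing only a number redistributes rows round-robin, while passing column(s) hash-partitions the data.
# Random (round-robin) repartitioning into a fixed number of partitions
df_random = df.repartition(8)
# Hash partitioning based on one or more specified columns
df_by_col = df.repartition("db_type")
df_by_cols = df.repartition(4, "db_type", "db_name")
The examples in this part use the small sample dataset below, which contains duplicate rows and null values.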
# Read the CSV file into a DataFrame, treating the first line as the header
df = spark.read.csv("file:///path_to_files/csv_with_duplicates_and_nulls.csv", header=True)
df.show()
+-----+---------+-------+
|db_id| db_name|db_type|
+-----+---------+-------+
| 12| Teradata| RDBMS|
| 14|Snowflake|CloudDB|
| 15| Vertica| RDBMS|
| 12| Teradata| RDBMS|
| 22| Mysql| null|
| 50|Snowflake| RDBMS|
| 51| null|CloudDB|
+-----+---------+-------+
# Randomly (round-robin) repartition the DataFrame into 4 partitions
df_update = df.repartition(4)
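A quick aside before changing any configuration: repartition() always performs a full shuffle, even when it only lowers the partition count. If the goal is simply to reduce partitions, coalesce() is usually cheaper because it merges existing partitions without a full shuffle; a minimal sketch (df_fewer is just an illustrative name):
# coalesce() merges existing partitions and avoids a full shuffle
df_fewer = df_update.coalesce(2)
df_fewer.rdd.getNumPartitions()   # returns 2 here, since the 4 partitions are merged down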
# Check the current number of shuffle partitions (the default is 200)
sqlContext.getConf("spark.sql.shuffle.partitions")
Output: 200
# Change the shuffle partition count to 100 for this session
spark.sql("set spark.sql.shuffle.partitions=100")
sqlContext.getConf("spark.sql.shuffle.partitions")
Output: 100
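This setting determines how many partitions result from wide (shuffling) transformations such as groupBy and join; it does not change df_update above, which was given an explicit count of 4. As a rough sketch (assuming adaptive query execution is disabled, since AQE may coalesce shuffle partitions into fewer), the equivalent spark.conf API and the effect on a shuffle look like this:
# Equivalent way to read and set the same property via the SparkSession
spark.conf.get("spark.sql.shuffle.partitions")
spark.conf.set("spark.sql.shuffle.partitions", "100")
# A wide transformation such as groupBy now shuffles into that many partitions
df.groupBy("db_type").count().rdd.getNumPartitions()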
from pyspark.sql.functions import spark_partition_id
# Count how many rows ended up in each of the 4 partitions
df_update.groupBy(spark_partition_id()).count().show()
+--------------------+-----+
|SPARK_PARTITION_ID()|count|
+--------------------+-----+
| 1| 2|
| 3| 2|
| 2| 1|
| 0| 2|
+--------------------+-----+
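The same spark_partition_id() function (already imported above) can also be added as a regular column to see exactly which partition each row landed in; the column name partition_id below is just an illustrative choice.
# Tag every row with the id of the partition it currently belongs to
df_update.withColumn("partition_id", spark_partition_id()).show()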
# Confirm the number of partitions of the repartitioned DataFrame
df_update.rdd.getNumPartitions()
Output: 4
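Finally, to inspect the actual rows inside each partition rather than just their counts, the underlying RDD's glom() can be used. This is only a sketch suitable for small sample data like the above, since glom().collect() brings every partition back to the driver:
# glom() gathers the rows of each partition into a separate list
for i, rows in enumerate(df_update.rdd.glom().collect()):
    print("partition", i, ":", len(rows), "rows")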