There are multiple ways to generate a sequence number (an incrementing ID) in PySpark. This tutorial explains, with examples, how to generate sequence numbers using the row_number() window function and the monotonically_increasing_id() function. First, read the sample CSV file used throughout the examples:
df = spark.read.csv("file:///path_to_files/csv_with_duplicates_and_nulls.csv",header=True)
df.show()
+-----+---------+-------+
|db_id| db_name|db_type|
+-----+---------+-------+
| 12| Teradata| RDBMS|
| 14|Snowflake|CloudDB|
| 15| Vertica| RDBMS|
| 12| Teradata| RDBMS|
| 22| Mysql| null|
| 50|Snowflake| RDBMS|
| 51| null|CloudDB|
+-----+---------+-------+
Method 1: row_number() window function. row_number() assigns consecutive integers starting at 1, in the order defined by the window specification.

from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

# A window with orderBy but no partitionBy moves every row into a single
# partition, so this approach does not scale to very large DataFrames.
df_update = df.withColumn("seq_num", row_number().over(Window.orderBy("db_id")))
df_update.show()
+-----+---------+-------+-------+
|db_id| db_name|db_type|seq_num|
+-----+---------+-------+-------+
| 12| Teradata| RDBMS| 1|
| 12| Teradata| RDBMS| 2|
| 14|Snowflake|CloudDB| 3|
| 15| Vertica| RDBMS| 4|
| 22| Mysql| null| 5|
| 50|Snowflake| RDBMS| 6|
| 51| null|CloudDB| 7|
+-----+---------+-------+-------+
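Note that the two db_id 12 rows received sequence numbers 1 and 2 in an arbitrary order, because the ordering column contains duplicates. A minimal sketch of one way to make the numbering deterministic, assuming db_name is an acceptable tie-breaker for this data:

from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

# Order by extra columns so duplicate db_id rows are numbered the same
# way on every run (using db_name as the tie-breaker is an assumption).
w = Window.orderBy("db_id", "db_name")
df_update = df.withColumn("seq_num", row_number().over(w))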
Method 2: monotonically_increasing_id() function. The generated IDs are guaranteed to be monotonically increasing and unique, but not consecutive. Because the DataFrame here sits in a single partition, the IDs happen to come out consecutive, starting at 0.

from pyspark.sql.functions import monotonically_increasing_id

df.rdd.getNumPartitions()   # 1
df_update = df.withColumn("seq_num", monotonically_increasing_id())
df_update.show()
+-----+---------+-------+-------+
|db_id| db_name|db_type|seq_num|
+-----+---------+-------+-------+
| 12| Teradata| RDBMS| 0|
| 14|Snowflake|CloudDB| 1|
| 15| Vertica| RDBMS| 2|
| 12| Teradata| RDBMS| 3|
| 22| Mysql| null| 4|
| 50|Snowflake| RDBMS| 5|
| 51| null|CloudDB| 6|
+-----+---------+-------+-------+
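The IDs come out as 0 through 6 only because the DataFrame sits in a single partition. If you want the same behavior for a DataFrame that is already split across partitions, one option (a sketch, and only sensible for small data since it pushes everything through one task) is to collapse to a single partition first:

# Collapse to one partition so a single task generates all the IDs;
# this gives consecutive IDs but sacrifices parallelism.
df_single = df.coalesce(1).withColumn("seq_num", monotonically_increasing_id())
df_single.show()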
Repartitioning the same DataFrame shows why the IDs are not consecutive in general: each partition generates its own range of IDs, so large gaps appear between partitions.

from pyspark.sql.functions import monotonically_increasing_id

df.rdd.getNumPartitions()   # 1 (df freshly read, as above)
df = df.repartition(4)
df.rdd.getNumPartitions()   # 4
df_update = df.withColumn("seq_num", monotonically_increasing_id())
df_update.show()
+-----+---------+-------+-----------+
|db_id| db_name|db_type| seq_num|
+-----+---------+-------+-----------+
| 22| Mysql| null| 0|
| 12| Teradata| RDBMS| 1|
| 51| null|CloudDB| 8589934592|
| 15| Vertica| RDBMS| 8589934593|
| 14|Snowflake|CloudDB|17179869184|
| 50|Snowflake| RDBMS|25769803776|
| 12| Teradata| RDBMS|25769803777|
+-----+---------+-------+-----------+
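The large jumps are not random: monotonically_increasing_id() packs the partition ID into the upper 31 bits of a 64-bit integer and a per-partition record counter into the lower 33 bits. A quick plain-Python check of that layout against the output above:

# Decode the IDs from the output: the upper bits give the partition,
# the lower 33 bits give the row position within that partition.
for mono_id in [0, 1, 8589934592, 8589934593, 17179869184, 25769803777]:
    partition = mono_id >> 33
    offset = mono_id & ((1 << 33) - 1)
    print(f"id={mono_id}: partition={partition}, row_in_partition={offset}")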
Adding spark_partition_id() to the previous example shows which partition each row lives in, which makes the ID ranges easy to interpret: rows in the same partition get contiguous IDs, and every partition starts a new range.

from pyspark.sql.functions import monotonically_increasing_id, spark_partition_id

df.rdd.getNumPartitions()   # 1 (df freshly read, as above)
df = df.repartition(4)
df.rdd.getNumPartitions()   # 4
df_update = (df.withColumn("seq_num", monotonically_increasing_id())
               .withColumn("partition#", spark_partition_id()))
df_update.show()
+-----+---------+-------+-----------+----------+
|db_id| db_name|db_type| seq_num|partition#|
+-----+---------+-------+-----------+----------+
| 22| Mysql| null| 0| 0|
| 12| Teradata| RDBMS| 1| 0|
| 51| null|CloudDB| 8589934592| 1|
| 15| Vertica| RDBMS| 8589934593| 1|
| 14|Snowflake|CloudDB|17179869184| 2|
| 50|Snowflake| RDBMS|25769803776| 3|
| 12| Teradata| RDBMS|25769803777| 3|
+-----+---------+-------+-----------+----------+
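If you need a gapless 1-to-n sequence regardless of partitioning, a common pattern (sketched here) is to combine both methods: tag each row with monotonically_increasing_id() to capture a stable order, then number the rows with row_number() over that order. Note that the window still funnels all rows through one partition for the numbering step.

from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, monotonically_increasing_id

# Convert the unique-but-gappy IDs into a consecutive 1..n sequence.
df_seq = (df.withColumn("mono_id", monotonically_increasing_id())
            .withColumn("seq_num", row_number().over(Window.orderBy("mono_id")))
            .drop("mono_id"))
df_seq.show()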