This tutorial explains how to find and remove duplicate rows from a PySpark DataFrame, with examples using the distinct() and dropDuplicates() functions. First, load a sample CSV file that contains a duplicate row:
df = spark.read.csv("file:///path_to_files/csv_file_with_duplicates_v1.csv", header=True)
df.show()
+-----+---------+-------+
|db_id| db_name|db_type|
+-----+---------+-------+
| 12| Teradata| RDBMS|
| 14|Snowflake|CloudDB|
| 15| Vertica| RDBMS|
| 12| Teradata| RDBMS|
| 22| Mysql| RDBMS|
| 50|Snowflake| RDBMS|
+-----+---------+-------+
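A quick way to confirm that the data actually contains duplicates is to compare the total row count with the distinct row count; for the sample data above these come out to 6 and 5:
df.count()            # 6 rows in total, including the repeated Teradata row
df.distinct().count() # 5 unique rows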
distinct()
distinct() returns a new DataFrame containing only the unique rows of the original, comparing all columns:
df_updated = df.distinct()
df_updated.show()
+-----+---------+-------+
|db_id| db_name|db_type|
+-----+---------+-------+
| 15| Vertica| RDBMS|
| 12| Teradata| RDBMS|
| 22| Mysql| RDBMS|
| 14|Snowflake|CloudDB|
| 50|Snowflake| RDBMS|
+-----+---------+-------+
Because distinct() involves a shuffle, the number of partitions in its result is governed by spark.sql.shuffle.partitions (default 200), not by the partitioning of the input DataFrame:
df_orders=spark.read.table("retail.orders_hive_new").repartition(300)
df_orders.rdd.getNumPartitions()
Output: 300
sqlContext.getConf("spark.sql.shuffle.partitions") # Getting property value
Output: '200'
df_orders.distinct().rdd.getNumPartitions()
Output: 200
Changing spark.sql.shuffle.partitions before calling distinct() changes the partition count of the result accordingly:
df_orders=spark.read.table("retail.orders_hive_new").repartition(300)
df_orders.rdd.getNumPartitions()
Output: 300
sqlContext.getConf("spark.sql.shuffle.partitions") # Getting property value
Output: '200'
sqlContext.setConf("spark.sql.shuffle.partitions",305) # Setting a new property value
df_orders.distinct().rdd.getNumPartitions()
Output: 305
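sqlContext.setConf() works, but on Spark 2.x and later the same property can also be read and changed through the SparkSession's runtime configuration; an equivalent sketch:
spark.conf.get("spark.sql.shuffle.partitions")       # read the current value
spark.conf.set("spark.sql.shuffle.partitions", 305)  # set a new value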
dropDuplicates([list of columns])
Called without arguments, dropDuplicates() behaves like distinct(). It can also take a list of columns and then removes rows that are duplicated on just those columns:
df_updated = df.dropDuplicates()
df_updated.show()
+-----+---------+-------+
|db_id| db_name|db_type|
+-----+---------+-------+
| 22| Mysql| RDBMS|
| 12| Teradata| RDBMS|
| 15| Vertica| RDBMS|
| 14|Snowflake|CloudDB|
| 50|Snowflake| RDBMS|
+-----+---------+-------+
When a column list is passed, only those columns are compared. Deduplicating on db_name alone keeps a single Snowflake row, even though the two Snowflake rows differ in db_id and db_type:
df_updated = df.dropDuplicates(["db_name"])
df_updated.show()
+-----+---------+-------+
|db_id| db_name|db_type|
+-----+---------+-------+
| 22| Mysql| RDBMS|
| 12| Teradata| RDBMS|
| 15| Vertica| RDBMS|
| 14|Snowflake|CloudDB|
+-----+---------+-------+
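Note that dropDuplicates() keeps an arbitrary row for each key when the remaining columns differ, so it is not guaranteed which db_id survives for Snowflake. If a specific row must be kept, a window function makes the choice explicit; a minimal sketch, assuming we want the row with the lowest db_id per db_name:
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, col

w = Window.partitionBy("db_name").orderBy(col("db_id"))
df_first = df.withColumn("rn", row_number().over(w)).filter(col("rn") == 1).drop("rn")
df_first.show()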
Deduplicating on both db_id and db_name keeps both Snowflake rows, because their db_id values differ:
df_updated = df.dropDuplicates(["db_id", "db_name"])
df_updated.show()
+-----+---------+-------+
|db_id| db_name|db_type|
+-----+---------+-------+
| 12| Teradata| RDBMS|
| 22| Mysql| RDBMS|
| 50|Snowflake| RDBMS|
| 15| Vertica| RDBMS|
| 14|Snowflake|CloudDB|
+-----+---------+-------+
Like distinct(), dropDuplicates() triggers a shuffle, so its result also picks up spark.sql.shuffle.partitions partitions:
df_orders=spark.read.table("retail.orders_hive_new").repartition(300)
df_orders.rdd.getNumPartitions()
Output: 300
sqlContext.getConf("spark.sql.shuffle.partitions") # Getting property value
Output: '200'
df_orders.dropDuplicates().rdd.getNumPartitions()
Output: 200
Changing the property changes the partition count of the dropDuplicates() result as well:
df_orders=spark.read.table("retail.orders_hive_new").repartition(300)
df_orders.rdd.getNumPartitions()
Output: 300
sqlContext.getConf("spark.sql.shuffle.partitions") # Getting property value
Output: '200'
sqlContext.setConf("spark.sql.shuffle.partitions",310) # Setting a new property value
df_orders.dropDuplicates().rdd.getNumPartitions()
Output: 310
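If only the partition count of the deduplicated result needs to change, and not the global shuffle setting, the result can also be coalesced afterwards; a sketch with an arbitrary target of 50 partitions:
df_orders.dropDuplicates().coalesce(50).rdd.getNumPartitions()  # reduced to 50 partitions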
Duplicates can also be removed with groupBy: grouping on all columns collapses each set of identical rows into a single row:
df_updated = df.groupBy(df.columns).count().select(df.columns)
df_updated.show()
+-----+---------+-------+
|db_id| db_name|db_type|
+-----+---------+-------+
| 14|Snowflake|CloudDB|
| 15| Vertica| RDBMS|
| 12| Teradata| RDBMS|
| 50|Snowflake| RDBMS|
| 22| Mysql| RDBMS|
+-----+---------+-------+
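As a sanity check, this groupBy approach should return the same rows as distinct(); on Spark 2.4+ the two results can be compared with exceptAll, which returns an empty DataFrame when they match:
df_updated.exceptAll(df.distinct()).count()  # 0 when both approaches agree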
To find the rows that are fully duplicated, group on all columns and keep only the groups whose count is greater than 1:
from pyspark.sql.functions import col
df_duplicates = df.groupBy(df.columns).count().filter(col("count")>1)
df_duplicates.show()
+-----+--------+-------+-----+
|db_id| db_name|db_type|count|
+-----+--------+-------+-----+
| 12|Teradata| RDBMS| 2|
+-----+--------+-------+-----+
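If the duplicated rows themselves are needed rather than just their counts, the grouped result can be joined back to the original DataFrame; a sketch:
df_dup_rows = df.join(df_duplicates.drop("count"), on=df.columns, how="inner")
df_dup_rows.show()  # both occurrences of the Teradata row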
To find duplicate values in a single column, group on just that column:
from pyspark.sql.functions import col
df_duplicates = df.groupBy("db_name").count().filter(col("count")>1)
df_duplicates.show()
+---------+-----+
| db_name|count|
+---------+-----+
|Snowflake| 2|
| Teradata| 2|
+---------+-----+