PySpark: Dataframe Duplicates

This tutorial will explain how to find and remove duplicate data/rows from a dataframe, with examples using the distinct() and dropDuplicates() functions.

Distinct function: Complete row duplicates can be removed from a dataframe using the distinct() function.
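
Below is a minimal sketch of distinct(); the SparkSession setup and the sample data are assumptions for illustration, not part of the original tutorial.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("duplicates-demo").getOrCreate()

# Hypothetical sample data containing one complete row duplicate
df = spark.createDataFrame(
    [("Alice", 1), ("Bob", 2), ("Alice", 1)],
    ["name", "id"],
)

# distinct() keeps only rows that are unique across ALL columns
df.distinct().show()
```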

dropDuplicates function: The dropDuplicates() function can be used on a dataframe to remove either complete row duplicates or duplicates based on particular column(s). It keeps the first instance of each record and discards the remaining duplicates. drop_duplicates is an alias for dropDuplicates.
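
A sketch of both usages, reusing the hypothetical df defined above:

```python
# With no arguments, dropDuplicates() behaves like distinct(),
# removing complete row duplicates
df.dropDuplicates().show()

# With a column list, it keeps the first row seen for each
# distinct value of "name" and drops the rest
df.dropDuplicates(["name"]).show()
```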

Remove complete row duplicates using an aggregate function: groupBy can be used on all the columns (using df.columns) along with any aggregate function, after which only the required columns are selected, ignoring the new aggregate column. The only reason for using an aggregate function is that groupBy must be followed by one to convert the grouped result back into a dataframe. The aggregation also triggers a shuffle, so the number of partitions in the target dataframe will equal the value of the "spark.sql.shuffle.partitions" property.
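
A sketch of this approach on the assumed df; the count() aggregate and the "cnt" alias are arbitrary choices, since any aggregate function would serve.

```python
from pyspark.sql.functions import count

# Group on every column, aggregate, then select only the
# original columns, discarding the aggregate column
dedup_df = (
    df.groupBy(df.columns)
      .agg(count("*").alias("cnt"))
      .select(df.columns)
)
dedup_df.show()
```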
Find complete row duplicates: groupBy can be used on all the columns (using df.columns) along with the count() aggregate function, and then filter can be used to get the duplicate records.
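
A sketch that keeps only the groups occurring more than once in the assumed df:

```python
from pyspark.sql.functions import count

# Rows whose full-column combination appears more than once
duplicates_df = (
    df.groupBy(df.columns)
      .agg(count("*").alias("cnt"))
      .filter("cnt > 1")
)
duplicates_df.show()
```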
Find column level duplicates: groupBy with the required column(s) can be used along with the count() aggregate function, and then filter can be used to get the duplicate records.
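
A sketch checking for duplicates on the hypothetical "name" column only:

```python
from pyspark.sql.functions import count

# "name" values that appear in more than one row
dup_names_df = (
    df.groupBy("name")
      .agg(count("*").alias("cnt"))
      .filter("cnt > 1")
)
dup_names_df.show()
```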