PySpark set operators provide ways to combine similar datasets from two DataFrames into a single DataFrame. Most of them work the same way as the mathematical set operations, and they can also be used to compare two tables. The following functions are covered on this page: intersect, intersectAll, subtract, exceptAll, union, unionAll, and unionByName.
First, read the two sample CSV files used throughout the examples:
df_1 = spark.read.option("header",True).csv("file:///path_to_file/set_example_file_1.csv")
df_2 = spark.read.option("header",True).csv("file:///path_to_file/set_example_file_2.csv")
df_1.show()
+-----+---------+-------+
|db_id| db_name|db_type|
+-----+---------+-------+
| 12| Teradata| RDBMS|
| 14|Snowflake|CloudDB|
| 15| Vertica| RDBMS|
| 17| Oracle| RDBMS|
| 17| Oracle| RDBMS|
| 19| MongoDB| NOSQL|
| 19| MongoDB| NOSQL|
+-----+---------+-------+
df_2.show()
+-----+-----------+-------+
|db_id| db_name|db_type|
+-----+-----------+-------+
| 17| Oracle| RDBMS|
| 19| MongoDB| NOSQL|
| 21|SingleStore| RDBMS|
| 22| Mysql| RDBMS|
| 14| Snowflake| RDBMS|
| 17| Oracle| RDBMS|
+-----+-----------+-------+
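If the sample CSV files are not available, the same DataFrames can be built directly in code; a minimal sketch (all columns come through as strings here, just as the header-only CSV read above produces):
# Build the same sample data in code instead of reading the CSV files.
data_1 = [("12", "Teradata", "RDBMS"), ("14", "Snowflake", "CloudDB"),
          ("15", "Vertica", "RDBMS"), ("17", "Oracle", "RDBMS"),
          ("17", "Oracle", "RDBMS"), ("19", "MongoDB", "NOSQL"),
          ("19", "MongoDB", "NOSQL")]
data_2 = [("17", "Oracle", "RDBMS"), ("19", "MongoDB", "NOSQL"),
          ("21", "SingleStore", "RDBMS"), ("22", "Mysql", "RDBMS"),
          ("14", "Snowflake", "RDBMS"), ("17", "Oracle", "RDBMS")]
cols = ["db_id", "db_name", "db_type"]
df_1 = spark.createDataFrame(data_1, cols)
df_2 = spark.createDataFrame(data_2, cols)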
df_1.intersect(df_2).show()
+-----+-------+-------+
|db_id|db_name|db_type|
+-----+-------+-------+
| 17| Oracle| RDBMS|
| 19|MongoDB| NOSQL|
+-----+-------+-------+
df_2.intersect(df_1).show()
+-----+-------+-------+
|db_id|db_name|db_type|
+-----+-------+-------+
| 17| Oracle| RDBMS|
| 19|MongoDB| NOSQL|
+-----+-------+-------+
df_1.intersectAll(df_2).show()
+-----+-------+-------+
|db_id|db_name|db_type|
+-----+-------+-------+
| 17| Oracle| RDBMS|
| 17| Oracle| RDBMS|
| 19|MongoDB| NOSQL|
+-----+-------+-------+
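The same intersect results can also be produced with Spark SQL by registering the DataFrames as temporary views; a sketch (the view names t1 and t2 are arbitrary, and INTERSECT ALL needs Spark 2.4 or later):
# Register temp views and use the SQL set operators.
df_1.createOrReplaceTempView("t1")
df_2.createOrReplaceTempView("t2")
spark.sql("SELECT * FROM t1 INTERSECT SELECT * FROM t2").show()
spark.sql("SELECT * FROM t1 INTERSECT ALL SELECT * FROM t2").show()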
df_1.subtract(df_2).show()
+-----+---------+-------+
|db_id| db_name|db_type|
+-----+---------+-------+
| 15| Vertica| RDBMS|
| 12| Teradata| RDBMS|
| 14|Snowflake|CloudDB|
+-----+---------+-------+
df_2.subtract(df_1).show()
+-----+-----------+-------+
|db_id| db_name|db_type|
+-----+-----------+-------+
| 14| Snowflake| RDBMS|
| 22| Mysql| RDBMS|
| 21|SingleStore| RDBMS|
+-----+-----------+-------+
df_1.exceptAll(df_2).show()
+-----+---------+-------+
|db_id| db_name|db_type|
+-----+---------+-------+
| 15| Vertica| RDBMS|
| 12| Teradata| RDBMS|
| 19| MongoDB| NOSQL|
| 14|Snowflake|CloudDB|
+-----+---------+-------+
df_2.exceptAll(df_1).show()
+-----+-----------+-------+
|db_id| db_name|db_type|
+-----+-----------+-------+
| 14| Snowflake| RDBMS|
| 22| Mysql| RDBMS|
| 21|SingleStore| RDBMS|
+-----+-----------+-------+
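As noted at the top of the page, these operators are handy for comparing two tables. A small sketch that collects the rows present on only one side, tagged with a source column (the column name src is only for illustration):
from pyspark.sql.functions import lit

# Rows in df_1 but not df_2, and rows in df_2 but not df_1, tagged with their source.
only_in_1 = df_1.exceptAll(df_2).withColumn("src", lit("df_1"))
only_in_2 = df_2.exceptAll(df_1).withColumn("src", lit("df_2"))
only_in_1.union(only_in_2).show()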
df_1.union(df_2).show()
+-----+-----------+-------+
|db_id| db_name|db_type|
+-----+-----------+-------+
| 12| Teradata| RDBMS|
| 14| Snowflake|CloudDB|
| 15| Vertica| RDBMS|
| 17| Oracle| RDBMS|
| 17| Oracle| RDBMS|
| 19| MongoDB| NOSQL|
| 19| MongoDB| NOSQL|
| 17| Oracle| RDBMS|
| 19| MongoDB| NOSQL|
| 21|SingleStore| RDBMS|
| 22| Mysql| RDBMS|
| 14| Snowflake| RDBMS|
| 17| Oracle| RDBMS|
+-----+-----------+-------+
df_1.unionAll(df_2).show()
+-----+-----------+-------+
|db_id| db_name|db_type|
+-----+-----------+-------+
| 12| Teradata| RDBMS|
| 14| Snowflake|CloudDB|
| 15| Vertica| RDBMS|
| 17| Oracle| RDBMS|
| 17| Oracle| RDBMS|
| 19| MongoDB| NOSQL|
| 19| MongoDB| NOSQL|
| 17| Oracle| RDBMS|
| 19| MongoDB| NOSQL|
| 21|SingleStore| RDBMS|
| 22| Mysql| RDBMS|
| 14| Snowflake| RDBMS|
| 17| Oracle| RDBMS|
+-----+-----------+-------+
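In PySpark, unionAll() is simply an alias for union(), and neither removes duplicates (unlike SQL UNION). To get distinct rows, chain distinct(), for example:
# Equivalent of SQL UNION (duplicates removed).
df_1.union(df_2).distinct().show()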
unionByName(other, allowMissingColumns=False)
df_3 = spark.createDataFrame(sc.parallelize([('A', 'P', 'P')])).toDF("col1", "col2", "col3")
df_4 = spark.createDataFrame(sc.parallelize([('L', 'E', 'S')])).toDF("col2", "col1", "col3")
df_3.unionByName(df_4).show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| A| P| P|
| E| L| S|
+----+----+----+
df_5 = spark.createDataFrame(sc.parallelize([('A', 'P', 'P')])).toDF("col1", "col2", "col3")
df_6 = spark.createDataFrame(sc.parallelize([('L', 'E', 'S')])).toDF("col2", "col1", "col4")
df_5.unionByName(df_6).show()
pyspark.sql.utils.AnalysisException: Cannot resolve column name "col3" among (col2, col1, col4);
df_5 = spark.createDataFrame(sc.parallelize([('A', 'P', 'P')])).toDF("col1", "col2", "col3")
df_6 = spark.createDataFrame(sc.parallelize([('L', 'E', 'S')])).toDF("col2", "col1", "col4")
df_5.unionByName(df_6, True).show()
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| A| P| P|null|
| E| L|null| S|
+----+----+----+----+
df_1.unionByName(df_2).show()
+-----+-----------+-------+
|db_id| db_name|db_type|
+-----+-----------+-------+
| 12| Teradata| RDBMS|
| 14| Snowflake|CloudDB|
| 15| Vertica| RDBMS|
| 17| Oracle| RDBMS|
| 17| Oracle| RDBMS|
| 19| MongoDB| NOSQL|
| 19| MongoDB| NOSQL|
| 17| Oracle| RDBMS|
| 19| MongoDB| NOSQL|
| 21|SingleStore| RDBMS|
| 22| Mysql| RDBMS|
| 14| Snowflake| RDBMS|
| 17| Oracle| RDBMS|
+-----+-----------+-------+
df_3 = spark.createDataFrame(sc.parallelize([('A', 'P', 'P')])).toDF("col1", "col2", "col3")
df_4 = spark.createDataFrame(sc.parallelize([('L', 'E', 'S')])).toDF("col2", "col1", "col3")
df_3.union(df_4).show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| A| P| P|
| L| E| S|
+----+----+----+
df_3.unionByName(df_4).show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| A| P| P|
| E| L| S|
+----+----+----+
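When more than two DataFrames need to be combined, unionByName can be chained with functools.reduce; a sketch (dfs is a hypothetical list of DataFrames with the same column names):
from functools import reduce

# Union an arbitrary list of DataFrames by column name.
dfs = [df_1, df_2]
combined = reduce(lambda a, b: a.unionByName(b), dfs)
combined.show()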