PySpark: Dataframe Set Operations

PySpark set operators provide ways to combine two dataframes with similar structure into a single dataframe. Most of them work in the same way as the mathematical SET operations, and they can also be used to compare two tables.

The following functions are covered on this page:
Sample Data: The datasets used in the examples below can be downloaded here (dataset 1) and here (dataset 2).
intersect: The intersect function returns the rows common to two dataframes, with duplicates removed. It is equivalent to INTERSECT in SQL, and the order of the two dataframes does not matter. For example:
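
The df_1 and df_2 below are small hypothetical dataframes (not the downloadable datasets above), built in the same style as the other examples on this page, assuming a PySpark shell where spark and sc are already available; row order in the show() outputs may vary.

  df_1 = spark.createDataFrame(sc.parallelize([('A', 'X'), ('A', 'X'), ('A', 'X'), ('B', 'Y'), ('C', 'Z')])).toDF("col1", "col2")
  df_2 = spark.createDataFrame(sc.parallelize([('A', 'X'), ('A', 'X'), ('B', 'Y'), ('D', 'W')])).toDF("col1", "col2")

  -> df_1.intersect(df_2).show()

  +----+----+
  |col1|col2|
  +----+----+
  |   A|   X|
  |   B|   Y|
  +----+----+

Note that ('A', 'X') is returned only once even though it appears several times in each dataframe.
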
intersectAll: The intersectAll function also returns the rows common to two dataframes, but it preserves duplicates: a duplicated row is returned as many times as it appears in both dataframes. It is equivalent to INTERSECT ALL in SQL, and the order of the two dataframes does not matter.
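
Reusing the hypothetical df_1 and df_2 from the intersect example above: ('A', 'X') appears three times in df_1 and twice in df_2, so it is returned twice.

  -> df_1.intersectAll(df_2).show()

  +----+----+
  |col1|col2|
  +----+----+
  |   A|   X|
  |   A|   X|
  |   B|   Y|
  +----+----+
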
subtract: The subtract function returns the rows that are present in the first dataframe but not in the second, with duplicates removed. It is equivalent to MINUS (EXCEPT) in SQL. Unlike intersect, swapping the two dataframes changes the result.
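
Continuing with the same hypothetical dataframes: only ('C', 'Z') exists in df_1 and not in df_2; subtracting in the opposite order would instead return ('D', 'W').

  -> df_1.subtract(df_2).show()

  +----+----+
  |col1|col2|
  +----+----+
  |   C|   Z|
  +----+----+
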
exceptAll: The exceptAll function also returns the rows that are present in the first dataframe but not in the second, but it preserves duplicates: if a row has more occurrences in the first dataframe than in the second, the extra occurrences are returned. It is equivalent to EXCEPT ALL in SQL, and here too the dataframe order matters.
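
With the same hypothetical dataframes: ('A', 'X') appears three times in df_1 but only twice in df_2, so the one extra occurrence is returned along with ('C', 'Z').

  -> df_1.exceptAll(df_2).show()

  +----+----+
  |col1|col2|
  +----+----+
  |   A|   X|
  |   C|   Z|
  +----+----+
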
union / unionAll: The union and unionAll functions combine the rows of two dataframes based on column positions. Duplicate rows are not removed, so both are equivalent to UNION ALL (not UNION) in SQL; chain distinct() afterwards to drop duplicates, as sketched below.
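
A quick sketch of the duplicate behaviour, again with the hypothetical df_1 (5 rows) and df_2 (4 rows) defined above:

  -> df_1.union(df_2).count()
  9

  -> df_1.union(df_2).distinct().count()
  4
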
unionByName: The unionByName function combines rows from two dataframes based on column names. Like union, it does not remove duplicate rows. (See the examples in the next section.)
Union Vs UnionByName: There are two main differences between union and unionByName.
  1. The union function works based on column position, while unionByName works based on column name. This can be observed in the examples below: column names are not respected for the second dataframe under union (the col2 value 'L' is listed under col1 in the output) because rows are combined purely by position.
    
    df_3 = spark.createDataFrame(sc.parallelize([('A', 'P', 'P')])).toDF("col1", "col2", "col3")
    df_4 = spark.createDataFrame(sc.parallelize([('L', 'E', 'S')])).toDF("col2", "col1", "col3")
    
    -> df_3.union(df_4).show()
    
    +----+----+----+
    |col1|col2|col3|
    +----+----+----+
    |   A|   P|   P|
    |   L|   E|   S|
    +----+----+----+
    
    
    
    -> df_3.unionByName(df_4).show()
    
    +----+----+----+
    |col1|col2|col3|
    +----+----+----+
    |   A|   P|   P|
    |   E|   L|   S|
    +----+----+----+
    

  2. Dataframes with mismatched columns can be combined using unionByName with the allowMissingColumns parameter set to True (available from Spark 3.1 onwards): all columns are preserved and missing values are filled with null. The union function, in contrast, does not look at column names at all and combines the dataframes purely by position, using the first dataframe's column names.
    
    df_5 = spark.createDataFrame(sc.parallelize([('A', 'P', 'P')])).toDF("col1", "col2", "col3")
    df_6 = spark.createDataFrame(sc.parallelize([('L', 'E', 'S')])).toDF("col2", "col1", "col4")
    
    -> df_5.union(df_6).show()
    
    +----+----+----+
    |col1|col2|col3|
    +----+----+----+
    |   A|   P|   P|
    |   L|   E|   S|
    +----+----+----+
    
    
    
    -> df_5.unionByName(df_6, allowMissingColumns=True).show()
    
    +----+----+----+----+
    |col1|col2|col3|col4|
    +----+----+----+----+
    |   A|   P|   P|null|
    |   E|   L|null|   S|
    +----+----+----+----+