This tutorial will explain various types of joins that are supported in Pyspark and some challenges in joining 2 tables having same column names.

PySpark: Dataframe Joins

This tutorial will explain various types of joins that are supported in Pyspark. It will also cover some challenges in joining 2 tables having same column names. Following topics will be covered on this page:


Join Syntax: Join function can take up to 3 parameters, 1st parameter is mandatory and other 2 are optional.
Inner Join: Inner join can be used to return matched records based on common column(s) from the both dataframes. Inner join is the default join if 3rd parameter is not passed in the join function. Below image shows pictorial representation of inner join, only gray colored portion of data will be return out of 2 dataframes i.e. records which are common to both dataframes.


Left / leftouter / left_outer Join: Left Outer join is used to return matched records from the right dataframe and matched/unmatched records from the left dataframe. Left, leftouter and left_outer Join are alias of each other. Below image shows pictorial representation of left outer join, only gray colored portion of data will be return out of 2 dataframes i.e. all data from the left dataframe and only matched data from the right dataframe.


Right / rightouter / right_outer Join: Right Outer join is used to return matched records from the left dataframe and matched/unmatched records from the right dataframe. Right, rightouter and right_outer are alias of each other. Below image shows pictorial representation of right outer join, only gray colored portion of data will be return out of 2 dataframes i.e. all data from the right dataframe and only matched data from the left dataframe.


Outer / full / fullouter / full_outer Join: Outer or Full_outer join is used to return all of the records that have values in either the left or right dataframe. Outer, full, fullouter and full_outer are alias of each other. Below image shows pictorial representation of full outer join, all gray colored portion of data will be return out of 2 dataframes i.e. all data from the right dataframe and the left dataframe.


Cross Join: This join will returns the cartesian product of rows from the dataframes in the join. Cross join will happen when joining condition or column are not specified.


Semi / leftsemi / left_semi Join: This join is similar to inner join but it will not return data / fields from right dataframe. leftsemi, left_semi and semi are alias of each other.


Anti / leftanti / leftanti Join: This join is a kind of joined minus and it will also not return data from right dataframe. It will return rows where join column value is not present in right dataframe. Anti, leftanti and leftanti are alias of each other. Below image shows pictorial representation of Anti join in spark, only gray colored portion of data will be return after anti join of 2 dataframes i.e. records from the left dataframe which are not common to the right dataframe.


Special Conditions and Challenges in Joins