PySpark: DataFrame Explode
Sample Data: The following two datasets will be used in the examples below.
Dataset 1:
data_list = [(1, [5,6,7]), (2, [6,7,8])]
df = spark.createDataFrame(sc.parallelize(data_list))
df.show()
+---+---------+
| _1| _2|
+---+---------+
| 1|[5, 6, 7]|
| 2|[6, 7, 8]|
+---+---------+
Dataset 2:
data_list_2 = [(1, [2,3,4]), (2, [3,4,5]), (3, [])]
df_2 = spark.createDataFrame(sc.parallelize(data_list_2))
df_2.show()
+---+---------+
| _1| _2|
+---+---------+
| 1|[2, 3, 4]|
| 2|[3, 4, 5]|
| 3| []|
+---+---------+
➠
explode() function: explode() flattens the values of an array column into rows.
- Syntax: It takes one array column as a parameter and returns the flattened values as rows, in a column named "col".
from pyspark.sql.functions import explode
explode(array_column)
- Example: explode takes an array column as input and returns a column named "col" unless it is aliased to the required name for the flattened column.
from pyspark.sql.functions import explode
df_update = df.select("_1", explode("_2"))
df_update.show()
+---+---+
| _1|col|
+---+---+
| 1| 5|
| 1| 6|
| 1| 7|
| 2| 6|
| 2| 7|
| 2| 8|
+---+---+
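Conceptually, explode behaves like the plain-Python flattening below (a sketch of the semantics, not the PySpark implementation); each (id, array) row expands into one (id, value) row per array element:

```python
# Plain-Python sketch of explode semantics (not the PySpark implementation):
# each (key, array) pair expands into one (key, value) pair per element.
data_list = [(1, [5, 6, 7]), (2, [6, 7, 8])]

exploded = [(key, value) for key, arr in data_list for value in arr]
print(exploded)
# [(1, 5), (1, 6), (1, 7), (2, 6), (2, 7), (2, 8)]
```

In PySpark, the output column can be renamed at the call site with explode("_2").alias("value") instead of accepting the default "col".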
➠
posexplode() function: posexplode() works like explode(), flattening array column values into rows, but it also returns the position of each value as an additional column.
- Syntax: It takes one array column as a parameter and returns the flattened values as rows, with a column named "col" for the values and a column named "pos" for each value's position in the array.
from pyspark.sql.functions import posexplode
posexplode(array_column)
- Example: posexplode takes an array column as input and returns two columns: "pos", the position of the value in the array, and "col", the value itself.
from pyspark.sql.functions import posexplode
df_update = df.select("_1", posexplode("_2"))
df_update.show()
+---+---+---+
| _1|pos|col|
+---+---+---+
| 1| 0| 5|
| 1| 1| 6|
| 1| 2| 7|
| 2| 0| 6|
| 2| 1| 7|
| 2| 2| 8|
+---+---+---+
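The "pos" column can be pictured with plain Python's enumerate (again a sketch of the semantics, not the PySpark implementation):

```python
# Plain-Python sketch of posexplode semantics: enumerate supplies the
# zero-based position ("pos") alongside each value ("col").
data_list = [(1, [5, 6, 7]), (2, [6, 7, 8])]

pos_exploded = [(key, pos, value)
                for key, arr in data_list
                for pos, value in enumerate(arr)]
print(pos_exploded)
# [(1, 0, 5), (1, 1, 6), (1, 2, 7), (2, 0, 6), (2, 1, 7), (2, 2, 8)]
```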
➠
explode_outer() function: explode_outer() works exactly like explode(), with one difference: explode() drops rows whose array is empty (or null), while explode_outer() returns those rows with a null value.
- Syntax: It takes one array column as a parameter and returns the flattened values as rows, in a column named "col".
from pyspark.sql.functions import explode_outer
explode_outer(array_column)
- Example: explode_outer takes an array column as input and returns a column named "col" unless it is aliased to the required name for the flattened column.
from pyspark.sql.functions import explode_outer
df_update = df_2.select("_1", explode_outer("_2"))
df_update.show()
+---+----+
| _1| col|
+---+----+
| 1| 2|
| 1| 3|
| 1| 4|
| 2| 3|
| 2| 4|
| 2| 5|
| 3|null| # Null was returned
+---+----+
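The "outer" behavior can be sketched in plain Python by substituting a single None for an empty array (a conceptual sketch, not the PySpark implementation):

```python
# Plain-Python sketch of explode_outer semantics: an empty array still
# yields one row, with None standing in for the missing value.
data_list_2 = [(1, [2, 3, 4]), (2, [3, 4, 5]), (3, [])]

outer_exploded = [(key, value)
                  for key, arr in data_list_2
                  for value in (arr if arr else [None])]
print(outer_exploded)
# [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (2, 5), (3, None)]
```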
➠
posexplode_outer() function: posexplode_outer() works exactly like posexplode(), with one difference: posexplode() drops rows whose array is empty (or null), while posexplode_outer() returns those rows with null values.
- Syntax: It takes one array column as a parameter and returns the flattened values as rows, with a column named "col" for the values and a column named "pos" for each value's position in the array.
from pyspark.sql.functions import posexplode_outer
posexplode_outer(array_column)
- Example: posexplode_outer takes an array column as input and returns two columns: "pos", the position of the value in the array, and "col", the value itself.
from pyspark.sql.functions import posexplode_outer
df_update = df_2.select("_1", posexplode_outer("_2"))
df_update.show()
+---+----+----+
| _1| pos| col|
+---+----+----+
| 1| 0| 2|
| 1| 1| 3|
| 1| 2| 4|
| 2| 0| 3|
| 2| 1| 4|
| 2| 2| 5|
| 3|null|null| # Null was returned
+---+----+----+
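Combining the two previous sketches gives a plain-Python picture of posexplode_outer (again conceptual, not the PySpark implementation):

```python
# Plain-Python sketch of posexplode_outer semantics: empty arrays keep
# their row, with None for both the position and the value.
data_list_2 = [(1, [2, 3, 4]), (2, [3, 4, 5]), (3, [])]

pos_outer = [(key, pos, value)
             for key, arr in data_list_2
             for pos, value in (enumerate(arr) if arr else [(None, None)])]
print(pos_outer)
# [(1, 0, 2), (1, 1, 3), (1, 2, 4), (2, 0, 3), (2, 1, 4), (2, 2, 5), (3, None, None)]
```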