This tutorial explains, with examples, how to use the arrays_overlap and arrays_zip array functions in PySpark. First, create a sample DataFrame with two array columns:
from pyspark.sql import Row
df = spark.createDataFrame([Row(col1=["b", "a", "c", None], col2=["c", "d", "a", "f"])])
df.show()
+----------+------------+
|      col1|        col2|
+----------+------------+
|[b, a, c,]|[c, d, a, f]|
+----------+------------+
arrays_overlap(array_column1, array_column2)
arrays_overlap returns true if the two arrays share at least one non-null element. If there is no common element, the result is null when both arrays are non-empty and either contains a null, and false otherwise. The three examples below cover each case. Here col1 and col2 share "a" and "c", so the result is true:
import pyspark.sql.functions as f
df_updated = df.select(df.col1, df.col2, f.arrays_overlap(df.col1, df.col2).alias("is_overlap"))
df_updated.show()
+----------+------------+----------+
|      col1|        col2|is_overlap|
+----------+------------+----------+
|[b, a, c,]|[c, d, a, f]|      true|
+----------+------------+----------+
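Since arrays_overlap produces a boolean column, it can also be used directly in a filter. A minimal sketch, reusing the DataFrame above (rows where the result is false or null are dropped):
import pyspark.sql.functions as f
# Keep only rows whose array columns share at least one non-null element.
df_overlapping = df.filter(f.arrays_overlap(df.col1, df.col2))
df_overlapping.show()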
In the next example, no element is shared and col1 contains a null, so arrays_overlap returns null:
import pyspark.sql.functions as f
from pyspark.sql import Row
df = spark.createDataFrame([Row(col1=["b", "g", "y", None], col2=["c", "d", "a", "f"])])
df.show()
+----------+------------+
| col1| col2|
+----------+------------+
|[b, g, y,]|[c, d, a, f]|
+----------+------------+
df_updated = df.select(df.col1, df.col2, f.arrays_overlap(df.col1, df.col2).alias("is_overlap"))
df_updated.show()
+----------+------------+----------+
| col1| col2|is_overlap|
+----------+------------+----------+
|[b, g, y,]|[c, d, a, f]| null|
+----------+------------+----------+
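If a plain boolean is needed downstream, one option is to coalesce the null result to false. A sketch, assuming "unknown" should be treated as "no overlap":
import pyspark.sql.functions as f
# arrays_overlap returns null here, so coalesce it to false.
df_updated = df.select(df.col1, df.col2, f.coalesce(f.arrays_overlap(df.col1, df.col2), f.lit(False)).alias("is_overlap"))
df_updated.show()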
Finally, when no element is shared and neither array contains a null, the result is false:
import pyspark.sql.functions as f
from pyspark.sql import Row
df = spark.createDataFrame([Row(col1=["b", "g", "y"], col2=["c", "d", "a", "f"])])
df.show()
+---------+------------+
| col1| col2|
+---------+------------+
|[b, g, y]|[c, d, a, f]|
+---------+------------+
df_updated = df.select(df.col1, df.col2, f.arrays_overlap(df.col1, df.col2).alias("is_overlap"))
df_updated.show()
+---------+------------+----------+
| col1| col2|is_overlap|
+---------+------------+----------+
|[b, g, y]|[c, d, a, f]| false|
+---------+------------+----------+
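The same check can also be written as a SQL expression; a minimal sketch using expr:
import pyspark.sql.functions as f
# Equivalent to the column-based call above.
df_updated = df.select("col1", "col2", f.expr("arrays_overlap(col1, col2)").alias("is_overlap"))
df_updated.show()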
arrays_zip(*columns)
arrays_zip merges the n-th elements of the input array columns into a single array of structs. If one array contains a null at some position, the corresponding struct field is null:
import pyspark.sql.functions as f
from pyspark.sql import Row
df = spark.createDataFrame([Row(col1=["b", "g", "y", None], col2=["c", "d", "a", "f"])])
df_updated = df.select(df.col1, df.col2, f.arrays_zip(df.col1, df.col2).alias("zipped_array"))
df_updated.show(truncate=False)
+----------+------------+-------------------------------+
|col1 |col2 |zipped_array |
+----------+------------+-------------------------------+
|[b, g, y,]|[c, d, a, f]|[[b, c], [g, d], [y, a], [, f]]|
+----------+------------+-------------------------------+
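The zipped column is an array of structs. In recent Spark versions the struct fields take their names from the input columns (col1 and col2 here), which can be confirmed with:
df_updated.printSchema()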
If the arrays have different lengths, the shorter one is padded with null, which again appears as the empty slot in [, f]:
import pyspark.sql.functions as f
from pyspark.sql import Row
df = spark.createDataFrame([Row(col1=["b", "g", "y"], col2=["c", "d", "a", "f"])])
df.show()
+---------+------------+
| col1| col2|
+---------+------------+
|[b, g, y]|[c, d, a, f]|
+---------+------------+
df_updated = df.select(df.col1, df.col2, f.arrays_zip(df.col1, df.col2).alias("zipped_array"))
df_updated.show(truncate=False)
+---------+------------+-------------------------------+
|col1 |col2 |zipped_array |
+---------+------------+-------------------------------+
|[b, g, y]|[c, d, a, f]|[[b, c], [g, d], [y, a], [, f]]|
+---------+------------+-------------------------------+
arrays_zip is not limited to two inputs; any number of array columns can be zipped:
import pyspark.sql.functions as f
from pyspark.sql import Row
df = spark.createDataFrame([Row(col1=["b", "g", "y"], col2=["c", "d", "a", "f"], col3=["p", "e", "k", "t"])])
df.show()
+---------+------------+------------+
| col1| col2| col3|
+---------+------------+------------+
|[b, g, y]|[c, d, a, f]|[p, e, k, t]|
+---------+------------+------------+
df_updated = df.select(df.col1, df.col2, df.col3, f.arrays_zip(df.col1, df.col2, df.col3).alias("zipped_array"))
df_updated.show(truncate=False)
+---------+------------+------------+-------------------------------------------+
|col1 |col2 |col3 |zipped_array |
+---------+------------+------------+-------------------------------------------+
|[b, g, y]|[c, d, a, f]|[p, e, k, t]|[[b, c, p], [g, d, e], [y, a, k], [, f, t]]|
+---------+------------+------------+-------------------------------------------+
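A common follow-up, not shown in the examples above, is to explode the zipped array into one row per position and read the struct fields back out. This sketch assumes the struct fields are named after the source columns, as noted earlier:
import pyspark.sql.functions as f
# One row per zipped position; struct fields keep the source column names.
df_exploded = df_updated.select(f.explode("zipped_array").alias("zipped"))
df_exploded.select("zipped.col1", "zipped.col2", "zipped.col3").show()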