This tutorial explains, with examples, how to use the array_position, array_contains, and array_remove array functions in PySpark.
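The examples below assume an active SparkSession named spark, as in the PySpark shell. If you are running a standalone script, a minimal sketch to create one (the app name is arbitrary):

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession; "array-functions-demo" is just an illustrative name
spark = SparkSession.builder.appName("array-functions-demo").getOrCreate()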
First, create a sample DataFrame with a single array column. Note that the second row's array contains a None (null) element.
df = spark.createDataFrame([(["d", "a", "b", "a", "c"],), (["f", "d", "a", None],)], ['data'])
df.show()
+---------------+
| data|
+---------------+
|[d, a, b, a, c]|
| [f, d, a,]|
+---------------+
In the output above, show() renders the null element in the second array as a trailing comma: [f, d, a,].

array_position(column, value)
Returns the 1-based position of the first occurrence of value in the array, or 0 if the value is not found. The result is null only when the array column or the search value itself is null.
Find the position of "a", which appears in both arrays:
import pyspark.sql.functions as f
df_updated = df.select(df.data, f.array_position(df.data, "a").alias("position_check"))
df_updated.show()
+---------------+--------------+
| data|position_check|
+---------------+--------------+
|[d, a, b, a, c]| 2|
| [f, d, a,]| 3|
+---------------+--------------+
"c" appears only in the first array, so the second row returns 0:
import pyspark.sql.functions as f
df_updated = df.select(df.data, f.array_position(df.data, "c").alias("position_check"))
df_updated.show()
+---------------+--------------+
| data|position_check|
+---------------+--------------+
|[d, a, b, a, c]| 5|
| [f, d, a,]| 0|
+---------------+--------------+
"Z" does not appear in either array, so both rows return 0:
import pyspark.sql.functions as f
df_updated = df.select(df.data, f.array_position(df.data, "Z").alias("position_check"))
df_updated.show()
+---------------+--------------+
| data|position_check|
+---------------+--------------+
|[d, a, b, a, c]| 0|
| [f, d, a,]| 0|
+---------------+--------------+
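Because array_position returns 0 when the value is absent, it can double as a membership filter: keep only rows where the position is greater than zero. A small sketch using the df defined above:

import pyspark.sql.functions as f

# Keep only rows whose array contains "c" (a positive position means the value was found)
df_found = df.where(f.array_position(df.data, "c") > 0)
df_found.show()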
array_contains(column, value)
Returns true if the array contains the value and false if it does not. If the value is not found but the array contains a null element, the result is null rather than false.
Check whether each array contains "a", which is present in both rows:
import pyspark.sql.functions as f
df_updated = df.select(df.data, f.array_contains(df.data, "a").alias("contains_check"))
df_updated.show()
+---------------+--------------+
| data|contains_check|
+---------------+--------------+
|[d, a, b, a, c]| true|
| [f, d, a,]| true|
+---------------+--------------+
"c" is missing from the second array, but that array contains a null, so the result is null instead of false:
import pyspark.sql.functions as f
df_updated = df.select(df.data, f.array_contains(df.data, "c").alias("contains_check"))
df_updated.show()
+---------------+--------------+
| data|contains_check|
+---------------+--------------+
|[d, a, b, a, c]| true|
| [f, d, a,]| null|
+---------------+--------------+
"Z" is in neither array; the first row returns false, while the second again returns null because of its null element:
import pyspark.sql.functions as f
df_updated = df.select(df.data, f.array_contains(df.data, "Z").alias("contains_check"))
df_updated.show()
+---------------+--------------+
| data|contains_check|
+---------------+--------------+
|[d, a, b, a, c]| false|
| [f, d, a,]| null|
+---------------+--------------+
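array_contains is the idiomatic way to filter rows by array membership. In a filter, a null result behaves like false, so rows whose arrays contain a null but not the value are dropped. A short sketch:

import pyspark.sql.functions as f

# Keep rows whose array contains "a"; rows where the check yields null are filtered out
df_with_a = df.filter(f.array_contains(df.data, "a"))
df_with_a.show()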
array_remove(column, value)
Returns the array with all occurrences of value removed. Null elements are never removed, because elements are matched against value with an equality check.
Remove "a"; both occurrences are removed from the first array:
import pyspark.sql.functions as f
df_updated = df.select(df.data, f.array_remove(df.data, "a").alias("removed_data"))
df_updated.show()
+---------------+------------+
| data|removed_data|
+---------------+------------+
|[d, a, b, a, c]| [d, b, c]|
| [f, d, a,]| [f, d,]|
+---------------+------------+
Removing "c" only changes the first array; the null in the second array is untouched:
import pyspark.sql.functions as f
df_updated = df.select(df.data, f.array_remove(df.data, "c").alias("removed_data"))
df_updated.show()
+---------------+------------+
| data|removed_data|
+---------------+------------+
|[d, a, b, a, c]|[d, a, b, a]|
| [f, d, a,]| [f, d, a,]|
+---------------+------------+
Removing a value that is not present leaves both arrays unchanged:
import pyspark.sql.functions as f
df_updated = df.select(df.data, f.array_remove(df.data, "Z").alias("removed_data"))
df_updated.show()
+---------------+---------------+
| data| removed_data|
+---------------+---------------+
|[d, a, b, a, c]|[d, a, b, a, c]|
| [f, d, a,]| [f, d, a,]|
+---------------+---------------+
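Since array_remove matches elements by equality, it cannot strip the null out of the second array. On Spark 3.1+, one way to drop nulls (a sketch using the filter higher-order function from pyspark.sql.functions) is:

import pyspark.sql.functions as f

# Remove null elements by keeping only the non-null ones (requires Spark 3.1+)
df_no_nulls = df.select(df.data, f.filter(df.data, lambda x: x.isNotNull()).alias("no_nulls"))
df_no_nulls.show()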