PySpark: Dataframe Handling Nulls
- This tutorial explains how to use the various functions available in the DataFrameNaFunctions class to handle null or missing values.
Sample Data: The dataset used in the examples below can be downloaded from here.
df = spark.read.parquet("file:///path_to_files/null_handling_sample_data.parquet")
df.show()
+-----+---------+-------+
|db_id| db_name|db_type|
+-----+---------+-------+
| 12| Teradata| RDBMS|
| null| Vertica| null|
| null| null| null|
| 12| null| RDBMS|
| 22| Mysql| null|
| 50|Snowflake| RDBMS|
| 51| null|CloudDB|
+-----+---------+-------+
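If the parquet file is not available, an equivalent sample dataframe can be built inline; a minimal sketch, assuming an active SparkSession named 'spark':
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
# Explicit schema, so the all-null row does not break type inference
schema = StructType([
    StructField("db_id", IntegerType(), True),
    StructField("db_name", StringType(), True),
    StructField("db_type", StringType(), True),
])
data = [
    (12, "Teradata", "RDBMS"),
    (None, "Vertica", None),
    (None, None, None),
    (12, None, "RDBMS"),
    (22, "Mysql", None),
    (50, "Snowflake", "RDBMS"),
    (51, None, "CloudDB"),
]
df = spark.createDataFrame(data, schema)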
➠ drop: The 'drop' function of the 'na' class (DataFrameNaFunctions) can be used to remove rows with null values.
'na.drop' and 'dropna' are aliases of each other.
- Syntax: It can take 3 optional parameters and returns a new processed dataframe.
na.drop( how='any', thresh=None, subset=None)
dropna( how='any', thresh=None, subset=None)
- 1st parameter is 'how', which accepts one of 2 string values ('any', 'all').
The default is 'any', which removes a row if any of its values is null.
'all' removes a row only if all of its values are null.
- 2nd parameter is 'thresh', which takes an int value.
It specifies how many non-null values must be present in a row for the row to be kept, and it overrides the 'how' parameter (a sketch demonstrating this follows Example 7 below).
- 3rd parameter is 'subset', which can be used to pass a list of column names; the function will then check for nulls only in those columns and not in the others.
- Example 1: The default 'any' removes any row in which at least one value is null.
df.dropna().show() # same as df.na.drop().show()
+-----+---------+-------+
|db_id| db_name|db_type|
+-----+---------+-------+
| 12| Teradata| RDBMS|
| 50|Snowflake| RDBMS|
+-----+---------+-------+
- Example 2: When the 'how' parameter is set to 'all', a row is removed only if all of its column values are null.
df.na.drop('all').show() #same as df.dropna('all').show()
+-----+---------+-------+
|db_id| db_name|db_type|
+-----+---------+-------+
| 12| Teradata| RDBMS|
| null| Vertica| null|
| 12| null| RDBMS|
| 22| Mysql| null|
| 50|Snowflake| RDBMS|
| 51| null|CloudDB|
+-----+---------+-------+
- Example 3: The subset parameter is used to check nulls in only 1 column (db_id), so a row is removed only if its "db_id" value is null.
df.na.drop('any', subset=["db_id"]).show()
+-----+---------+-------+
|db_id| db_name|db_type|
+-----+---------+-------+
| 12| Teradata| RDBMS|
| 12| null| RDBMS|
| 22| Mysql| null|
| 50|Snowflake| RDBMS|
| 51| null|CloudDB|
+-----+---------+-------+
- Example 4: The subset parameter is used to check nulls in only 2 columns, and a row is removed if either or both of those columns are null.
df.na.drop('any', subset=["db_id", "db_type"]).show()
+-----+---------+-------+
|db_id| db_name|db_type|
+-----+---------+-------+
| 12| Teradata| RDBMS|
| 12| null| RDBMS|
| 50|Snowflake| RDBMS|
| 51| null|CloudDB|
+-----+---------+-------+
- Example 5: The subset parameter is used to check nulls in only 2 columns, and a row is removed only if both columns are null.
df.na.drop('all', subset=["db_id", "db_type"]).show()
+-----+---------+-------+
|db_id| db_name|db_type|
+-----+---------+-------+
| 12| Teradata| RDBMS|
| 12| null| RDBMS|
| 22| Mysql| null|
| 50|Snowflake| RDBMS|
| 51| null|CloudDB|
+-----+---------+-------+
- Example 6: The 2nd parameter (thresh) is passed as 2, so a row is kept only if it has at least 2 non-null values.
df.dropna('all', 2).show()
+-----+---------+-------+
|db_id| db_name|db_type|
+-----+---------+-------+
| 12| Teradata| RDBMS|
| 12| null| RDBMS|
| 22| Mysql| null|
| 50|Snowflake| RDBMS|
| 51| null|CloudDB|
+-----+---------+-------+
- Example 7: The 2nd parameter (thresh) is passed as 1, so a row is kept only if it has at least 1 non-null value.
df.dropna('all', 1).show()
+-----+---------+-------+
|db_id| db_name|db_type|
+-----+---------+-------+
| 12| Teradata| RDBMS|
| null| Vertica| null|
| 12| null| RDBMS|
| 22| Mysql| null|
| 50|Snowflake| RDBMS|
| 51| null|CloudDB|
+-----+---------+-------+
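Since 'thresh' overrides 'how' once it is set, the two calls below are equivalent; a minimal sketch using keyword arguments, each producing the same rows as Example 6:
df.na.drop(how='any', thresh=2).show()  # 'how' is ignored because thresh is set
df.na.drop(how='all', thresh=2).show()  # same result as the call above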
➠ fill: The 'fill' function of the 'na' class (or the 'fillna' dataframe function) can be used to replace null values in dataframe rows.
'na.fill' and 'fillna' are aliases of each other.
- Syntax: It takes a required 'value' parameter and an optional 'subset' parameter, and returns a new processed dataframe.
na.fill(value, subset=None)
fillna(value, subset=None)
- 1st parameter is 'value', which can take a number, string, boolean or dictionary value.
- If a number is passed, it fills null values with that number in all numerical columns.
- If a string is passed, it fills null values with that string in all string columns.
- If a boolean is passed, it fills null values with that boolean in all boolean columns (a sketch illustrating this follows Example 4 below).
- A dictionary (with column names as keys and the replacement values as values) can be passed to fill columns with customized values. If a dictionary is passed, the subset parameter is ignored.
- 2nd parameter is 'subset', which can be used to pass a list of column names; the function will then fill nulls only in those columns and not in the others.
- Example 1: Number 0 was passed as the parameter, so nulls in all numerical columns are replaced with this value.
As "db_id" is the only integer column, its nulls were updated to 0.
df.fillna(0).show() # same as df.na.fill(0).show()
+-----+---------+-------+
|db_id| db_name|db_type|
+-----+---------+-------+
| 12| Teradata| RDBMS|
| 0| Vertica| null|
| 0| null| null|
| 12| null| RDBMS|
| 22| Mysql| null|
| 50|Snowflake| RDBMS|
| 51| null|CloudDB|
+-----+---------+-------+
- Example 2: String "UnAssigned" was passed as the parameter, so nulls in all string columns are replaced with this value.
Nulls in all string columns ("db_name" and "db_type" in this case) were updated to UnAssigned.
df.fillna("UnAssigned").show() # same as df.na.fill("UnAssigned").show()
+-----+----------+----------+
|db_id| db_name| db_type|
+-----+----------+----------+
| 12| Teradata| RDBMS|
| null| Vertica|UnAssigned|
| null|UnAssigned|UnAssigned|
| 12|UnAssigned| RDBMS|
| 22| Mysql|UnAssigned|
| 50| Snowflake| RDBMS|
| 51|UnAssigned| CloudDB|
+-----+----------+----------+
- Example 3: String "UnAssigned" was passed as the 1st parameter and a list of columns (["db_name"]) was passed as the 2nd parameter.
Only the string columns in the list (just "db_name" in this case) are updated, since a string value was passed.
df.fillna("UnAssigned" , ["db_name"]).show()
+-----+----------+-------+
|db_id| db_name|db_type|
+-----+----------+-------+
| 12| Teradata| RDBMS|
| null| Vertica| null|
| null|UnAssigned| null|
| 12|UnAssigned| RDBMS|
| 22| Mysql| null|
| 50| Snowflake| RDBMS|
| 51|UnAssigned|CloudDB|
+-----+----------+-------+
- Example 4: A dictionary can be passed to fill each column's nulls with its own customized value.
df.fillna({"db_id" : 40, "db_type" : "UnAssigned"}).show()
+-----+---------+----------+
|db_id| db_name| db_type|
+-----+---------+----------+
| 12| Teradata| RDBMS|
| 40| Vertica|UnAssigned|
| 40| null|UnAssigned|
| 12| null| RDBMS|
| 22| Mysql|UnAssigned|
| 50|Snowflake| RDBMS|
| 51| null| CloudDB|
+-----+---------+----------+
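The sample dataset has no boolean column, so the boolean case is shown with a minimal sketch below, using a hypothetical 'is_active' column that is not part of the downloaded data:
from pyspark.sql.types import StructType, StructField, StringType, BooleanType
# Hypothetical dataframe with a boolean column, built only for this sketch
bool_schema = StructType([
    StructField("db_name", StringType(), True),
    StructField("is_active", BooleanType(), True),
])
bool_df = spark.createDataFrame(
    [("Teradata", True), ("Vertica", None), ("Mysql", False)], bool_schema
)
bool_df.fillna(False).show()  # nulls in boolean columns become False
Similarly, the note that a dictionary value makes 'subset' ignored can be verified with a call such as df.fillna({"db_id" : 40}, ["db_name"]).show(), which still fills "db_id" despite the subset list.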
➠ Filter Null Values: Nulls cannot be matched with a regular equality comparison; use the isNull method of the Column class instead.
Rows where "db_type" is null are fetched in the example below.
from pyspark.sql.functions import col
df.filter(col("db_type").isNull()).show()
+-----+-------+-------+
|db_id|db_name|db_type|
+-----+-------+-------+
| null|Vertica| null|
| null| null| null|
| 22| Mysql| null|
+-----+-------+-------+
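df.filter also accepts a SQL expression string, so the same rows can be fetched with an IS NULL predicate:
# Equivalent filter written as a SQL expression string
df.filter("db_type IS NULL").show()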
➠ Filter not Null Values: The isNotNull method of the Column class can be used to keep only the rows where a column is not null.
from pyspark.sql.functions import col
df.filter(col("db_id").isNotNull()).show()
+-----+---------+-------+
|db_id| db_name|db_type|
+-----+---------+-------+
| 12| Teradata| RDBMS|
| 12| null| RDBMS|
| 22| Mysql| null|
| 50|Snowflake| RDBMS|
| 51| null|CloudDB|
+-----+---------+-------+
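Multiple null checks can be combined with the '&' (and) and '|' (or) column operators; a minimal sketch that keeps only rows where both "db_id" and "db_name" are non-null (output not shown):
from pyspark.sql.functions import col
# Both conditions must hold for a row to be kept
df.filter(col("db_id").isNotNull() & col("db_name").isNotNull()).show()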