
PySpark: Dataframe Filters

This tutorial will explain how filters can be used on DataFrames in PySpark. The where() function is an alias for the filter() function. The following topics will be covered on this page:


Filter Syntax: The filter() function takes only one parameter: a condition, or a combination of conditions, that evaluates to either True or False for each row. The where() function is an alias for filter().
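A minimal sketch of the syntax, using a small hypothetical employee DataFrame (the sample data and most column names below are assumptions for illustration; the later snippets on this page reuse this df and the col import):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("filter-examples").getOrCreate()

# Hypothetical sample data reused by the examples on this page.
data = [
    (10, "James", "Sales", 3000, 100),
    (20, "Maria", "Finance", 4500, None),
    (30, "Robert", "Sales", 4100, 100),
    (40, "Jen", "Finance", 2000, 200),
    (50, "Kumar", "Marketing", 2500, None),
]
df = spark.createDataFrame(data, ["emp_id", "name", "dept", "salary", "manager_id"])

# filter() and where() are interchangeable; both take a single condition.
df.filter(col("salary") > 3000).show()
df.where(col("salary") > 3000).show()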
Basic Filters: A basic filter applies a single condition on one column, as shown in the example below.
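Two basic single-condition filters, assuming the sample df defined above; filter() accepts either a Column condition or a SQL-style expression string:

# Equality comparison on a string column.
df.filter(col("dept") == "Sales").show()

# The same style of condition written as a SQL expression string.
df.filter("salary >= 2500").show()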
Filter using IN clause: The isin() method of the col() function can be passed comma-separated values, or a list, to filter data the way SQL's IN clause does.
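A sketch of isin() with comma-separated values, assuming the sample df above:

# Keep rows whose dept is any of the listed values (SQL IN).
df.filter(col("dept").isin("Sales", "Finance")).show()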
Filter using not IN clause: The tilde sign (~) can be used together with the isin() method to negate the condition, keeping only rows whose values are not among the passed comma-separated values or list.
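The same condition negated with the tilde sign, assuming the sample df above:

# Keep rows whose dept is NOT in the listed values (SQL NOT IN).
df.filter(~col("dept").isin("Sales", "Finance")).show()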
Filter using List: A Python list can be passed directly as the parameter to the isin() method.
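A sketch of passing a list, assuming the sample df above; the depts variable is a hypothetical example value:

# A list built elsewhere can be handed to isin() as-is.
depts = ["Sales", "Finance"]
df.filter(col("dept").isin(depts)).show()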
Filter Null Values: Null values cannot be matched with ordinary comparisons; they can only be queried using the isNull() method of the col() function. In the example below, rows are fetched where manager_id is null.
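A sketch of the null filter on the manager_id column of the hypothetical sample df above:

# Equality against None does not match null values; isNull() is required.
df.filter(col("manager_id").isNull()).show()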
Filter not Null Values: The isNotNull() method of the col() function can be used to filter out null values.
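The corresponding not-null filter on the same column:

# Keep only rows where manager_id has a value.
df.filter(col("manager_id").isNotNull()).show()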
Filter using LIKE operator: The like() method of the col() function can be used to filter data based on a partial string or pattern in the column, using SQL wildcards. It can also be used to check whether a column value starts or ends with a particular string or character.
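A sketch of like() patterns, assuming the sample df above; % matches any number of characters and _ matches a single character:

# Name starts with 'J'.
df.filter(col("name").like("J%")).show()
# Name ends with 's'.
df.filter(col("name").like("%s")).show()
# 'ar' appears anywhere in the name.
df.filter(col("name").like("%ar%")).show()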

Filter using not LIKE operator: The tilde sign (~) can be used together with the like() method to negate the pattern passed in the LIKE clause. This can also be used to check whether a column value does not start or end with a particular string or character.
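The negated patterns, assuming the sample df above:

# Name does NOT start with 'J'.
df.filter(~col("name").like("J%")).show()
# Name does NOT end with 's'.
df.filter(~col("name").like("%s")).show()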

Filter using Contains: The contains() method of the col() function looks for a string or character anywhere in the column value and returns the matching rows. contains() can perform a subset of the operations that the LIKE operator can.
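A sketch of contains(), assuming the sample df above:

# 'ar' appears anywhere in the name (equivalent to like("%ar%")).
df.filter(col("name").contains("ar")).show()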
Filter using Between: The between() method of the col() function can be used to filter a column against a lower and an upper bound, both inclusive. In the example below, all rows are returned where salary is between 2500 and 4500.
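A sketch of between() on the salary column of the sample df above:

# Salary between 2500 and 4500, both bounds included.
df.filter(col("salary").between(2500, 4500)).show()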
Multicolumn filters: Multiple columns can be used to filter data in a dataframe. The pipe (|) between conditions performs an OR, and the ampersand (&) performs an AND, as in a SQL WHERE clause; each condition must be wrapped in its own parentheses.
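A sketch of combining conditions across columns, assuming the sample df above; each condition sits in its own parentheses because | and & bind tighter than the comparison operators:

# OR across two columns.
df.filter((col("dept") == "Sales") | (col("salary") > 4000)).show()

# AND across two columns.
df.filter((col("dept") == "Finance") & (col("manager_id").isNotNull())).show()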