This tutorial explains how to use various aggregate functions on a DataFrame in PySpark.

PySpark: Dataframe Aggregate Functions



sum function(): Calculates the sum of each column passed to it, for each group. It can be applied only to numeric columns.


count function(): Counts the number of records in each group.

min function(): Determines the minimum value of each column passed to it, for each group. Called directly on grouped data, it can be applied only to numeric columns. To find the minimum of a non-numeric column, use min inside the 'agg' function.


max function(): Determines the maximum value of each column passed to it, for each group. Called directly on grouped data, it can be applied only to numeric columns. To find the maximum of a non-numeric column, use max inside the 'agg' function.


avg function(): Calculates the average value of each column passed to it, for each group. It can be applied only to numeric columns.


agg function(): Use agg when multiple aggregate functions need to be applied in a single statement. When a dictionary is passed to agg, only one aggregate function per column returns a result, because each column name can appear only once as a dictionary key. To apply multiple aggregate functions to the same column, pass Column functions instead.