PySpark: Dataframe Aggregate Functions
- This tutorial will explain how to use various aggregate functions on a dataframe in Pyspark.
- groupBy dataframe function can be used to aggregate values at particular grain (columns).
- groupby is an alias for groupBy dataframe function.
- Some of the dataframe operations including aggregation (groupBy) will result in shuffle partition. Users can visit this page to understand more about shuffle partition.
- Click on item in the below list and it will take you to the respective section of the page:
Sample Data: Dataset used in the below examples can be downloaded from here.
empdf = spark.read.parquet("file:///path_to_files/employee_with_comm.parquet")
empdf.show()
+-------+--------+-------+----------+-------+-----+
| emp_no|emp_name| salary|manager_id|dept_no| comm|
+-------+--------+-------+----------+-------+-----+
|1000245| PRADEEP|5000.00| null| 100| 0.00|
|1000258| BLAKE|2850.00| 1000245| 300|50.00|
|1000262| CLARK|2450.00| 1000245| 100|50.00|
|1000276| JONES|2975.00| 1000245| 200|75.00|
|1000288| SCOTT|3000.00| 1000276| 200| 0.00|
|1000292| FORD|3000.00| 1000276| 200| 0.00|
|1000294| SMITH| 800.00| 1000292| 200| 0.00|
|1000299| ALLEN|1600.00| 1000258| 300| 0.00|
|1000310| WARD|1250.00| 1000258| 300|50.00|
|1000312| MARTIN|1250.00| 1000258| 300|50.00|
|1000315| TURNER|1500.00| 1000258| 300| 0.00|
|1000326| ADAMS|1100.00| 1000288| 200| 0.00|
|1000336| JAMES| 950.00| 1000258| 300|50.00|
|1000346| MILLER|1300.00| 1000262| 100| 0.00|
|1000347| DAVID|1400.00| 1000245| 500| 0.00|
+-------+--------+-------+----------+-------+-----+
➠
sum function(): sum function can be used to calculate sum of each column passed to this function for each group. This function can be applied to only numeric columns.
- Example 1: Simple sum of salary grouped over department number(dept_no).
empdf.groupBy("dept_no").sum("salary").show()
+-------+-----------+
|dept_no|sum(salary)|
+-------+-----------+
| 100| 8750.00|
| 300| 9400.00|
| 500| 1400.00|
| 200| 10875.00|
+-------+-----------+
- Example 2: Sum of salary and commission(comm) grouped over department number(dept_no).
empdf.groupBy("dept_no").sum("salary", "comm").show()
+-------+-----------+---------+
|dept_no|sum(salary)|sum(comm)|
+-------+-----------+---------+
| 100| 8750.00| 50.00|
| 300| 9400.00| 200.00|
| 500| 1400.00| 0.00|
| 200| 10875.00| 75.00|
+-------+-----------+---------+
- Example 3: Sum of salary and commission(comm) grouped over multiple columns(dept_no, manager_id).
empdf.groupBy("dept_no", "manager_id").sum("salary", "comm").show()
+-------+----------+-----------+---------+
|dept_no|manager_id|sum(salary)|sum(comm)|
+-------+----------+-----------+---------+
| 300| 1000258| 6550.00| 150.00|
| 100| null| 5000.00| 0.00|
| 200| 1000276| 6000.00| 0.00|
| 500| 1000245| 1400.00| 0.00|
| 200| 1000292| 800.00| 0.00|
| 300| 1000245| 2850.00| 50.00|
| 200| 1000288| 1100.00| 0.00|
| 100| 1000245| 2450.00| 50.00|
| 100| 1000262| 1300.00| 0.00|
| 200| 1000245| 2975.00| 75.00|
+-------+----------+-----------+---------+
➠
count function(): count function can be used to count number of records for each group.
➠
min function(): min function can be used to determine minimum value in each column passed to this function for each group. This function can be applied to only numeric columns but min can be used for non-numerical columns inside 'agg' function.
- Example 1: Minimum salary in each department( grouped over dept_no).
empdf.groupBy("dept_no").min("salary", "emp_name").show()
+-------+-----------+
|dept_no|min(salary)|
+-------+-----------+
| 100| 1300.00|
| 300| 950.00|
| 500| 1400.00|
| 200| 800.00|
+-------+-----------+
- Example 2: Minimum salary and comm in each department( grouped over dept_no).
empdf.groupBy("dept_no").min("salary","comm").show()
+-------+-----------+---------+
|dept_no|min(salary)|min(comm)|
+-------+-----------+---------+
| 100| 1300.00| 0.00|
| 300| 950.00| 0.00|
| 500| 1400.00| 0.00|
| 200| 800.00| 0.00|
+-------+-----------+---------+
- Example 3: Minimum salary and commission(comm) grouped over multiple columns(dept_no, manager_id).
empdf.groupBy("dept_no", "manager_id").min("salary", "comm").show()
+-------+----------+-----------+---------+
|dept_no|manager_id|min(salary)|min(comm)|
+-------+----------+-----------+---------+
| 300| 1000258| 950.00| 0.00|
| 100| null| 5000.00| 0.00|
| 200| 1000276| 3000.00| 0.00|
| 500| 1000245| 1400.00| 0.00|
| 200| 1000292| 800.00| 0.00|
| 300| 1000245| 2850.00| 50.00|
| 200| 1000288| 1100.00| 0.00|
| 100| 1000245| 2450.00| 50.00|
| 100| 1000262| 1300.00| 0.00|
| 200| 1000245| 2975.00| 75.00|
+-------+----------+-----------+---------+
➠
max function(): max function can be used to determine maximum value in each column passed to this function for each group. This function can be applied to only numeric columns but max can be used for non-numerical columns inside 'agg' function.
- Example 1: Maximum salary in each department( grouped over dept_no).
empdf.groupBy("dept_no").max("salary").show()
+-------+-----------+
|dept_no|max(salary)|
+-------+-----------+
| 100| 5000.00|
| 300| 2850.00|
| 500| 1400.00|
| 200| 3000.00|
+-------+-----------+
- Example 2: Maximum salary and comm in each department( grouped over dept_no).
empdf.groupBy("dept_no").max("salary","comm").show()
+-------+-----------+---------+
|dept_no|max(salary)|max(comm)|
+-------+-----------+---------+
| 100| 5000.00| 50.00|
| 300| 2850.00| 50.00|
| 500| 1400.00| 0.00|
| 200| 3000.00| 75.00|
+-------+-----------+---------+
- Example 3: Maximum salary and commission(comm) grouped over multiple columns(dept_no, manager_id).
empdf.groupBy("dept_no", "manager_id").max("salary", "comm").show()
+-------+----------+-----------+---------+
|dept_no|manager_id|max(salary)|max(comm)|
+-------+----------+-----------+---------+
| 300| 1000258| 1600.00| 50.00|
| 100| null| 5000.00| 0.00|
| 200| 1000276| 3000.00| 0.00|
| 500| 1000245| 1400.00| 0.00|
| 200| 1000292| 800.00| 0.00|
| 300| 1000245| 2850.00| 50.00|
| 200| 1000288| 1100.00| 0.00|
| 100| 1000245| 2450.00| 50.00|
| 100| 1000262| 1300.00| 0.00|
| 200| 1000245| 2975.00| 75.00|
+-------+----------+-----------+---------+
➠
avg function(): avg function can be used to calculate average values of each column passed to this function for each group. This function can be applied to only numeric columns.
- Example 1: Average salary for each department( grouped over dept_no).
empdf.groupBy("dept_no").avg("salary").show()
+-------+-----------+
|dept_no|avg(salary)|
+-------+-----------+
| 100|2916.666667|
| 300|1566.666667|
| 500|1400.000000|
| 200|2175.000000|
+-------+-----------+
- Example 2: Average salary and comm for each department( grouped over dept_no).
empdf.groupBy("dept_no").avg("salary","comm").show()
+-------+-----------+---------+
|dept_no|avg(salary)|avg(comm)|
+-------+-----------+---------+
| 100|2916.666667|16.666667|
| 300|1566.666667|33.333333|
| 500|1400.000000| 0.000000|
| 200|2175.000000|15.000000|
+-------+-----------+---------+
- Example 3: Average salary and commission(comm) grouped over multiple columns(dept_no, manager_id).
empdf.groupBy("dept_no", "manager_id").avg("salary", "comm").show()
+-------+----------+-----------+---------+
|dept_no|manager_id|avg(salary)|avg(comm)|
+-------+----------+-----------+---------+
| 300| 1000258|1310.000000|30.000000|
| 100| null|5000.000000| 0.000000|
| 200| 1000276|3000.000000| 0.000000|
| 500| 1000245|1400.000000| 0.000000|
| 200| 1000292| 800.000000| 0.000000|
| 300| 1000245|2850.000000|50.000000|
| 200| 1000288|1100.000000| 0.000000|
| 100| 1000245|2450.000000|50.000000|
| 100| 1000262|1300.000000| 0.000000|
| 200| 1000245|2975.000000|75.000000|
+-------+----------+-----------+---------+
➠
agg function(): agg function can be used if multiple aggregate functions need to be applied in a single 'Select' statement.
Only 1 aggregate function for a single column will return result when using dictionary as parameter in agg function.
Columns functions can be used to pass same column for multiple aggregate functions.
- Example 1: Using count and sum together for a single column.
empdf.groupBy("dept_no").agg({"*":"count","salary":"sum"}).show()
+-------+--------+-----------+
|dept_no|count(1)|sum(salary)|
+-------+--------+-----------+
| 100| 3| 8750.00|
| 300| 6| 9400.00|
| 500| 1| 1400.00|
| 200| 5| 10875.00|
+-------+--------+-----------+
- Example 2: If you have noticed in the below example, only last function in the dictionary was applied to salary column even thought there are multiple functions for the salary.
This is the limitation of 'agg' function as mentioned above.
empdf.groupBy("dept_no").agg({"*":"count","salary":"sum","salary":"max","salary":"min","emp_name":"max"}).show()
+-------+-------------+--------+-----------+
|dept_no|max(emp_name)|count(1)|min(salary)|
+-------+-------------+--------+-----------+
| 100| PRADEEP| 3| 1300.00|
| 300| WARD| 6| 950.00|
| 500| DAVID| 1| 1400.00|
| 200| SMITH| 5| 800.00|
+-------+-------------+--------+-----------+
- Example 3: Columns functions can be used inside 'agg' function to pass same column for multiple aggregate functions.
from pyspark.sql.functions import min, max, avg, sum, count
empdf.groupBy("dept_no").agg(sum("salary"), min("salary"),max("salary"),avg("salary"),count("*")).show()
+-------+-----------+-----------+-----------+-----------+--------+
|dept_no|sum(salary)|min(salary)|max(salary)|avg(salary)|count(1)|
+-------+-----------+-----------+-----------+-----------+--------+
| 100| 8750.00| 1300.00| 5000.00|2916.666667| 3|
| 300| 9400.00| 950.00| 2850.00|1566.666667| 6|
| 500| 1400.00| 1400.00| 1400.00|1400.000000| 1|
| 200| 10875.00| 800.00| 3000.00|2175.000000| 5|
+-------+-----------+-----------+-----------+-----------+--------+
- Example 4: Same as above example but with column aliases for aggregated functions
from pyspark.sql.functions import min, max, avg, sum, count
empdf.groupBy("dept_no").agg(sum("salary").alias("sum_salary"), min("salary").alias("min_salary"), max("salary").alias("max_salary"), avg("salary").alias("avg_salary"), count("*").alias("count_*") ).show()
+-------+----------+----------+----------+-----------+-------+
|dept_no|sum_salary|min_salary|max_salary| avg_salary|count_*|
+-------+----------+----------+----------+-----------+-------+
| 100| 8750.00| 1300.00| 5000.00|2916.666667| 3|
| 300| 9400.00| 950.00| 2850.00|1566.666667| 6|
| 500| 1400.00| 1400.00| 1400.00|1400.000000| 1|
| 200| 10875.00| 800.00| 3000.00|2175.000000| 5|
+-------+----------+----------+----------+-----------+-------+