This tutorial explains the pivot function available in PySpark, which transforms the distinct values of a row-level column into separate columns.
pivot(pivot_col, values=None)
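Here pivot_col is the column whose distinct values become the new column headers, and values is an optional list that restricts (and orders) which of those values are turned into columns. pivot is called on the GroupedData object returned by groupBy and must be followed by an aggregation, so the general call pattern looks like this (a sketch with placeholder column names, not tied to any real dataset):

df.groupBy("grouping_col").pivot("pivot_col").sum("value_col")

To demonstrate it, we start from a small sales ledger stored as a parquet file.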
df = spark.read.parquet("file:///path_to_files/ledger.parquet")
df.show()
+-------+-------+-----+
|year_nr|Quarter|Sales|
+-------+-------+-----+
| 2015| Q1| 90|
| 2015| Q2| 70|
| 2015| Q3| 130|
| 2015| Q4| 30|
| 2016| Q1| 40|
| 2016| Q2| 50|
| 2016| Q3| 120|
| 2016| Q4| 20|
| 2015| Q2| 1000|
| 2015| Q3| 1500|
| 2016| Q1| 1100|
| 2016| Q4| 150|
+-------+-------+-----+
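If you do not have the parquet file at hand, the same DataFrame can be built in memory (a minimal sketch, assuming an active SparkSession named spark; the column names simply mirror the output above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Recreate the ledger shown above
data = [
    (2015, "Q1", 90), (2015, "Q2", 70), (2015, "Q3", 130), (2015, "Q4", 30),
    (2016, "Q1", 40), (2016, "Q2", 50), (2016, "Q3", 120), (2016, "Q4", 20),
    (2015, "Q2", 1000), (2015, "Q3", 1500), (2016, "Q1", 1100), (2016, "Q4", 150),
]
df = spark.createDataFrame(data, ["year_nr", "Quarter", "Sales"])

Notice that some year/quarter combinations appear twice, so the pivot will have to aggregate them. Grouping by year_nr and pivoting on Quarter with a sum of Sales turns each quarter into its own column: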
df_updated = df.groupBy("year_nr").pivot("Quarter").sum("Sales")
df_updated.show()
+-------+----+----+----+---+
|year_nr| Q1| Q2| Q3| Q4|
+-------+----+----+----+---+
| 2015| 90|1070|1630| 30|
| 2016|1140| 50| 120|170|
+-------+----+----+----+---+
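Every distinct value of Quarter has become a column, and the duplicate rows were summed, for example 2015 Q2 = 70 + 1000 = 1070. Because no values list was supplied, Spark first has to determine the distinct quarter values itself before it can pivot, roughly the equivalent of:

df.select("Quarter").distinct().show()

Passing the list of values explicitly skips that extra pass over the data and fixes the column order: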
df_updated = df.groupBy("year_nr").pivot("Quarter", ["Q1","Q2","Q3","Q4"]).sum("Sales")
df_updated.show()
+-------+----+----+----+---+
|year_nr| Q1| Q2| Q3| Q4|
+-------+----+----+----+---+
| 2015| 90|1070|1630| 30|
| 2016|1140| 50| 120|170|
+-------+----+----+----+---+
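With the full list the output is unchanged. The values argument can also name just a subset of the pivot values, in which case everything not listed is simply left out of the result: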
df_updated = df.groupBy("year_nr").pivot("Quarter", ["Q1","Q3","Q4"]).sum("Sales")
df_updated.show()
+-------+----+----+---+
|year_nr| Q1| Q3| Q4|
+-------+----+----+---+
| 2015| 90|1630| 30|
| 2016|1140| 120|170|
+-------+----+----+---+
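Q2 has disappeared entirely; rows for unlisted quarters are ignored rather than folded into another column. The aggregation does not have to be sum either: any aggregate function from pyspark.sql.functions can follow the pivot. Here first keeps only the first Sales value seen for each year/quarter combination instead of adding them up: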
from pyspark.sql.functions import first
df_updated = df.groupBy("year_nr").pivot("Quarter", ["Q1","Q2","Q3","Q4"]).agg(first("Sales"))
df_updated.show()
+-------+---+---+---+---+
|year_nr| Q1| Q2| Q3| Q4|
+-------+---+---+---+---+
| 2015| 90| 70|130| 30|
| 2016| 40| 50|120| 20|
+-------+---+---+---+---+
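Finally, several aggregations can be combined in a single pivot by passing them all to agg; Spark then prefixes each output column with the quarter value and suffixes it with the aggregation alias (a sketch building on the same df, with alias names chosen here purely for readability):

from pyspark.sql import functions as f

df.groupBy("year_nr") \
    .pivot("Quarter", ["Q1", "Q2", "Q3", "Q4"]) \
    .agg(f.sum("Sales").alias("total"), f.first("Sales").alias("first")) \
    .show()

This produces columns such as Q1_total and Q1_first alongside year_nr.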