PySpark: Dataframe Add Columns

This tutorial explains several approaches, with examples, for adding new columns to a dataframe or modifying existing ones. The following topics are covered on this page:


List all Columns: The columns attribute of a dataframe returns all of its column names as a list.
Add Column using withColumn: The withColumn() function can be used on a dataframe either to add a new column or to replace an existing column that has the same name. Each call adds a projection to the query plan, so calling withColumn() repeatedly, for example in a loop to add many columns, can cause performance issues and even a "StackOverflowException". The Spark documentation recommends using select() to add multiple columns at once.
Add Column using select: The select() function can be used with the existing columns to add a new column to the dataframe. The only downside is that you have to specify all of the existing columns (the list is available through df.columns) along with the new column.
Add Column using other dataframe: A column can be added from another dataframe with the help of an outer join. Visit the dataframe join page to learn more about joins.