
PySpark: Dataframe Modify Columns

This tutorial will explain various approaches, with examples, for modifying or updating existing column values in a dataframe. The topics listed below will be explained with examples on this page; click an item in the list to jump to the respective section of the page:


Update Column using withColumn: The withColumn() function can be used on a dataframe either to add a new column or to replace an existing column that has the same name. Calling withColumn() many times in a loop to add multiple columns can cause performance issues and even a "StackOverflowException". Spark suggests using the select() function to add multiple columns at once.
Update Column using select: The select() function can be used on existing columns to update a column or add a new column to the dataframe. The only downside is that you have to specify all the columns (the full list can be accessed using df.columns).
Update Column value based on condition: Column values are updated for the db_type column using the when() / otherwise() functions, which are equivalent to a CASE WHEN ... ELSE statement in SQL.
Update Column value using other dataframe: Column values can be updated from another dataframe with the help of outer joins. You can visit the dataframe join page to learn more about joins.
Change Column datatype in dataframe: The astype() column function can be used to change the existing datatype of a column to the required one. Ensure that the target datatype is compatible with the source datatype, otherwise the value will become null.