This tutorial explains various approaches, with examples, for modifying / updating existing column values in a DataFrame. Each topic listed below is explained with an example on this page; click an item in the list to jump to the respective section of the page:
# Load the sample CSV (with duplicate rows) into a DataFrame; header=True uses
# the first row as column names. All columns are read as strings by default.
df = spark.read.csv("file:///path_to_files/csv_file_with_duplicates.csv", header=True)
df.show()
+-----+---------+-------+
|db_id| db_name|db_type|
+-----+---------+-------+
| 12| Teradata| RDBMS|
| 14|Snowflake|CloudDB|
| 15| Vertica| RDBMS|
| 12| Teradata| RDBMS|
| 22| Mysql| RDBMS|
+-----+---------+-------+
# Load a second DataFrame with the same schema; used later in the
# join-based column-update example.
df_other = spark.read.csv("file:///path_to_files/join_example_file_2.csv", header=True)
df_other.show()
+-----+-----------+-------+
|db_id| db_name|db_type|
+-----+-----------+-------+
| 17| Oracle| RDBMS|
| 19| MongoDB| NOSQL|
| 21|SingleStore| RDBMS|
| 22| Mysql| RDBMS|
| 14| Snowflake| RDBMS|
+-----+-----------+-------+
withColumn(columnName, columnLogic/columnExpression)
This function takes 2 parameters: the 1st parameter is the name of a new or existing column, and the 2nd parameter is the column logic / column expression for the new or existing column named in the 1st parameter.
from pyspark.sql.functions import col,lit
# Overwrite every value in the existing db_type column with a constant;
# lit() wraps the hardcoded string as a column expression.
df_updated = df.withColumn("db_type",lit("Relation Database"))
df_updated.show()
+-----+---------+-----------------+
|db_id| db_name| db_type|
+-----+---------+-----------------+
| 12| Teradata|Relation Database|
| 14|Snowflake|Relation Database|
| 15| Vertica|Relation Database|
| 12| Teradata|Relation Database|
| 22| Mysql|Relation Database|
+-----+---------+-----------------+
lit() function is used to pass literals i.e. hardcoded (default) values, spark won't take hardcoded values directly.
from pyspark.sql.functions import col,lit
# Derive db_id from its own current value using the modulo operator.
# db_id was read as a string, so Spark implicitly casts it to double
# for the arithmetic — hence the 2.0 / 4.0 style results below.
df_updated = df.withColumn("db_id", col("db_id")%10)
df_updated.show()
+-----+---------+-------+
|db_id| db_name|db_type|
+-----+---------+-------+
| 2.0| Teradata| RDBMS|
| 4.0|Snowflake|CloudDB|
| 5.0| Vertica| RDBMS|
| 2.0| Teradata| RDBMS|
| 2.0| Mysql| RDBMS|
+-----+---------+-------+
col() function is used to refer to an existing column by name; arithmetic operators such as % can be applied directly to the column expression. Since db_id was read as a string, Spark implicitly casts it to double for the arithmetic, which is why the results appear as 2.0, 4.0, etc.
from pyspark.sql.functions import col,lit,substring
# Replace db_name with its first 4 characters; substring() positions
# are 1-based (start at position 1, take 4 characters).
df_updated = df.withColumn("db_name",substring("db_name",1,4))
df_updated.show()
+-----+-------+-------+
|db_id|db_name|db_type|
+-----+-------+-------+
| 12| Tera| RDBMS|
| 14| Snow|CloudDB|
| 15| Vert| RDBMS|
| 12| Tera| RDBMS|
| 22| Mysq| RDBMS|
+-----+-------+-------+
from pyspark.sql.functions import col,lit
# Same constant-value update done via select(): pass the unchanged columns
# through and replace db_type with a literal, aliased back to the same name.
df_updated = df.select("db_id", "db_name", lit("Relation Database").alias("db_type"))
df_updated.show()
+-----+---------+-----------------+
|db_id| db_name| db_type|
+-----+---------+-----------------+
| 12| Teradata|Relation Database|
| 14|Snowflake|Relation Database|
| 15| Vertica|Relation Database|
| 12| Teradata|Relation Database|
| 22| Mysql|Relation Database|
+-----+---------+-----------------+
from pyspark.sql.functions import col,lit
# Same modulo-derived update done via select(): the expression is aliased
# back to "db_id" so the output keeps the original column name.
df_updated = df.select((col("db_id")%10).alias("db_id"), "db_name", "db_type")
df_updated.show()
+-----+---------+-------+
|db_id| db_name|db_type|
+-----+---------+-------+
| 2.0| Teradata| RDBMS|
| 4.0|Snowflake|CloudDB|
| 5.0| Vertica| RDBMS|
| 2.0| Teradata| RDBMS|
| 2.0| Mysql| RDBMS|
+-----+---------+-------+
col() function is used to refer to an existing column; the modulo operator (%) is applied directly to the column expression to derive the new db_id values shown above.
from pyspark.sql.functions import col,lit,substring
# Same substring-based update done via select(): the truncated expression
# is aliased back to "db_name" to preserve the column name.
df_updated = df.select("db_id", substring("db_name",1,4).alias("db_name"), "db_type")
df_updated.show()
+-----+-------+-------+
|db_id|db_name|db_type|
+-----+-------+-------+
| 12| Tera| RDBMS|
| 14| Snow|CloudDB|
| 15| Vert| RDBMS|
| 12| Tera| RDBMS|
| 22| Mysq| RDBMS|
+-----+-------+-------+
substring() function is used to get a part of string from the db_name column.
from pyspark.sql.functions import col,lit,substring
# Multiple columns can be updated in a single select(): db_id via modulo
# and db_name via substring, each aliased back to its original name.
df_updated = df.select( (col("db_id")%10).alias("db_id"), substring("db_name",1,4).alias("db_name"), "db_type" )
df_updated.show()
+-----+-------+-------+
|db_id|db_name|db_type|
+-----+-------+-------+
| 2.0| Tera| RDBMS|
| 4.0| Snow|CloudDB|
| 5.0| Vert| RDBMS|
| 2.0| Tera| RDBMS|
| 2.0| Mysq| RDBMS|
+-----+-------+-------+
when(condition, value to return if condition is true)
otherwise(value to return if none of the conditions are met)
→ when() function takes 2 parameters, 1st parameter is the condition which will evaluate to True/False and 2nd parameter is the value to be returned if condition is evaluated to true.
from pyspark.sql.functions import col,lit,when
# Conditional update via select(): chained when() clauses map RDBMS -> "On Premise"
# and CloudDB -> "Cloud"; otherwise() supplies the fallback for any other value.
df_updated = df.select("db_id", "db_name", when( col("db_type")=="RDBMS", "On Premise").when( col("db_type")=="CloudDB","Cloud" ).otherwise( "Not Known" ).alias("db_type"))
df_updated.show()
+-----+---------+----------+
|db_id| db_name| db_type|
+-----+---------+----------+
| 12| Teradata|On Premise|
| 14|Snowflake| Cloud|
| 15| Vertica|On Premise|
| 12| Teradata|On Premise|
| 22| Mysql|On Premise|
+-----+---------+----------+
from pyspark.sql.functions import col,when
# Same conditional update via withColumn(); the alias() here is redundant
# since withColumn already names the result column "db_type".
df_updated = df.withColumn("db_type", when( col("db_type")=="RDBMS", "On Premise").when( col("db_type")=="CloudDB","Cloud" ).otherwise( "Not Known" ).alias("db_type"))
df_updated.show()
+-----+---------+----------+
|db_id| db_name| db_type|
+-----+---------+----------+
| 12| Teradata|On Premise|
| 14|Snowflake| Cloud|
| 15| Vertica|On Premise|
| 12| Teradata|On Premise|
| 22| Mysql|On Premise|
+-----+---------+----------+
from pyspark.sql.functions import col,when
# Update db_type from another DataFrame: left-join on db_id, then prefer
# df_other's db_type when a match exists (non-null); fall back to df's
# original db_type for rows with no match in df_other.
df_updated = df.join(df_other, "db_id", "left").select("db_id", df.db_name, when(df_other.db_type.isNull(), df.db_type).otherwise(df_other.db_type).alias("db_type"))
df_updated.show()
+-----+---------+-------+
|db_id| db_name|db_type|
+-----+---------+-------+
| 12| Teradata| RDBMS|
| 14|Snowflake| RDBMS|
| 15| Vertica| RDBMS|
| 12| Teradata| RDBMS|
| 22| Mysql| RDBMS|
+-----+---------+-------+
# Show the current schema — every column was read from CSV as a string.
df.printSchema()
root
|-- db_id: string (nullable = true)
|-- db_name: string (nullable = true)
|-- db_type: string (nullable = true)
# Change a column's data type: astype() (an alias of cast()) converts the
# string db_id column to integer, as confirmed by the schema below.
df_updated= df.select(col("db_id").astype("integer"), "db_name", "db_type")
df_updated.printSchema()
root
|-- db_id: integer (nullable = true)
|-- db_name: string (nullable = true)
|-- db_type: string (nullable = true)
# Casting an incompatible column: db_name holds strings like "Teradata"
# that cannot be converted to integer, so Spark returns null instead of
# raising an error — matching the null db_name values in the output below.
# (The original snippet mistakenly repeated the db_id cast from above.)
df_updated= df.select("db_id", col("db_name").astype("integer").alias("db_name"), "db_type")
df_updated.show()
+-----+-------+-------+
|db_id|db_name|db_type|
+-----+-------+-------+
| 12| null| RDBMS|
| 14| null|CloudDB|
| 15| null| RDBMS|
| 12| null| RDBMS|
| 22| null| RDBMS|
+-----+-------+-------+