This tutorial explains several approaches to renaming an existing column in a PySpark DataFrame, each illustrated with an example. All of the examples below use the following sample DataFrame, loaded from a CSV file:
df = spark.read.csv("file:///path_to_files/csv_file_with_duplicates.csv", header=True)
df.show()
+-----+---------+-------+
|db_id|  db_name|db_type|
+-----+---------+-------+
|   12| Teradata|  RDBMS|
|   14|Snowflake|CloudDB|
|   15|  Vertica|  RDBMS|
|   12| Teradata|  RDBMS|
|   22|    Mysql|  RDBMS|
+-----+---------+-------+
df.columns
Output: ['db_id', 'db_name', 'db_type']
withColumnRenamed(existingColumnName, newColumnName)
This function takes two string parameters: the first is the name of the existing column and the second is the new name for the column. Like all DataFrame transformations, it returns a new DataFrame and leaves the original unchanged.
df_updated = df.withColumnRenamed("db_type","db_type_cd")
df_updated.show()
+-----+---------+----------+
|db_id|  db_name|db_type_cd|
+-----+---------+----------+
|   12| Teradata|     RDBMS|
|   14|Snowflake|   CloudDB|
|   15|  Vertica|     RDBMS|
|   12| Teradata|     RDBMS|
|   22|    Mysql|     RDBMS|
+-----+---------+----------+
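Since withColumnRenamed returns a new DataFrame, calls can be chained to rename several columns in one statement. A minimal sketch (the new names db_identifier and db_nm are just illustrative choices):
df_updated = (df.withColumnRenamed("db_id", "db_identifier")
                .withColumnRenamed("db_name", "db_nm"))
df_updated.columns
Output: ['db_identifier', 'db_nm', 'db_type']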
df_updated = df.withColumnRenamed("db_type_test","db_type_cd")
df_updated.show()
+-----+---------+-------+
|db_id|  db_name|db_type|
+-----+---------+-------+
|   12| Teradata|  RDBMS|
|   14|Snowflake|CloudDB|
|   15|  Vertica|  RDBMS|
|   12| Teradata|  RDBMS|
|   22|    Mysql|  RDBMS|
+-----+---------+-------+
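Because a misspelled column name silently becomes a no-op, it can be safer to fail fast when the source column is missing. Below is a minimal sketch of such a guard; the helper name rename_or_fail is our own and not part of the PySpark API:
def rename_or_fail(df, existing_col_name, new_col_name):
    # Raise instead of silently returning the DataFrame unchanged
    if existing_col_name not in df.columns:
        raise ValueError(f"Column '{existing_col_name}' not found in {df.columns}")
    return df.withColumnRenamed(existing_col_name, new_col_name)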
Another approach is to select every column, aliasing the one you want to rename with col().alias():
from pyspark.sql.functions import col
df_updated = df.select("db_id", "db_name", col("db_type").alias("db_type_cd"))
df_updated.show()
+-----+---------+----------+
|db_id|  db_name|db_type_cd|
+-----+---------+----------+
|   12| Teradata|     RDBMS|
|   14|Snowflake|   CloudDB|
|   15|  Vertica|     RDBMS|
|   12| Teradata|     RDBMS|
|   22|    Mysql|     RDBMS|
+-----+---------+----------+
# Same example as above, but building the select list programmatically
# so the remaining column names do not have to be hard-coded
new_col_name = "db_type_cd"
existing_col_name = "db_type"
column_li = df.columns
if existing_col_name in column_li:
    idx = column_li.index(existing_col_name)
    # Replace the plain column name with an aliased Column at the same position
    column_li.pop(idx)
    column_li.insert(idx, col(existing_col_name).alias(new_col_name))
df_updated = df.select(column_li)
df_updated.show()
+-----+---------+----------+
|db_id|  db_name|db_type_cd|
+-----+---------+----------+
|   12| Teradata|     RDBMS|
|   14|Snowflake|   CloudDB|
|   15|  Vertica|     RDBMS|
|   12| Teradata|     RDBMS|
|   22|    Mysql|     RDBMS|
+-----+---------+----------+
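When every column needs a new name, DataFrame.toDF() is a compact alternative: it takes the complete list of new names in positional order and must receive exactly one name per column. The new names below are just examples:
df_renamed = df.toDF("id", "name", "type_cd")
df_renamed.columns
Output: ['id', 'name', 'type_cd']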
To add a prefix to every column name, alias each column inside a list comprehension over df.columns. This pattern is handy, for example, for disambiguating column names before a join:
from pyspark.sql.functions import col
df_prefix = df.select([col(column).alias("prefix_"+column) for column in df.columns])
df_prefix.show()
+------------+--------------+--------------+
|prefix_db_id|prefix_db_name|prefix_db_type|
+------------+--------------+--------------+
|          12|      Teradata|         RDBMS|
|          14|     Snowflake|       CloudDB|
|          15|       Vertica|         RDBMS|
|          12|      Teradata|         RDBMS|
|          22|         Mysql|         RDBMS|
+------------+--------------+--------------+
The same approach adds a suffix to every column name:
from pyspark.sql.functions import col
df_suffix = df.select([col(column).alias(column+"_suffix") for column in df.columns])
df_suffix.show()
+------------+--------------+--------------+
|db_id_suffix|db_name_suffix|db_type_suffix|
+------------+--------------+--------------+
|          12|      Teradata|         RDBMS|
|          14|     Snowflake|       CloudDB|
|          15|       Vertica|         RDBMS|
|          12|      Teradata|         RDBMS|
|          22|         Mysql|         RDBMS|
+------------+--------------+--------------+
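Finally, if you are on PySpark 3.4 or later, DataFrame.withColumnsRenamed accepts a dict that maps existing names to new names, renaming several columns in one call. A minimal sketch, assuming Spark 3.4+:
# Requires PySpark 3.4+
df_updated = df.withColumnsRenamed({"db_id": "id", "db_type": "db_type_cd"})
df_updated.columns
Output: ['id', 'db_name', 'db_type_cd']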