PySpark: Dataframe Drop Columns

Pradeep


This tutorial explains several ways to drop one or more existing columns from a dataframe, with examples. The following topics are covered on this page:

Drop Column(s) using drop function
Drop Column(s) using select
- Drop Column using select/list
Drop Column(s) after join

Drop Column(s) inplace

Sample Data: The datasets used in the examples below can be downloaded from here (1st file) and here (2nd file).


df = spark.read.csv("file:///path_to_files/csv_file_with_duplicates.csv", header=True)

df.show()
+-----+---------+-------+
|db_id|  db_name|db_type|
+-----+---------+-------+
|   12| Teradata|  RDBMS|
|   14|Snowflake|CloudDB|
|   15|  Vertica|  RDBMS|
|   12| Teradata|  RDBMS|
|   22|    Mysql|  RDBMS|
+-----+---------+-------+


df_other = spark.read.csv("file:///path_to_files/join_example_file_2.csv", header=True)

df_other.show()
+-----+-----------+-------+
|db_id|    db_name|db_type|
+-----+-----------+-------+
|   17|     Oracle|  RDBMS|
|   19|    MongoDB|  NOSQL|
|   21|SingleStore|  RDBMS|
|   22|      Mysql|  RDBMS|
|   14|  Snowflake|  RDBMS|
+-----+-----------+-------+

➠ Drop Column(s) using drop function: The drop() function removes one or more existing columns from a dataframe and returns the result as a new dataframe. If the dataframe schema does not contain the given column, drop() does not fail; it returns a dataframe with the same columns as before.

Syntax:
```
drop(column name / comma separated column names)
```
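This function accepts either a single column name as a string or several column names as separate, comma-separated string arguments. A Column object can be passed as well, as in the join example further down this page.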

Example 1: The "db_type" column is dropped from the "df" dataframe in the example below.


df_updated = df.drop("db_type")

df_updated.show()
+-----+---------+
|db_id|  db_name|
+-----+---------+
|   12| Teradata|
|   14|Snowflake|
|   15|  Vertica|
|   12| Teradata|
|   22|    Mysql|
+-----+---------+

Example 2: Column "db_type_cd" is not present in the dataframe, so the dataframe is returned unchanged in the example below.


df_updated = df.drop("db_type_cd")

df_updated.show()
+-----+---------+-------+
|db_id|  db_name|db_type|
+-----+---------+-------+
|   12| Teradata|  RDBMS|
|   14|Snowflake|CloudDB|
|   15|  Vertica|  RDBMS|
|   12| Teradata|  RDBMS|
|   22|    Mysql|  RDBMS|
+-----+---------+-------+
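
Because drop() silently ignores unknown names, a typo in a column name can go unnoticed. A minimal sketch of a guard is shown below. `drop_strict` is a hypothetical helper name, not part of PySpark, and it relies only on `df.columns` and `df.drop`.

```python
def drop_strict(df, *names):
    """Drop the given columns, raising if any name is not in the dataframe.

    Hypothetical helper, not a PySpark function.
    Note: the membership check is a case-sensitive string comparison,
    whereas Spark resolves column names case-insensitively by default.
    """
    missing = [name for name in names if name not in df.columns]
    if missing:
        raise ValueError(f"Column(s) not found: {missing}")
    # All names exist, so delegate to the regular drop()
    return df.drop(*names)
```

With the sample data, `drop_strict(df, "db_type")` behaves like Example 1, while `drop_strict(df, "db_type_cd")` raises a ValueError instead of returning the dataframe unchanged.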

Example 3: The column names "db_id" and "db_type" are passed as comma-separated arguments to drop both columns from the dataframe in the example below.


df_updated = df.drop("db_id","db_type")

df_updated.show()
+---------+
|  db_name|
+---------+
| Teradata|
|Snowflake|
|  Vertica|
| Teradata|
|    Mysql|
+---------+
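
When the names to drop are already held in a Python list, the list can be unpacked with `*` so that each name reaches drop() as a separate argument. The sketch below assumes all entries are strings; `drop_columns` is a hypothetical helper name used only for illustration.

```python
def drop_columns(df, names):
    """Drop every column named in the list `names` (hypothetical helper)."""
    # drop() expects each name as its own argument, so unpack the list with *
    return df.drop(*names)
```

For the sample data, `drop_columns(df, ["db_id", "db_type"])` is equivalent to the call in Example 3.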

➠ Drop Column(s) using select: The select function can also be used to drop existing columns. List only the columns that are required in the final output inside select and leave out the columns that need to be dropped. The full list of dataframe columns is available as df.columns.

Example 1: db_type needs to be dropped, so it is not listed inside the select function in the example below.


df_updated = df.select("db_id", "db_name")

df_updated.show()
+-----+---------+
|db_id|  db_name|
+-----+---------+
|   12| Teradata|
|   14|Snowflake|
|   15|  Vertica|
|   12| Teradata|
|   22|    Mysql|
+-----+---------+

Example 2: The column is removed from the list of column names, and the updated list (without the dropped column) is passed to the select function in the example below.


# Same example as above, but using a list of column names
col_name_to_drop="db_type"
column_li = df.columns
if col_name_to_drop in column_li:
    column_li.remove(col_name_to_drop)

df_updated = df.select(column_li)

df_updated.show()
+-----+---------+
|db_id|  db_name|
+-----+---------+
|   12| Teradata|
|   14|Snowflake|
|   15|  Vertica|
|   12| Teradata|
|   22|    Mysql|
+-----+---------+
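
The list-based approach extends naturally to several columns at once. The following is a small sketch using a list comprehension; `select_without` is a hypothetical helper name, and names that do not exist in the dataframe are simply ignored.

```python
def select_without(df, names_to_drop):
    """Select every column except those in `names_to_drop` (hypothetical helper)."""
    to_drop = set(names_to_drop)
    # Keep the original column order, skipping the names to drop
    keep = [c for c in df.columns if c not in to_drop]
    return df.select(*keep)
```

For the sample data, `select_without(df, ["db_type"])` selects db_id and db_name, matching Example 2.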

➠ Drop Column(s) after join: After a join it is often necessary to drop duplicate columns, i.e. columns that have the same name in both dataframes. Such columns can be dropped using either of the two approaches shown above.

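Example 1: The "db_id" column from the df_other dataframe is dropped after the join with the "df" dataframe. The other same-named columns (db_name and db_type) are still present from both dataframes.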


df_updated = df.join(df_other,df.db_id==df_other.db_id).drop(df_other.db_id)

df_updated.show()
+-----+---------+-------+---------+-------+
|db_id|  db_name|db_type|  db_name|db_type|
+-----+---------+-------+---------+-------+
|   14|Snowflake|CloudDB|Snowflake|  RDBMS|
|   22|    Mysql|  RDBMS|    Mysql|  RDBMS|
+-----+---------+-------+---------+-------+
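
A possible alternative for the join key, sketched under the assumption of an inner equi-join on a column that has the same name in both dataframes: when the key is passed to join() as a column name instead of a comparison expression, the result carries a single copy of that column, so no drop is needed for it. `join_on_name` is a hypothetical helper name.

```python
def join_on_name(left, right, key):
    """Inner-join two dataframes on a shared column name (hypothetical helper)."""
    # Passing the key as a name (string) rather than an expression such as
    # left[key] == right[key] makes Spark keep only one copy of the join column.
    return left.join(right, on=key, how="inner")
```

With the sample data, `join_on_name(df, df_other, "db_id")` should give the same five columns as Example 1; db_name and db_type still appear twice and have to be handled separately.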

Example 2: The "db_id" and "db_name" columns from the df_other dataframe are dropped after the join with the "df" dataframe. The columns to be dropped are simply not listed in the select function, and the remaining df_other column is renamed with alias to avoid a duplicate name.


df_updated = df.join(df_other,df.db_id==df_other.db_id).select(df.db_id, df.db_name, df.db_type, df_other.db_type.alias("other_db_type"))

df_updated.show()
+-----+---------+-------+-------------+
|db_id|  db_name|db_type|other_db_type|
+-----+---------+-------+-------------+
|   14|Snowflake|CloudDB|        RDBMS|
|   22|    Mysql|  RDBMS|        RDBMS|
+-----+---------+-------+-------------+
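
The drop-based approach from Example 1 can be generalised to remove every right-hand column whose name also exists on the left. This is only a sketch: `join_and_drop_right_duplicates` is a hypothetical helper name, and it assumes the two dataframes come from different sources (column references can become ambiguous in self-joins). Columns are dropped one at a time because older PySpark versions accept only a single Column object per drop() call.

```python
def join_and_drop_right_duplicates(left, right, key):
    """Join on `key`, then drop right-hand columns that share a name with the left.

    Hypothetical helper; assumes `left` and `right` are distinct dataframes.
    """
    joined = left.join(right, left[key] == right[key])
    for name in right.columns:
        if name in left.columns:
            # Pass the Column object (right[name]) rather than the bare name,
            # so that only the right-hand copy of the column is removed.
            joined = joined.drop(right[name])
    return joined
```

With the sample data, where both dataframes share all three column names, the result would be expected to contain only the db_id, db_name and db_type columns of df.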

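➠ Drop Column(s) inplace: A column can be dropped from a dataframe and the result assigned back to the same dataframe variable, so that it looks like an in-place operation. The drop itself still returns a new dataframe; only the variable is rebound.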

Example 1: db_type is dropped using the drop function and the result is assigned back to df in the example below.


df = df.drop("db_type")

df.show()
+-----+---------+
|db_id|  db_name|
+-----+---------+
|   12| Teradata|
|   14|Snowflake|
|   15|  Vertica|
|   12| Teradata|
|   22|    Mysql|
+-----+---------+
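
To illustrate why this only looks like an in-place operation, the sketch below compares the column lists before and after a drop without reassigning the variable. `drop_keeps_original` is a hypothetical helper name and relies only on `df.columns` and `df.drop`.

```python
def drop_keeps_original(df, name):
    """Return the column lists of the original and the trimmed dataframe.

    Hypothetical helper showing that drop() does not modify `df` itself.
    """
    trimmed = df.drop(name)  # new dataframe; `df` is left untouched
    return df.columns, trimmed.columns
```

Calling `before, after = drop_keeps_original(df, "db_type")` on the original sample dataframe should leave db_type in `before` but not in `after`, which is why the result has to be assigned back to `df` to get the effect shown above.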


dbmstutorials.com
