PySpark: Dataframe To File (Part 1)

Pradeep


This tutorial explains how to write a Spark dataframe to comma-separated value (CSV) files and other delimited files.

The csv() function of DataFrameWriter, which is reached through the dataframe's "write" attribute, can be used to export data from a Spark dataframe to CSV file(s).
The default delimiter for the csv function in Spark is comma (,).
By default, DataFrameWriter creates as many files as there are partitions in the dataframe.
The coalesce() function can be used to reduce the number of partitions, and thereby the number of files created by DataFrameWriter.
Both the option() and mode() functions alter the behavior of the write operation, but in different ways: mode() decides what happens when the target already exists, while option()/options() control other settings such as header and delimiter.
Visit the write modes page to understand how the mode function can be used to alter write behaviour when data/table already exists.
You can also visit the dataframe options page to understand how the option/options functions can be used to alter other write behaviour.
The following topics are covered on this page:

➠ Write CSV file (without header): By default, Spark creates a comma-separated values file without a header when a write operation is invoked on a dataframe.

Example 1: This is a simple write example where data is written to the directory "csv_without_header".
```
df.write.csv("file:///path_to_directory/csv_without_header")
```
Example 2: Overwrite the existing data (directory) with the content of the dataframe if it exists; otherwise create a new directory.
```
df.write.mode("overwrite").csv("file:///path_to_directory/csv_without_header")
```

Example 3: Write into a single file by reducing the partitions to 1 using coalesce() function.
```
df.coalesce(1).write.mode("overwrite").csv("file:///path_to_directory/csv_without_header")
```

➠ CSV file (with header): Spark can write the column names as the first row of the file(s) using either the option() or options() function. The options() function is used in the examples below.

Example 1: This is a simple write example where data is written to the directory "csv_with_header".
```
df.write.options(header=True).csv("file:///path_to_directory/csv_with_header")
```
Example 2: Append the content of the dataframe to the existing data directory if it exists; otherwise create a new directory.
```
df.write.options(header=True).mode("append").csv("file:///path_to_directory/csv_with_header")
```
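
The examples above use options(); the same write can be expressed with option(), which takes one key/value pair per call. The following is a minimal sketch under that assumption. The helper name `write_csv_with_header` is hypothetical, and `df` is assumed to be an existing Spark dataframe.

```python
def write_csv_with_header(df, path, mode="append"):
    """Write df as CSV with a header row, using option() instead of options().

    Hypothetical helper for illustration; df is assumed to be a Spark dataframe.
    """
    # option() sets a single key/value pair; options() accepts keyword arguments.
    df.write.option("header", True).mode(mode).csv(path)


# Usage with an existing Spark dataframe `df`:
# write_csv_with_header(df, "file:///path_to_directory/csv_with_header")
```

Several option() calls can be chained when more than one setting is needed, which is equivalent to passing them together to options().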

➠ Write Delimited file: Although CSV files are also delimited files, the examples below are listed separately to show how to write delimited files with a custom separator, i.e. a delimiter other than comma (,).

Tab delimited file: Tab is a popular choice of delimiter for delimited files.
```
df.write.options(delimiter="\t").csv("file:///path_to_directory/tab_delimited")
```

Pipe delimited file: Pipe (|) is another common delimiter for delimited files.
```
df.write.options(header=True, delimiter="|").csv("file:///path_to_directory/pipe_delimited")
```

Control (Ctrl) A delimited file: Spark can also write control characters such as Ctrl-A as the delimiter. In the sample file content below, cat -v displays each Ctrl-A character as ^A.
```
cat -v /path_to_directory/ctrl_a_delimited_file.txt 
**********File content**********
db_id^Adb_name^Adb_type
12^ATeradata^ARDBMS
14^ASnowflake^ACloudDB
********************************
```
```
df.write.options(header=True, delimiter="\01").csv("file:///path_to_directory/ctrl_a_delimited")
```
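
The delimiter "\01" in the write call is a Python octal escape for the single Ctrl-A character. As a quick illustration, the sketch below uses Python's standard csv module (not Spark) to parse the sample content shown above. The variable names are illustrative only.

```python
import csv
import io

# "\01" is an octal escape; it is the same single character as "\x01" (Ctrl-A).
CTRL_A = "\01"

# The sample file content from above, with ^A written as the real control character.
sample = (
    "db_id\x01db_name\x01db_type\n"
    "12\x01Teradata\x01RDBMS\n"
    "14\x01Snowflake\x01CloudDB\n"
)

# Parse the text using Ctrl-A as the field delimiter.
rows = list(csv.reader(io.StringIO(sample), delimiter=CTRL_A))

for row in rows:
    print(row)
```

Writing the delimiter as "\x01" or "\u0001" produces the same character and may be easier to read than the octal form.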

Multicharacter delimited file: Newer Spark releases (3.0 and later) support a multi-character delimiter for reading and writing files. Sample file content with "|*|" as the delimiter:
```
cat /path_to_directory/multichar_delimited_file.txt 
**********File content**********
db_id|*|db_name|*|db_type
12|*|Teradata|*|RDBMS
14|*|Snowflake|*|CloudDB
********************************
```
```
df.write.options(header=True, delimiter="|*|").csv("file:///path_to_directory/multichar_delimited")
```
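
The delimiter variants covered so far differ only in the keyword arguments passed to options(). One possible way to keep them in a single place is sketched below. The names `DELIMITERS` and `delimited_options` are hypothetical and not part of Spark; `df` is assumed to be an existing Spark dataframe.

```python
# Delimiters used in the examples on this page, keyed by illustrative names.
DELIMITERS = {
    "comma": ",",
    "tab": "\t",
    "pipe": "|",
    "ctrl_a": "\x01",  # same character as "\01"
    "multichar": "|*|",
}


def delimited_options(kind, header=True):
    """Return keyword arguments suitable for DataFrameWriter.options()."""
    if kind not in DELIMITERS:
        raise ValueError(f"unknown delimiter kind: {kind!r}")
    return {"header": header, "delimiter": DELIMITERS[kind]}


# Usage with an existing Spark dataframe `df`:
# df.write.options(**delimited_options("pipe")).csv("file:///path_to_directory/pipe_delimited")
```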

Delimited file with new line characters: Writing values that contain newline characters needs no special option; the write call is the same as for any other CSV file. The sample dataframe below has newline characters inside some values.
```
df_multiline.show()
+-----+---------+--------+
|db_id|  db_name| db_type|
+-----+---------+--------+
|   12| Teradata|RDBMS
DB|
|   14|Snowflake|
CloudDB|
|   15|  Vertica|   RDBMS|
|   17|Oracle
DB|   RDBMS|
|   19|  MongoDB|   NOSQL|
+-----+---------+--------+
```
```
df_multiline.write.options(header=True).csv("file:///path_to_directory/multiline_file_with_header")
```

➠ Write CSV to HDFS: Spark can also write data to HDFS. The syntax is the same as for writing to a local/server path; only the path differs.

Example 1: Write CSV file without header.
```
df.write.csv("hdfs://localhost:9000/user/hive/warehouse/retail.db/orders")
```

Example 2: Write CSV file with header.
```
df.write.options(header=True).csv("hdfs://localhost:9000/user/hive/warehouse/retail.db/orders1")
```

➠ Write File with Specific Name: Spark does not support writing directly to a file with a specific name, but there are workarounds.

Example 1: Convert the Spark dataframe to a Pandas dataframe. Always use "index=False" when writing CSV from a Pandas dataframe; otherwise there will be an extra index column at the beginning. This approach may not be suitable for very large dataframes, because toPandas() collects all the data onto the driver.
```
import pandas as pd

df_pd = df.toPandas() #Spark dataframe (df)
df_pd.to_csv("/path_to_file/test2.csv", sep="|", index=False)
```
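
Another workaround that is often used, sketched here under stated assumptions, is to write with coalesce(1) and then move the single part file to the desired name. The helper name `promote_part_file` is hypothetical. It assumes the output directory is on the local filesystem (a plain path without the "file:///" prefix, not HDFS) and contains exactly one file whose name starts with "part-".

```python
import glob
import os
import shutil


def promote_part_file(output_dir, target_path):
    """Move the single part file found in output_dir to target_path.

    Hypothetical helper; works on local filesystem paths only.
    """
    # Spark names its data files part-*; _SUCCESS and hidden .crc files are not matched.
    part_files = sorted(glob.glob(os.path.join(output_dir, "part-*")))
    if len(part_files) != 1:
        raise ValueError(
            f"expected exactly one part file in {output_dir}, found {len(part_files)}"
        )
    shutil.move(part_files[0], target_path)
    return target_path


# Usage after a single-partition write of an existing Spark dataframe `df`:
# df.coalesce(1).write.mode("overwrite").options(header=True).csv("file:///path_to_directory/tmp_out")
# promote_part_file("/path_to_directory/tmp_out", "/path_to_file/test2.csv")
```

Unlike the Pandas approach, the data is written by Spark itself, but coalesce(1) still funnels the whole dataframe through a single task, so this may also be slow for very large dataframes.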


dbmstutorials.com
