PySpark: Dataframe To File (Part 1)
This tutorial explains how to write a Spark dataframe into various types of comma-separated value (CSV) files or other delimited files.
- A dataframe's "write" attribute returns a DataFrameWriter, which can be used to export data from a Spark dataframe to CSV file(s).
- The default delimiter for the csv function in Spark is a comma (,).
- By default, DataFrameWriter will create as many files as there are partitions in the dataframe.
- The coalesce() function can be used to reduce the number of partitions, and thereby the number of files created by DataFrameWriter.
- Both the option() and mode() functions can be used to alter the behavior of a write operation, but in different ways.
- Visit the write modes page to understand how the mode function can alter write behaviour when the data/table already exists.
- You can also visit the dataframe options page to understand how the option/options functions can alter other write behaviour.
- The following topics will be covered on this page:
➠ Write CSV file(without header):
By default, Spark creates comma-separated values file(s) without a header when a write operation is invoked on a dataframe.
- Example 1: This is a simple write example where data will be written to the directory "csv_without_header".
- Example 2: Overwrite the existing data (directory) with the content of the dataframe if it exists; otherwise create a new directory.
- Example 3: Write into a single file by reducing the number of partitions to 1 using the coalesce() function.
➠ Write CSV file(with header):
Spark provides a way to write header columns as the 1st row of the file(s) using either the option() or options() function. The options function is used in the example below.
➠ Write Delimited file:
Although CSV files are also delimited files, these examples are mentioned separately here to show how to write delimited files with a customized separator, i.e. a delimiter other than a comma (,).
- Tab delimited file: Many people also choose the tab character as the delimiter for delimited files.
- Pipe delimited file: Pipe (|) is another common delimiter used for delimited files.
- Control(Ctrl) A delimited file: Spark can also write control characters such as Ctrl-A (\u0001) as the delimiter.
cat -v /path_to_directory/ctrl_a_delimited_file.txt
- Multicharacter delimited file: Spark now supports multi-character delimiters for reading and writing files.
- Delimited file with new line characters: Writing data that contains newline characters is straightforward, as the CSV writer quotes such fields automatically.
|db_id| db_name| db_type|
| 12| Teradata| RDBMS|
| 15| Vertica| RDBMS|
| 19| MongoDB| NOSQL|
➠ Write CSV to HDFS:
Spark can also write data to HDFS. There is no syntax difference between writing to a local/server filesystem and HDFS; the only difference is the path.
➠ Write File with Specific Name:
Writing directly to a file with a specific name is not supported in Spark, but there are workarounds.