PySpark: Dataframe To File (Part 2)
This tutorial will explain how to write a Spark dataframe into various types of files (such as JSON, Parquet, ORC and Avro).
- The dataframe's "write" attribute returns a DataFrameWriter, which can be used to export data from a Spark dataframe to most of the common file formats.
- By default, DataFrameWriter will create as many files as there are partitions in the dataframe.
- The coalesce() function can be used to reduce the number of partitions, thereby reducing the number of files created by DataFrameWriter.
- Both the option() and mode() functions can be used to alter the behavior of a write operation, but in different ways.
- Visit the write modes page to understand how the mode() function can alter write behavior when the data/table already exists.
- You can also visit the dataframe options page to understand how the option()/options() functions can alter other write behavior.
- The following topics will be covered on this page:
➠
Write as JSON file: The json() function can be used to write data into a JSON file. This function takes the path to the directory where the file(s) need to be created.
- Example 1: This is a simple write example where data will be written into the directory "json_dir".
df.write.json("file:///path_to_directory/json_dir")
- Example 2: Overwrite the existing data (directory) with the content of the dataframe if it exists, else create a new directory.
df.write.mode("overwrite").json("file:///path_to_directory/json_dir")
- Example 3: Write into a single file by reducing the number of partitions to 1 using the coalesce() function.
df.coalesce(1).write.mode("overwrite").json("file:///path_to_directory/json_dir")
➠
Write as Parquet file: The parquet() function can be used to write data into a Parquet file. This function takes the path to the directory where the file(s) need to be created.
- Example 1: This is a simple write example where data will be written into the directory "parquet_dir".
df.write.parquet("file:///path_to_directory/parquet_dir")
- Example 2: Overwrite the existing data (directory) with the content of the dataframe if it exists, else create a new directory.
df.write.mode("overwrite").parquet("file:///path_to_directory/parquet_dir")
- Example 3: Write into a single file by reducing the number of partitions to 1 using the coalesce() function.
df.coalesce(1).write.mode("overwrite").parquet("file:///path_to_directory/parquet_dir")
➠
Write as ORC file: Spark also supports the ORC file format, which is mostly used in Hive. The orc() function can be used for this purpose. This function takes the path to the directory where the file(s) need to be created.
- Example 1: This is a simple write example where data will be written into the directory "orc_dir".
df.write.orc("file:///path_to_directory/orc_dir")
- Example 2: Overwrite the existing data (directory) with the content of the dataframe if it exists, else create a new directory.
df.write.mode("overwrite").orc("file:///path_to_directory/orc_dir")
- Example 3: Write into a single file by reducing the number of partitions to 1 using the coalesce() function.
df.coalesce(1).write.mode("overwrite").orc("file:///path_to_directory/orc_dir")
➠
Write as Avro file: The Avro file format is not native to Spark, so a spark-avro_x.xx-x.x.x.jar must be added to the Spark library to read/write Avro files.
The Spark Avro jar can be downloaded from the Maven repository; here is the link to
download spark-avro_2.12-3.0.3.jar.
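Instead of manually placing the jar, the spark-avro package can also be pulled from Maven at launch time with the --packages flag, as sketched below (the coordinates match the jar version linked above; adjust them to your Spark/Scala versions, and note that my_avro_job.py is a hypothetical script name):

```shell
# Fetch org.apache.spark:spark-avro_2.12:3.0.3 from Maven at launch time
# and add it to the driver and executor classpaths.
spark-submit --packages org.apache.spark:spark-avro_2.12:3.0.3 my_avro_job.py
```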
- Example 1: This is a simple write example where data will be written into the directory "avro_dir".
df.write.format("avro").save("file:///path_to_directory/avro_dir")
- Example 2: Overwrite the existing data (directory) with the content of the dataframe if it exists, else create a new directory.
df.write.mode("overwrite").format("avro").save("file:///path_to_directory/avro_dir")
- Example 3: Write into a single file by reducing the number of partitions to 1 using the coalesce() function.
df.coalesce(1).write.mode("overwrite").format("avro").save("file:///path_to_directory/avro_dir")