PySpark: Dataframe To File (Part 2)
This tutorial will explain how to write a Spark dataframe into various types of files (such as JSON, Parquet, ORC and Avro).
- The dataframe's "write" attribute returns a DataFrameWriter, which can be used to export data from a Spark dataframe to most of the common file formats.
- By default, DataFrameWriter will create as many files as there are partitions in the dataframe.
- The coalesce() function can be used to reduce the number of partitions, thereby reducing the number of files created by DataFrameWriter.
- Both the option() and mode() functions can be used to alter the behavior of a write operation, but in different ways.
- Visit the write modes page to understand how the mode() function can alter write behavior when the data/table already exists.
- You can also visit the dataframe options page to understand how the option()/options() functions can alter other write behavior.
- The following topics will be covered on this page:
➠
Write as JSON file: The json() function can be used to write data into a JSON file. This function takes the path to the directory where the file(s) need to be created.
- Example 1: This is a simple write example where data will be written into the directory "json_dir".
df.write.json("file:///path_to_directory/json_dir")
- Example 2: Overwrite the existing data (directory) with the content of the dataframe if it exists, else create a new directory.
df.write.mode("overwrite").json("file:///path_to_directory/json_dir")
- Example 3: Write into a single file by reducing the number of partitions to 1 using the coalesce() function.
df.coalesce(1).write.mode("overwrite").json("file:///path_to_directory/json_dir")
➠
Write as Parquet file: The parquet() function can be used to write data into a Parquet file. This function takes the path to the directory where the file(s) need to be created.
- Example 1: This is a simple write example where data will be written into the directory "parquet_dir".
df.write.parquet("file:///path_to_directory/parquet_dir")
- Example 2: Overwrite the existing data (directory) with the content of the dataframe if it exists, else create a new directory.
df.write.mode("overwrite").parquet("file:///path_to_directory/parquet_dir")
- Example 3: Write into a single file by reducing the number of partitions to 1 using the coalesce() function.
df.coalesce(1).write.mode("overwrite").parquet("file:///path_to_directory/parquet_dir")
➠
Write as ORC file: Spark also supports the ORC file format, which is mostly used in Hive. The orc() function can be used for this purpose. This function takes the path to the directory where the file(s) need to be created.
- Example 1: This is a simple write example where data will be written into the directory "orc_dir".
df.write.orc("file:///path_to_directory/orc_dir")
- Example 2: Overwrite the existing data (directory) with the content of the dataframe if it exists, else create a new directory.
df.write.mode("overwrite").orc("file:///path_to_directory/orc_dir")
- Example 3: Write into a single file by reducing the number of partitions to 1 using the coalesce() function.
df.coalesce(1).write.mode("overwrite").orc("file:///path_to_directory/orc_dir")
➠
Write as Avro file: The Avro file format is not native to Spark, so a spark-avro_x.xx-x.x.x.jar must be added to the Spark library to read/write Avro files.
The Spark Avro jar can be downloaded from the Maven repository; here is the link to
download spark-avro_2.12-3.0.3.jar.
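Instead of manually placing the jar, the spark-avro package can also be pulled from Maven at launch time with the --packages flag, as sketched below (the coordinates match the jar version linked above; adjust them to your Spark/Scala versions, and note that my_avro_job.py is a hypothetical script name):

```shell
# Fetch org.apache.spark:spark-avro_2.12:3.0.3 from Maven at launch time
# and add it to the driver and executor classpaths.
spark-submit --packages org.apache.spark:spark-avro_2.12:3.0.3 my_avro_job.py
```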
- Example 1: This is a simple write example where data will be written into the directory "avro_dir".
df.write.format("avro").save("file:///path_to_directory/avro_dir")
- Example 2: Overwrite the existing data (directory) with the content of the dataframe if it exists, else create a new directory.
df.write.mode("overwrite").format("avro").save("file:///path_to_directory/avro_dir")
- Example 3: Write into a single file by reducing the number of partitions to 1 using the coalesce() function.
df.coalesce(1).write.mode("overwrite").format("avro").save("file:///path_to_directory/avro_dir")