PySpark: File To Dataframe (Part 2)

This tutorial will explain how to read various types of files (such as JSON, Parquet, ORC and Avro) into a Spark dataframe.


Read JSON file: The json() function can be used to read data from a JSON file. Files used in this example can be downloaded from here (normal JSON) and here (multiline JSON).
Read Parquet file: The parquet() function can be used to read data from a Parquet file. The file used in this example can be downloaded from here.
Read ORC file: Spark also supports the ORC file format, which is mostly used in Hive. The orc() function can be used for this purpose. The file used in this example can be downloaded from here.
Read Avro file: The Avro file format is not native to Spark, and a spark-avro_x.xx-x.x.x.jar must be added to the Spark library to read/write Avro files. The Spark Avro jar can be downloaded from the Maven repository; here is the link to download spark-avro_2.12-3.0.3.jar. The file used in this example can be downloaded from here.