This tutorial explains how to read various types of comma-separated value (CSV) files, and other delimited files, into a Spark DataFrame.
df = spark.read.csv("file:///path_to_file/tutorial_file.txt")
df.show()
+---+---------+-------+
|_c0| _c1| _c2|
+---+---------+-------+
| 12| Teradata| RDBMS|
| 14|Snowflake|CloudDB|
| 15| Vertica| RDBMS|
| 17| Oracle| RDBMS|
| 19| MongoDB| NOSQL|
+---+---------+-------+
The toDF method assigns meaningful column names positionally:
df = spark.read.csv("file:///path_to_file/tutorial_file.txt").toDF("db_id", "db_name", "db_type")
df.show()
+-----+---------+-------+
|db_id| db_name|db_type|
+-----+---------+-------+
| 12| Teradata| RDBMS|
| 14|Snowflake|CloudDB|
| 15| Vertica| RDBMS|
| 17| Oracle| RDBMS|
| 19| MongoDB| NOSQL|
+-----+---------+-------+
If the file contains a header row, set the header option and Spark takes the column names from the first line:
df = spark.read.options(header=True).csv("file:///path_to_file/tutorial_file_with_header.txt")
df.show()
+-----+---------+-------+
|db_id| db_name|db_type|
+-----+---------+-------+
| 12| Teradata| RDBMS|
| 14|Snowflake|CloudDB|
| 15| Vertica| RDBMS|
| 17| Oracle| RDBMS|
| 19| MongoDB| NOSQL|
+-----+---------+-------+
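By default every column is read as a string. Setting inferSchema=True makes Spark scan the data an extra time and infer column types; a minimal sketch against the same header file (db_id should come back as an integer):
df = spark.read.options(header=True, inferSchema=True).csv("file:///path_to_file/tutorial_file_with_header.txt")
df.printSchema()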
For a tab delimited file, pass the delimiter option:
df = spark.read.options(delimiter="\t").csv("file:///path_to_file/tab_delimited_file.txt")
df.show()
+---+---------+-------+
|_c0| _c1| _c2|
+---+---------+-------+
| 12| Teradata| RDBMS|
| 14|Snowflake|CloudDB|
| 15| Vertica| RDBMS|
| 17| Oracle| RDBMS|
| 19| MongoDB| NOSQL|
| 21| Tab| Test|
+---+---------+-------+
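The keyword-argument style of options() used above is interchangeable with chaining individual option() calls; this reads the same tab delimited file:
df = spark.read.option("delimiter", "\t").csv("file:///path_to_file/tab_delimited_file.txt")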
Multiple options can be combined; here a header row and a pipe (|) delimiter:
df = spark.read.options(header=True, delimiter="|").csv("file:///path_to_file/pipe_delimited_file.txt")
df.show()
+-----+---------+-------+
|db_id| db_name|db_type|
+-----+---------+-------+
| 12| Teradata| RDBMS|
| 14|Snowflake|CloudDB|
| 15| Vertica| RDBMS|
| 17| Oracle| RDBMS|
| 19| MongoDB| NOSQL|
| 21| Pipe| Test|
+-----+---------+-------+
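Note that sep is accepted as an alias for delimiter, so the following is equivalent:
df = spark.read.options(header=True, sep="|").csv("file:///path_to_file/pipe_delimited_file.txt")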
Files delimited by the non-printing Ctrl-A character look like this under cat -v, which renders Ctrl-A as ^A:
cat -v /path_to_file/ctrl_a_delimited_file.txt
**********File content**********
db_id^Adb_name^Adb_type
12^ATeradata^ARDBMS
14^ASnowflake^ACloudDB
********************************
Pass Ctrl-A as the delimiter using a Python escape sequence:
df = spark.read.options(header=True, delimiter="\01").csv("file:///path_to_file/ctrl_a_delimited_file.txt")
df.show()
+-----+---------+-------+
|db_id| db_name|db_type|
+-----+---------+-------+
| 12| Teradata| RDBMS|
| 14|Snowflake|CloudDB|
| 15| Vertica| RDBMS|
| 17| Oracle| RDBMS|
| 19| MongoDB| NOSQL|
| 21| CtrlA| Test|
+-----+---------+-------+
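"\01" is Python's octal escape for the Ctrl-A byte; the same character can be written more explicitly as "\x01" or "\u0001":
df = spark.read.options(header=True, delimiter="\u0001").csv("file:///path_to_file/ctrl_a_delimited_file.txt")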
Delimiters can also be longer than one character:
cat /path_to_file/multichar_delimited_file.txt
**********File content**********
db_id|*|db_name|*|db_type
12|*|Teradata|*|RDBMS
14|*|Snowflake|*|CloudDB
********************************
Pass the full multi-character sequence as the delimiter:
df = spark.read.options(header=True, delimiter="|*|").csv("file:///path_to_file/multichar_delimited_file.txt")
df.show()
+-----+----------------+-------+
|db_id| db_name|db_type|
+-----+----------------+-------+
| 12| Teradata| RDBMS|
| 14| Snowflake|CloudDB|
| 15| Vertica| RDBMS|
| 17| Oracle| RDBMS|
| 19| MongoDB| NOSQL|
| 21|PipeAsteriskPipe| Test|
+-----+----------------+-------+
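Multi-character delimiters are supported natively from Spark 3.0 onwards. On older versions, one workaround (a sketch, not the only approach) is to read the file as plain text and split each line with a regular expression:
from pyspark.sql import functions as f

raw = spark.read.text("file:///path_to_file/multichar_delimited_file.txt")
parts = f.split(f.col("value"), r"\|\*\|")  # escape | and *, since split() takes a regex
df = raw.select(
    parts.getItem(0).alias("db_id"),
    parts.getItem(1).alias("db_name"),
    parts.getItem(2).alias("db_type"),
).filter(f.col("db_id") != "db_id")  # drop the header row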
Quoted fields can contain embedded newlines; enable the multiLine option so Spark treats such records as a single row:
df_multiline = spark.read.option("delimiter", "|").option("header", True).option("multiLine", True).csv("file:///path_to_directory/multiline_file_with_header.txt")
df_multiline.show()
+-----+---------+--------+
|db_id| db_name| db_type|
+-----+---------+--------+
| 12| Teradata|RDBMS
DB|
| 14|Snowflake|
CloudDB|
| 15| Vertica| RDBMS|
| 17|Oracle
DB| RDBMS|
| 19| MongoDB| NOSQL|
+-----+---------+--------+
Replacing the embedded newline characters with ** using regexp_replace:
from pyspark.sql import functions as f

df_multiline.select("db_id", f.regexp_replace(f.col("db_name"), "\n", "**").alias("db_name"), f.regexp_replace(f.col("db_type"), "\n", "**").alias("db_type")).show()
+-----+----------+---------+
|db_id| db_name| db_type|
+-----+----------+---------+
| 12| Teradata|RDBMS**DB|
| 14| Snowflake|**CloudDB|
| 15| Vertica| RDBMS|
| 17|Oracle**DB| RDBMS|
| 19| MongoDB| NOSQL|
+-----+----------+---------+
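If many string columns may contain newlines, the same replacement can be applied programmatically instead of column by column; a sketch:
from pyspark.sql import functions as f

cleaned = df_multiline.select([
    f.regexp_replace(f.col(c), "\n", "**").alias(c) if t == "string" else f.col(c)
    for c, t in df_multiline.dtypes  # dtypes yields (column, type) pairs
])
cleaned.show()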
df = spark.read.csv("hdfs://localhost:9000/user/hive/warehouse/retail.db/orders")
df.show(5,truncate=False)
+-----+---------------------+-----+---------------+
|_c0 |_c1 |_c2 |_c3 |
+-----+---------------------+-----+---------------+
|68850|2014-05-25 00:00:00.0|8451 |COMPLETE |
|68851|2014-05-26 00:00:00.0|11193|PENDING_PAYMENT|
|68852|2014-05-29 00:00:00.0|4596 |CLOSED |
|68853|2014-05-31 00:00:00.0|1202 |PENDING_PAYMENT|
|68854|2014-06-01 00:00:00.0|6528 |ON_HOLD |
+-----+---------------------+-----+---------------+
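The default _c0 ... _c3 names can be replaced with toDF here as well. The names below are illustrative only, assuming a typical retail orders layout:
df = spark.read.csv("hdfs://localhost:9000/user/hive/warehouse/retail.db/orders").toDF("order_id", "order_date", "customer_id", "order_status")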
df = spark.read.csv("file:///path_to_file/data_files/csv")
df.show()
+---+-----------+-------+
|_c0| _c1| _c2|
+---+-----------+-------+
| 12| Teradata| RDBMS|
| 14| Snowflake|CloudDB|
| 15| Vertica| RDBMS|
| 17| Oracle| RDBMS|
| 19| MongoDB| NOSQL|
| 17| Oracle| RDBMS|
| 19| MongoDB| NOSQL|
| 21|SingleStore| RDBMS|
| 22| Mysql| RDBMS|
+---+-----------+-------+
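A glob pattern can also be embedded directly in the path; this reads only the .csv files from the subdirectory (spark.read.csv additionally accepts a list of paths):
df = spark.read.csv("file:///path_to_file/data_files/csv/*.csv")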
The pathGlobFilter option restricts the read to files whose names match a glob pattern, skipping everything else in the directory:
df = spark.read.options(pathGlobFilter="*.csv").csv("file:///path_to_file/data_files")
df.show()
+---+-----------+-------+
|_c0| _c1| _c2|
+---+-----------+-------+
| 12| Teradata| RDBMS|
| 14| Snowflake|CloudDB|
| 15| Vertica| RDBMS|
| 17| Oracle| RDBMS|
| 19| MongoDB| NOSQL|
| 17| Oracle| RDBMS|
| 19| MongoDB| NOSQL|
| 21|SingleStore| RDBMS|
| 22| Mysql| RDBMS|
+---+-----------+-------+
Adding recursiveFileLookup makes Spark descend into subdirectories as well:
df = spark.read.options(pathGlobFilter="*.csv", recursiveFileLookup=True).csv("file:///path_to_file/data_files")
df.show()
+---+-----------+-------+
|_c0| _c1| _c2|
+---+-----------+-------+
| 12| Teradata| RDBMS|
| 14| Snowflake|CloudDB|
| 15| Vertica| RDBMS|
| 17| Oracle| RDBMS|
| 19| MongoDB| NOSQL|
| 12| Teradata| RDBMS|
| 14| Snowflake|CloudDB|
| 15| Vertica| RDBMS|
| 17| Oracle| RDBMS|
| 19| MongoDB| NOSQL|
| 17| Oracle| RDBMS|
| 19| MongoDB| NOSQL|
| 21|SingleStore| RDBMS|
| 22| Mysql| RDBMS|
| 17| Oracle| RDBMS|
| 19| MongoDB| NOSQL|
| 21|SingleStore| RDBMS|
| 22| Mysql| RDBMS|
+---+-----------+-------+
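When a read fans out across many files, input_file_name() helps trace which file each row came from:
from pyspark.sql import functions as f

df.withColumn("source_file", f.input_file_name()).show(truncate=False)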
Finally, to enforce column names and types up front, build a schema and pass it to the reader:
from pyspark.sql.types import StructType # imported StructType
schema_def = StructType() # Created a StructType object
schema_def.add("db_id","integer",True) # Adding column 1 to StructType
schema_def.add("db_name","string",True) # Adding column 2 to StructType
schema_def.add("db_type_cd","string",True) # Adding column 3 to StructType
df_with_schema = spark.read.csv("file:///path_to_files/tutorial_file_with_header.csv", schema=schema_def, header=True)
df_with_schema.printSchema()
root
|-- db_id: integer (nullable = true)
|-- db_name: string (nullable = true)
|-- db_type_cd: string (nullable = true)
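The same schema can also be supplied as a DDL-formatted string, without building a StructType:
df_with_schema = spark.read.csv("file:///path_to_files/tutorial_file_with_header.csv", schema="db_id INT, db_name STRING, db_type_cd STRING", header=True)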