
PySpark: Dataframe Options

This tutorial will explain and list multiple attributes that can be used within the option/options functions to define how a read operation should behave and how the contents of a data source should be interpreted. Most of the attributes listed below can be used with either function. In option(), the attribute name and value are passed as string arguments, whereas in options() the attributes are passed as keyword arguments. This difference is illustrated below and in the example section.
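
A minimal sketch of the two calling styles, assuming a hypothetical file data.csv; later snippets on this page reuse the spark session created here:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("options-demo").getOrCreate()

    # option(): key and value are passed as two string arguments, one pair per call
    df = spark.read.option("header", "true").option("delimiter", ",").csv("data.csv")

    # options(): the same attributes are passed as keyword arguments in a single call
    df = spark.read.options(header="true", delimiter=",").csv("data.csv")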

Some of the attributes listed below are explained with examples at the bottom of the page.


Generic and file-specific options



JDBC-specific options



Kafka-specific options



header example: Spark provides a way to read/write header columns as column names from/into a file using the option() and options() functions. The file used in this example can be downloaded from here.
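
A minimal sketch, assuming a hypothetical CSV file data_files/sample.csv:

    # Read the first line of the file as column names
    df = spark.read.option("header", "true").csv("data_files/sample.csv")

    # Write the dataframe back out with a header row
    df.write.option("header", "true").csv("data_files/output")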

delimiter example: This attribute can be used to specify a single character or multiple characters as the separator for each column/field while reading or writing a file, using either the option() or options() function. The file used in this example can be downloaded from here.
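
A minimal sketch, assuming a hypothetical pipe-delimited file data_files/sample_pipe.txt:

    # "delimiter" (synonym: "sep") sets the field separator; here it is a pipe
    df = spark.read.options(header="true", delimiter="|").csv("data_files/sample_pipe.txt")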

lineSep example: This attribute can be used to specify a single character as the separator for each row while reading or writing a file, using either the option() or options() function. Tab is used as the line separator in this example. The file used in this example can be downloaded from here.
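
A minimal sketch, assuming a hypothetical file data_files/sample_tab_rows.csv whose rows are separated by tabs:

    # Treat the tab character as the end-of-row marker instead of a newline
    df = spark.read.option("lineSep", "\t").csv("data_files/sample_tab_rows.csv")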

pathGlobFilter example: This attribute can be used to define a glob pattern so that only files whose names match the pattern are read. The example below will read all the JSON files within the data_files folder. The file used in this example can be downloaded from here.
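
A minimal sketch, assuming a hypothetical data_files folder:

    # Read only the files whose names match the glob pattern *.json
    df = spark.read.option("pathGlobFilter", "*.json").json("data_files")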

recursiveFileLookup example: This attribute can be used to recursively scan a directory to read files. The example below will read all the JSON files from all the sub-directories within the data_files folder. The file used in this example can be downloaded from here.
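
A minimal sketch, using the same hypothetical data_files folder:

    # Scan data_files and all of its sub-directories for JSON files
    df = (spark.read
          .option("recursiveFileLookup", "true")
          .option("pathGlobFilter", "*.json")
          .json("data_files"))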

codec example: This attribute can be used to compress CSV or other delimited files using the specified compression method. It works only on CSV or other plain delimited files. Spark can read gzip files without a codec being specified, but for writing, the gzip codec must be specified. compression is a synonym for codec.
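
A minimal sketch, assuming a dataframe df and a hypothetical output folder:

    # Write gzip-compressed CSV files; option("compression", "gzip") is equivalent
    df.write.option("codec", "gzip").csv("data_files/output_gz")

    # gzip files can be read back without specifying any codec
    df2 = spark.read.csv("data_files/output_gz")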

quoteAll example: This attribute can be used to specify whether or not to quote all fields/columns while writing into a file.
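
A minimal sketch, assuming a dataframe df and a hypothetical output folder:

    # Wrap every field in quotes, not just those containing the delimiter
    df.write.options(header="true", quoteAll="true").csv("data_files/output_quoted")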

quote example: This attribute can be used to set the quote character for fields/columns in which the delimiter/separator can be part of the value. This character is used to quote all fields when combined with the quoteAll option. The default value of this option is the double quote (").
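
A minimal sketch, assuming a dataframe df and a hypothetical output folder:

    # Use a single quote as the quoting character instead of the default double quote
    df.write.options(quote="'", quoteAll="true").csv("data_files/output")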

modifiedBefore example: This attribute can be used to read only those files that were modified before the specified timestamp.
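
A minimal sketch, assuming a hypothetical data_files folder; the timestamp is given as YYYY-MM-DDTHH:mm:ss:

    # Read only files last modified before 1 Jan 2021
    df = spark.read.option("modifiedBefore", "2021-01-01T00:00:00").csv("data_files")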

modifiedAfter example: This attribute can be used to read only those files that were modified after the specified timestamp.
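
A minimal sketch, using the same assumptions:

    # Read only files last modified after 1 Jan 2021
    df = spark.read.option("modifiedAfter", "2021-01-01T00:00:00").csv("data_files")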

nullValue example: This attribute can be used to replace null values with the specified string while reading and writing a dataframe. In the example below, null values in the dataframe will be replaced with ** before being written to the file.
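
A minimal sketch, assuming a dataframe df containing nulls and a hypothetical output folder:

    # Nulls are written out as **; on read, ** would be parsed back into null
    df.write.option("nullValue", "**").csv("data_files/output")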

multiLine example: This attribute can be used to parse records that span multiple lines. For JSON files, it parses one record per file, which may span multiple lines. This attribute will also work on CSV or delimited files, where quoted values may contain line breaks. The file used in this example can be downloaded from here.
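
A minimal sketch, assuming a hypothetical file data_files/sample_multiline.json:

    # Parse a JSON record that spans multiple lines within the file
    df = spark.read.option("multiLine", "true").json("data_files/sample_multiline.json")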

path example: This attribute can also be used to pass the file or directory path from which a file needs to be read or to which it needs to be written. The file used in this example can be downloaded from here.
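
A minimal sketch, using the same hypothetical file data_files/sample.csv:

    # Pass the input path via the "path" option instead of as an argument to load()
    df = (spark.read.format("csv")
          .option("header", "true")
          .option("path", "data_files/sample.csv")
          .load())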

JDBC attributes example: These are examples of reading from a database over a JDBC connection. Full table data is fetched using "dbtable" in the first example; a custom query is run using "query" in the second example.
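
A minimal sketch, assuming a hypothetical MySQL database testdb with an employees table; the URL, driver, and credentials here are placeholders:

    # Example 1: fetch a full table with "dbtable"
    df = (spark.read.format("jdbc")
          .option("url", "jdbc:mysql://localhost:3306/testdb")
          .option("driver", "com.mysql.cj.jdbc.Driver")
          .option("dbtable", "employees")
          .option("user", "test_user")
          .option("password", "test_password")
          .load())

    # Example 2: fetch the result of a custom query with "query" instead of "dbtable"
    df2 = (spark.read.format("jdbc")
           .option("url", "jdbc:mysql://localhost:3306/testdb")
           .option("driver", "com.mysql.cj.jdbc.Driver")
           .option("query", "SELECT id, name FROM employees WHERE id < 100")
           .option("user", "test_user")
           .option("password", "test_password")
           .load())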

Kafka attributes example: These are examples of reading data from Kafka topics. In the startingOffsets and endingOffsets JSON, an offset of -2 refers to earliest and -1 to latest.
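
A minimal sketch, assuming a hypothetical topic test_topic on a local broker:

    # Batch-read partition 0 of the topic from earliest (-2) to latest (-1)
    df = (spark.read.format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "test_topic")
          .option("startingOffsets", '{"test_topic":{"0":-2}}')
          .option("endingOffsets", '{"test_topic":{"0":-1}}')
          .load())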