
PySpark: Dataframe Schema

This tutorial will explain how to list a dataframe's columns and data types, print its schema, and create a new schema for reading files. The topics listed below are explained with examples:

Schema of a dataframe in tree format: the printSchema() function prints a dataframe's schema to the console as a tree (see the first sketch after this list).
List all columns: the columns attribute returns all of a dataframe's column names as a list.
Column names and datatypes of a dataframe: the dtypes attribute returns the column names and their data types as a list of tuples.
Schema of a dataframe: PySpark stores a dataframe's schema as a StructType object. The schema attribute returns it as an instance of "pyspark.sql.types.StructType", so StructType functions can be called on the output of df.schema.
Creating a new Schema: the add() function on a StructType variable appends new fields / columns to build a new schema (see the second sketch after this list). add() takes up to 4 parameters; the last 3 are optional.
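A minimal sketch of the inspection options above, assuming a local SparkSession and a small hypothetical dataframe:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("schema-demo").getOrCreate()

    # Hypothetical sample data; any dataframe behaves the same way.
    df = spark.createDataFrame(
        [(1, "Alice", 4500.0), (2, "Bob", 5200.0)],
        ["id", "name", "salary"],
    )

    df.printSchema()    # schema printed to the console as a tree
    print(df.columns)   # ['id', 'name', 'salary']
    print(df.dtypes)    # [('id', 'bigint'), ('name', 'string'), ('salary', 'double')]
    print(df.schema)    # the underlying StructType object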
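A sketch of building a schema with add(); the field names here are hypothetical. add() returns the StructType itself, so calls can be chained, and its 4 parameters are the field name, data type, nullable flag (default True), and metadata (default None):

    from pyspark.sql.types import StructType, StringType, IntegerType, DoubleType

    custom_schema = (
        StructType()
        .add("id", IntegerType(), False)   # not nullable
        .add("name", StringType())         # nullable defaults to True
        .add("salary", DoubleType())
    )
    print(custom_schema)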

Read a file with custom Schema: a custom schema created with a StructType object can be used when reading files, as sketched below. Note that when a custom schema is supplied, it overrides the schema inferred from the file.
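A minimal sketch of reading a CSV file with a custom schema; the file path and field names are hypothetical:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StringType, IntegerType, DoubleType

    spark = SparkSession.builder.appName("schema-demo").getOrCreate()

    custom_schema = (
        StructType()
        .add("id", IntegerType(), False)
        .add("name", StringType())
        .add("salary", DoubleType())
    )

    # Passing .schema() skips inference and overrides the file's own schema.
    df = (
        spark.read
        .option("header", "true")
        .schema(custom_schema)
        .csv("/tmp/employees.csv")   # hypothetical path
    )
    df.printSchema()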