This tutorial explains how to list all the columns of a DataFrame, how to view their data types, how to print or inspect the schema, and how to define a custom schema for reading files. Each topic is explained with an example, starting from a sample CSV file read with the header option:
df = spark.read.csv("file:///path_to_files/csv_file_with_duplicates.csv", header=True)
df.show()
+-----+---------+-------+
|db_id| db_name|db_type|
+-----+---------+-------+
| 12| Teradata| RDBMS|
| 14|Snowflake|CloudDB|
| 15| Vertica| RDBMS|
| 12| Teradata| RDBMS|
| 22| Mysql| RDBMS|
+-----+---------+-------+
df.printSchema()
Output:
root
|-- db_id: string (nullable = true)
|-- db_name: string (nullable = true)
|-- db_type: string (nullable = true)
df.columns
Output: ['db_id', 'db_name', 'db_type']
column_name_list = df.columns # getting column list
column_name_list.remove('db_type') # removing unwanted column from list
df.select(column_name_list).show() # displaying data for the remaining columns
+-----+---------+
|db_id| db_name|
+-----+---------+
| 12| Teradata|
| 14|Snowflake|
| 15| Vertica|
| 12| Teradata|
| 22| Mysql|
+-----+---------+
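If only a single column needs to be excluded, the same result can be obtained without building a column list by dropping it directly. A minimal sketch (the output matches the table above):
df.drop('db_type').show() # drop() returns a new DataFrame without the given column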
df.dtypes
Output:
[('db_id', 'string'), ('db_name', 'string'), ('db_type', 'string')]
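Because dtypes returns (column name, type) pairs, it is handy for selecting columns of a particular type. A minimal sketch that keeps only the string columns (in this sample every column was read as string, so all three are kept):
string_columns = [name for name, dtype in df.dtypes if dtype == 'string'] # filter column names by type
df.select(string_columns).show()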
df.schema
Output:
StructType(List(StructField(db_id,StringType,true), StructField(db_name,StringType,true),StructField(db_type,StringType,true)))
df.schema.fields
Output:
[StructField(db_id,StringType,true), StructField(db_name,StringType,true), StructField(db_type,StringType,true)]
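A single field can also be looked up by name on the schema object, which is useful for checking one column's data type or nullability. A small sketch:
field = df.schema['db_name'] # StructType supports lookup by field name
print(field.dataType) # StringType (printed as StringType() in newer Spark versions)
print(field.nullable) # True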
df.schema.names
Output: ['db_id', 'db_name', 'db_type']
df.schema.fieldNames()
Output:
['db_id', 'db_name', 'db_type']
df.schema.jsonValue()
Output:
{'type': 'struct', 'fields': [{'name': 'db_id', 'type': 'string', 'nullable': True, 'metadata': {}}, {'name': 'db_name', 'type': 'string', 'nullable': True, 'metadata': {}}, {'name': 'db_type', 'type': 'string', 'nullable': True, 'metadata': {}}]}
df.schema.json()
Output:
'{"fields":[{"metadata":{},"name":"db_id","nullable":true,"type":"string"},{"metadata":{},"name":"db_name","nullable":true,"type":"string"},{"metadata":{},"name":"db_type","nullable":true,"type":"string"}],"type":"struct"}'
Fields can be appended to a StructType object with its add() method, whose signature is:
StructType_variable.add(field, data_type=None, nullable=True, metadata=None)
from pyspark.sql.types import StructType # import StructType to define a custom schema
schema_def = StructType() # Created a StructType object
schema_def.add("db_id","integer",True) # Adding column 1 to StructType
schema_def.add("db_name","string",True) # Adding column 2 to StructType
schema_def.add("db_type_cd","string",True) # Adding column 3 to StructType
print(schema_def)
Output:
StructType(List(StructField(db_id,IntegerType,true), StructField(db_name,StringType,true),StructField(db_type_cd,StringType,true)))
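Since add() returns the StructType it was called on, the same schema can also be built by chaining the calls, or equivalently by passing StructField objects to the constructor. A sketch of both variants:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
schema_def = StructType().add("db_id", "integer", True).add("db_name", "string", True).add("db_type_cd", "string", True) # chained add() calls
schema_def = StructType([StructField("db_id", IntegerType(), True),
                         StructField("db_name", StringType(), True),
                         StructField("db_type_cd", StringType(), True)]) # same schema from a list of StructField objects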
from pyspark.sql.types import StructType # imported StructType
schema_def = StructType() # Created a StructType object
schema_def.add("db_id","integer",True) # Adding column 1 to StructType
schema_def.add("db_name","string",True) # Adding column 2 to StructType
schema_def.add("db_type_cd","string",True) # Adding column 3 to StructType
df_with_schema = spark.read.csv("file:///path_to_files/csv_file_with_duplicates.csv", schema=schema_def, header=True)
df_with_schema.printSchema()
Output:
root
|-- db_id: integer (nullable = true)
|-- db_name: string (nullable = true)
|-- db_type_cd: string (nullable = true)
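In recent Spark versions the schema argument also accepts a DDL-formatted string, so the same file can be read without building a StructType first. A minimal sketch (the printed schema matches the one above):
ddl_schema = "db_id INT, db_name STRING, db_type_cd STRING" # DDL-formatted schema string
df_with_schema = spark.read.csv("file:///path_to_files/csv_file_with_duplicates.csv", schema=ddl_schema, header=True)
df_with_schema.printSchema()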