
PySpark: DataFrame Caching

This tutorial explains the various functions available in PySpark to cache a DataFrame and to clear the cache of an already cached DataFrame. A cache is a data storage layer (memory) in computing which stores a subset of data, so that future requests for the same data are served faster than is possible by accessing the data’s original source. Caching allows you to efficiently reuse previously fetched or processed data.

The following functions / topics are covered on this page; click an item in the list below to jump to the corresponding section. Illustrative usage sketches for these functions follow the list.

StorageLevel Function: The StorageLevel function (within the PySpark library) can be used along with the "persist" function to tell Spark how to cache data, including whether data should spill to disk when it does not completely fit into memory and whether cached data should be replicated on multiple nodes (see the first sketch after this list).
storageLevel Attribute: The storageLevel attribute can be used to get the current storage level of a DataFrame (second sketch below).
is_cached: This DataFrame attribute can be used to check whether the DataFrame is cached. The output will be True if the DataFrame is cached, else False (second sketch below).
cache: The cache function can be used to cache data with a preset storage level. The current default storage level for the cache function is "MEMORY_AND_DISK" (third sketch below).
persist: The persist function is similar to the cache function, but the user can specify the storage level by passing a StorageLevel as a parameter. If no parameter is passed, the function saves the data with the default storage level (first sketch below).
unpersist: The unpersist function can be used to clear the cache (i.e. to remove all of a DataFrame's blocks from memory and disk) of an already cached DataFrame (third sketch below).
isLocal: The isLocal function returns True if the collect() and take() methods can be run locally without any Spark executors, else it returns False (fourth sketch below).
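
First, a minimal sketch of persist with an explicit StorageLevel. The application name and the demo data here are illustrative assumptions, not part of the original tutorial:

    from pyspark.sql import SparkSession
    from pyspark import StorageLevel

    spark = SparkSession.builder.appName("caching-demo").getOrCreate()

    # Small demo DataFrame (hypothetical data, for illustration only).
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

    # MEMORY_AND_DISK: keep blocks in memory and spill whatever does
    # not fit to disk. Caching itself is lazy until an action runs.
    df.persist(StorageLevel.MEMORY_AND_DISK)

    # The "_2" variants (e.g. StorageLevel.MEMORY_AND_DISK_2) replicate
    # each cached block on two nodes.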
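
Continuing the same sketch, the storageLevel and is_cached attributes report how, and whether, the DataFrame above is cached:

    # storageLevel reports the current level; an uncached DataFrame
    # shows a level with useDisk and useMemory both False.
    print(df.storageLevel)   # e.g. StorageLevel(True, True, False, False, 1)

    # is_cached is True once cache()/persist() has been called.
    print(df.is_cached)      # True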
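
Next, a sketch of cache and unpersist, using a fresh DataFrame so the storage level assigned above does not interfere:

    # range() builds a single-column DataFrame with ids 0..9.
    df2 = spark.range(10)

    # cache() uses the preset default storage level
    # (MEMORY_AND_DISK, per this tutorial).
    df2.cache()

    # Caching is lazy: run an action so the blocks are actually stored.
    df2.count()

    # unpersist() removes all of df2's blocks from memory and disk.
    df2.unpersist()
    print(df2.is_cached)     # False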
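
Finally, a short sketch for isLocal. A DataFrame built from a local Python list is typically backed by a local relation, so it can be collected in the driver without executors; a DataFrame read from external storage normally cannot:

    # True when collect()/take() can run entirely in the driver,
    # e.g. for the small DataFrame created from a local list above.
    print(df.isLocal())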