
PySpark: DataFrame Caching

This tutorial explains the various functions available in PySpark to cache a DataFrame and to clear the cache of an already cached DataFrame. A cache is a data storage layer (memory) in computing which stores a subset of data, so that future requests for the same data are served faster than is possible by accessing the data’s original source. Caching allows you to efficiently reuse previously fetched or processed data.

The following functions / topics are covered on this page; click an item in the list below to jump to the corresponding section. Illustrative usage sketches for these functions follow the list.

StorageLevel Function: The StorageLevel function (within the PySpark library) can be used along with the "persist" function to tell Spark how to cache data, including whether data should spill to disk when it does not completely fit into memory and whether cached data should be replicated on multiple nodes (see the first sketch after this list).
storageLevel Attribute: The storageLevel attribute can be used to get the current storage level of a DataFrame (second sketch below).
is_cached: This DataFrame attribute can be used to check whether the DataFrame is cached. The output will be True if the DataFrame is cached, else False (second sketch below).
cache: The cache function can be used to cache data with a preset storage level. The current default storage level for the cache function is "MEMORY_AND_DISK" (third sketch below).
persist: The persist function is similar to the cache function, but the user can specify the storage level by passing a StorageLevel as a parameter. If no parameter is passed, the function saves the data with the default storage level (first sketch below).
unpersist: The unpersist function can be used to clear the cache (i.e. to remove all of a DataFrame's blocks from memory and disk) of an already cached DataFrame (third sketch below).
isLocal: The isLocal function returns True if the collect() and take() methods can be run locally without any Spark executors, else it returns False (fourth sketch below).
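
First, a minimal sketch of persist with an explicit StorageLevel. The application name and the demo data here are illustrative assumptions, not part of the original tutorial:

    from pyspark.sql import SparkSession
    from pyspark import StorageLevel

    spark = SparkSession.builder.appName("caching-demo").getOrCreate()

    # Small demo DataFrame (hypothetical data, for illustration only).
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

    # MEMORY_AND_DISK: keep blocks in memory and spill whatever does
    # not fit to disk. Caching itself is lazy until an action runs.
    df.persist(StorageLevel.MEMORY_AND_DISK)

    # The "_2" variants (e.g. StorageLevel.MEMORY_AND_DISK_2) replicate
    # each cached block on two nodes.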
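
Continuing the same sketch, the storageLevel and is_cached attributes report how, and whether, the DataFrame above is cached:

    # storageLevel reports the current level; an uncached DataFrame
    # shows a level with useDisk and useMemory both False.
    print(df.storageLevel)   # e.g. StorageLevel(True, True, False, False, 1)

    # is_cached is True once cache()/persist() has been called.
    print(df.is_cached)      # True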
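
Next, a sketch of cache and unpersist, using a fresh DataFrame so the storage level assigned above does not interfere:

    # range() builds a single-column DataFrame with ids 0..9.
    df2 = spark.range(10)

    # cache() uses the preset default storage level
    # (MEMORY_AND_DISK, per this tutorial).
    df2.cache()

    # Caching is lazy: run an action so the blocks are actually stored.
    df2.count()

    # unpersist() removes all of df2's blocks from memory and disk.
    df2.unpersist()
    print(df2.is_cached)     # False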
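
Finally, a short sketch for isLocal. A DataFrame built from a local Python list is typically backed by a local relation, so it can be collected in the driver without executors; a DataFrame read from external storage normally cannot:

    # True when collect()/take() can run entirely in the driver,
    # e.g. for the small DataFrame created from a local list above.
    print(df.isLocal())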