This tutorial explains the various functions available in PySpark to cache a DataFrame and to clear the cache of an already cached DataFrame. A cache is a data storage layer that stores a subset of data so that future requests for that data are served faster than is possible by accessing the data's original source. Caching allows you to efficiently reuse previously fetched or processed data.
The following functions and properties are covered on this page: StorageLevel, storageLevel, is_cached, cache(), persist(), and unpersist().
# Sample DataFrame used throughout this tutorial
df = spark.read.csv("file:///path_to_files/csv_with_duplicates_and_nulls.csv", header=True)
df.show()
+-----+---------+-------+
|db_id| db_name|db_type|
+-----+---------+-------+
| 12| Teradata| RDBMS|
| 14|Snowflake|CloudDB|
| 15| Vertica| RDBMS|
| 12| Teradata| RDBMS|
| 22| Mysql| null|
| 50|Snowflake| RDBMS|
| 51| null|CloudDB|
+-----+---------+-------+
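If the CSV file is not available, an equivalent DataFrame can be built inline so the examples can be followed along; a minimal sketch (all columns come out as strings, matching a CSV read without schema inference):

# Inline equivalent of the sample CSV above
data = [("12", "Teradata", "RDBMS"), ("14", "Snowflake", "CloudDB"),
        ("15", "Vertica", "RDBMS"), ("12", "Teradata", "RDBMS"),
        ("22", "Mysql", None), ("50", "Snowflake", "RDBMS"),
        ("51", None, "CloudDB")]
df = spark.createDataFrame(data, ["db_id", "db_name", "db_type"])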
PySpark exposes the StorageLevel class to control how and where a DataFrame is persisted. Its constructor takes flags for the storage medium, the storage format, and the replication factor:

StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication=1)

The class also defines static constants for the commonly used combinations:
Static Constant    | Equivalent StorageLevel Value
DISK_ONLY          | StorageLevel(True, False, False, False, 1)
DISK_ONLY_2        | StorageLevel(True, False, False, False, 2)
DISK_ONLY_3        | StorageLevel(True, False, False, False, 3)
MEMORY_AND_DISK    | StorageLevel(True, True, False, False, 1)
MEMORY_AND_DISK_2  | StorageLevel(True, True, False, False, 2)
MEMORY_ONLY        | StorageLevel(False, True, False, False, 1)
MEMORY_ONLY_2      | StorageLevel(False, True, False, False, 2)
OFF_HEAP           | StorageLevel(True, True, True, False, 1)
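The static constants are simply predefined instances of the StorageLevel class, so a custom level and the matching constant are interchangeable. A small sketch inspecting the flags:

import pyspark

# Constructor arguments map to (useDisk, useMemory, useOffHeap, deserialized, replication)
custom_level = pyspark.StorageLevel(True, False, False, False, 1)
custom_level.useDisk    # True
custom_level.useMemory  # False

# DISK_ONLY is a predefined instance with the same flags
pyspark.StorageLevel.DISK_ONLY.useDisk  # True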
The storageLevel property returns the DataFrame's current storage level. Each section below assumes a fresh, uncached DataFrame:

df.storageLevel  # not cached yet: all flags are off
Output: StorageLevel(False, False, False, False, 1)

df.cache()  # cache the DataFrame (covered below)
df.storageLevel  # the storage level now reflects the cache
Output: StorageLevel(True, True, False, True, 1)
The is_cached property indicates whether the DataFrame is cached:

df.is_cached  # False before the DataFrame is cached
Output: False

df.cache()
df.is_cached  # True once the DataFrame is cached
Output: True
The cache() function caches the DataFrame with the default storage level:

df.cache()
df.is_cached  # DataFrame is cached
Output: True
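Note that cache() (and persist()) is lazy: the storage level is assigned immediately, but the data is only materialized in memory when an action runs against the DataFrame:

df.cache()   # marks the DataFrame for caching; nothing is stored yet
df.count()   # an action materializes the cache
df.show()    # subsequent actions read from the cached data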
Calling persist() without arguments caches the DataFrame with the default storage level:

df.persist()
df.storageLevel
Output: StorageLevel(True, True, False, False, 1)
persist() also accepts an explicit StorageLevel. Here the DataFrame is persisted in memory only:

import pyspark
df.persist(pyspark.StorageLevel(False, True, False, False, 1))
df.storageLevel
Output: StorageLevel(False, True, False, False, 1)
The same storage level can be requested with the equivalent static constant:

import pyspark
df.persist(pyspark.StorageLevel.MEMORY_ONLY)
df.storageLevel
Output: StorageLevel(False, True, False, False, 1)
Any combination of flags works, for example memory and disk with deserialized storage:

import pyspark
df.persist(pyspark.StorageLevel(True, True, False, True, 1))
df.storageLevel
Output: StorageLevel(True, True, False, True, 1)
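One caveat: if the DataFrame is already persisted, Spark keeps the existing storage level on a subsequent persist() call and logs a warning (along the lines of "Asked to cache already cached data") rather than switching levels. To move between the levels shown above, clear the cache first:

df.unpersist()  # drop the existing cached level first
df.persist(pyspark.StorageLevel.MEMORY_ONLY)  # then persist with the new level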
The unpersist() function removes the DataFrame from the cache and returns the DataFrame itself:

df.unpersist()
DataFrame[db_id: string, db_name: string, db_type: string]
df.is_cached  # DataFrame cache is cleared
Output: False
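unpersist() also takes an optional blocking argument; passing blocking=True makes the call wait until all cached blocks have been removed instead of returning immediately:

df.unpersist(blocking=True)  # block until the cached data is fully removed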