This tutorial explains some of the common operations that can be performed on a DataFrame, such as counting rows, counting distinct values, testing whether a DataFrame is empty, restricting the number of rows, and printing the schema. Each operation is demonstrated below with an example. First, read a sample CSV file that contains a duplicate row:
df = spark.read.csv("file:///path_to_files/csv_file_with_duplicates.csv", header=True)
df.show()
+-----+---------+-------+
|db_id| db_name|db_type|
+-----+---------+-------+
| 12| Teradata| RDBMS|
| 14|Snowflake|CloudDB|
| 15| Vertica| RDBMS|
| 12| Teradata| RDBMS|
| 22| Mysql| RDBMS|
+-----+---------+-------+
df.count()  # total number of rows, duplicates included
Output: 5
df.select("db_type").count()  # selecting a column does not change the row count
Output: 5
df.distinct().count()  # duplicate rows are counted only once
Output: 4
df.select("db_type").distinct().count()  # distinct values in a single column
Output: 2
df.rdd.isEmpty()  # True only when the DataFrame has no rows
Output: False
df_empty = df.filter("1==2")  # creating an empty DataFrame
df_empty.rdd.isEmpty()
Output: True
df.count() == 0  # same check via count(), but this scans every row
Output: False
df_empty = df.filter("1==2")  # creating an empty DataFrame
df_empty.count() == 0
Output: True
df.select("db_type").distinct().count() == 0  # works on derived DataFrames too
Output: False
df.count()
Output: 5
df.limit(2).count()  # limit(n) restricts the DataFrame to at most n rows
Output: 2
df.printSchema()  # every column is string because inferSchema was not enabled on read
Output:
root
|-- db_id: string (nullable = true)
|-- db_name: string (nullable = true)
|-- db_type: string (nullable = true)