This tutorial explains some of the common operations that can be performed on a DataFrame, such as counting rows, counting distinct values, testing whether a DataFrame is empty, restricting the number of rows, and printing the schema. Each operation is demonstrated below with an example. First, read a sample CSV file that contains a duplicate row:
df = spark.read.csv("file:///path_to_files/csv_file_with_duplicates.csv", header=True)
df.show()
+-----+---------+-------+
|db_id| db_name|db_type|
+-----+---------+-------+
| 12| Teradata| RDBMS|
| 14|Snowflake|CloudDB|
| 15| Vertica| RDBMS|
| 12| Teradata| RDBMS|
| 22| Mysql| RDBMS|
+-----+---------+-------+
df.count()  # total number of rows, duplicates included
Output: 5
df.select("db_type").count()  # selecting a column does not change the row count
Output: 5
df.distinct().count()  # duplicate rows are counted only once
Output: 4
df.select("db_type").distinct().count()  # distinct values in a single column
Output: 2
df.rdd.isEmpty()  # True only when the DataFrame has no rows
Output: False
df_empty = df.filter("1==2")  # creating an empty DataFrame
df_empty.rdd.isEmpty()
Output: True
df.count() == 0  # same check via count(), but this scans every row
Output: False
df_empty = df.filter("1==2")  # creating an empty DataFrame
df_empty.count() == 0
Output: True
df.select("db_type").distinct().count() == 0  # works on derived DataFrames too
Output: False
df.count()
Output: 5
df.limit(2).count()  # limit(n) restricts the DataFrame to at most n rows
Output: 2
df.printSchema()  # every column is string because inferSchema was not enabled on read
Output:
root
|-- db_id: string (nullable = true)
|-- db_name: string (nullable = true)
|-- db_type: string (nullable = true)