PySpark: Dataframe Split

Pradeep


This tutorial explains the function available in PySpark to split a dataframe into n smaller dataframes, based on the approximate weights (fractions) passed through the appropriate parameter.

Due to the random nature of the randomSplit() transformation, Spark does not guarantee that each split will contain exactly the specified fraction (weight) of the total number of rows in a given dataframe.
Splitting the dataframe does not trigger a shuffle, i.e. the number of partitions in each resulting dataframe will be the same as in the original dataframe.
There have been reports of inconsistent behaviour of randomSplit() when the source dataframe is non-deterministic and gets recomputed for each split; more detail is available on this page. One of the following solutions can be used to avoid this problem:
1. Cache the dataframe before splitting the dataframe.
2. Repartition the dataframe by column(s).
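
The first workaround can be sketched as a small helper. This is a minimal sketch under assumptions, not an official recipe: the name `split_cached` is hypothetical, and `df` is assumed to be a PySpark dataframe. Calling `count()` right after `cache()` is one common way to materialize the cache before the splits are evaluated.

```python
def split_cached(df, weights, seed=None):
    """Cache `df`, materialize the cache, then split it.

    `split_cached` is a hypothetical helper name, not a PySpark API.
    `df` is assumed to be a pyspark.sql.DataFrame.
    """
    cached = df.cache()  # mark the dataframe for caching (lazy, nothing runs yet)
    cached.count()       # run an action so the rows are actually cached
    # Each split now reads the cached rows instead of recomputing `df`.
    return cached.randomSplit(weights, seed)
```

With the sample data used later in this tutorial, it could be called as `a, b = split_cached(deptdf, [0.5, 0.5], seed=1)`. Caching is best-effort: if cached partitions are evicted they may be recomputed, so this should reduce the risk rather than be treated as a strict guarantee.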
Syntax: This function takes 2 parameters; the 1st parameter is mandatory and the 2nd parameter is optional.
```
randomSplit(weights, seed=None)
```
- The first parameter (weights) takes a list of non-negative decimal values (typically between 0.0 and 1.0) used to split the dataframe into smaller dataframes. The number of split dataframes equals the number of elements in the weights list. If the weights do not sum to 1.0, they are normalized; you can visit this page to see how normalization works.
- The second parameter (seed) takes a numeric value as input. With the same seed, the split returns the same output on each run (provided the source dataframe itself is deterministic); without a seed, the output is random.
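
To make the normalization concrete, the sketch below reproduces the arithmetic in plain Python: each weight is divided by the sum of all weights, and the running totals give the lower and upper bound of each split's share. The function name `normalized_bounds` is hypothetical, and this illustrates the calculation only; it is not Spark's own code.

```python
from itertools import accumulate


def normalized_bounds(weights):
    """Return one (lower, upper) fraction range per weight.

    Hypothetical helper for illustration; not part of PySpark.
    """
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    fractions = [w / total for w in weights]        # normalize so they sum to 1.0
    cumulative = [0.0] + list(accumulate(fractions))  # running totals
    return list(zip(cumulative[:-1], cumulative[1:]))


print(normalized_bounds([0.5, 0.5]))  # [(0.0, 0.5), (0.5, 1.0)]
print(normalized_bounds([2, 6]))      # [(0.0, 0.25), (0.25, 1.0)]
```

In the second call, the weights 2 and 6 sum to 8, so they are treated as 0.25 and 0.75.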
Sample Data: The dataset used in the examples below can be downloaded from here.
```
deptdf = spark.read.parquet("file:///path_to_files/department.parquet")

deptdf.show()
+-------+--------------------+-----------+
|dept_no|     department_name|   loc_name|
+-------+--------------------+-----------+
|    400|INFORMATION TECHN...|BHUBANESWAR|
|    300|               SALES|  BENGALURU|
|    100|            ACCOUNTS|     JAIPUR|
|    200|               R & D|  NEW DELHI|
+-------+--------------------+-----------+
```

➠ Example 1: Two dataframes are returned in the example below because two decimal values were passed in the list for the 'weights' parameter. Since the seed parameter is not used, each run may give different split dataframes.

```
Run 1:
======

a, b = deptdf.randomSplit([0.5, 0.5])

a.show()
+-------+---------------+---------+
|dept_no|department_name| loc_name|
+-------+---------------+---------+
|    300|          SALES|BENGALURU|
|    100|       ACCOUNTS|   JAIPUR|
|    200|          R & D|NEW DELHI|
+-------+---------------+---------+

b.show()
+-------+--------------------+-----------+
|dept_no|     department_name|   loc_name|
+-------+--------------------+-----------+
|    400|INFORMATION TECHN...|BHUBANESWAR|
+-------+--------------------+-----------+


Run 2:
======

a, b = deptdf.randomSplit([0.5, 0.5])

a.show()
+-------+---------------+--------+
|dept_no|department_name|loc_name|
+-------+---------------+--------+
|    100|       ACCOUNTS|  JAIPUR|
+-------+---------------+--------+

b.show()
+-------+--------------------+-----------+
|dept_no|     department_name|   loc_name|
+-------+--------------------+-----------+
|    400|INFORMATION TECHN...|BHUBANESWAR|
|    300|               SALES|  BENGALURU|
|    200|               R & D|  NEW DELHI|
+-------+--------------------+-----------+
```

➠ Example 2: With the same seed value, Spark will return the same split dataframes on every run.

```
Run 1:
======

a, b = deptdf.randomSplit([0.5, 0.5], seed=1)

b.show()
+-------+--------------------+-----------+
|dept_no|     department_name|   loc_name|
+-------+--------------------+-----------+
|    400|INFORMATION TECHN...|BHUBANESWAR|
|    100|            ACCOUNTS|     JAIPUR|
+-------+--------------------+-----------+

a.show()
+-------+---------------+---------+
|dept_no|department_name| loc_name|
+-------+---------------+---------+
|    300|          SALES|BENGALURU|
|    200|          R & D|NEW DELHI|
+-------+---------------+---------+


Run 2:
======

a, b = deptdf.randomSplit([0.5, 0.5], seed=1)

b.show()
+-------+--------------------+-----------+
|dept_no|     department_name|   loc_name|
+-------+--------------------+-----------+
|    400|INFORMATION TECHN...|BHUBANESWAR|
|    100|            ACCOUNTS|     JAIPUR|
+-------+--------------------+-----------+

a.show()
+-------+---------------+---------+
|dept_no|department_name| loc_name|
+-------+---------------+---------+
|    300|          SALES|BENGALURU|
|    200|          R & D|NEW DELHI|
+-------+---------------+---------+
```
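
The behaviour seen in both examples (split sizes that only approximate the weights, and identical results when the seed is fixed) can be reproduced with a simplified plain-Python model. This is an assumption-laden illustration, not Spark's implementation: the name `model_random_split` is hypothetical, it uses Python's `random` module rather than Spark's sampler, and its row assignments will not match what Spark returns for the same seed.

```python
import random


def model_random_split(rows, weights, seed=None):
    """Simplified model: one uniform draw per row decides which split gets it.

    Hypothetical helper for illustration; not part of PySpark.
    """
    rng = random.Random(seed)  # same seed -> same sequence of draws
    total = sum(weights)

    # Upper bound of each split's share after normalizing the weights.
    upper_bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        upper_bounds.append(running)

    splits = [[] for _ in weights]
    last = len(upper_bounds) - 1
    for row in rows:
        x = rng.random()  # uniform draw in [0.0, 1.0)
        for i, upper in enumerate(upper_bounds):
            # The last split also catches any floating-point rounding gap.
            if x < upper or i == last:
                splits[i].append(row)
                break
    return splits


dept_nos = [400, 300, 100, 200]
run_1 = model_random_split(dept_nos, [0.5, 0.5], seed=1)
run_2 = model_random_split(dept_nos, [0.5, 0.5], seed=1)
print(run_1 == run_2)  # True: a fixed seed repeats the same split
```

Because every row is assigned independently, a 0.5/0.5 split of four rows can come out as 3 and 1 or even 4 and 0; the weights describe the expected share, not an exact count.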


dbmstutorials.com
