This tutorial explains how to split a PySpark dataframe into n smaller dataframes using randomSplit(). The weight passed for each output dataframe sets its approximate share of the rows, so the resulting row counts will not match the weights exactly.
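randomSplit(weights, seed=None)
weights: list of doubles, one per dataframe to return. Each weight is the approximate fraction of rows that dataframe receives.
seed: optional seed for the random sampling.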
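randomSplit normalizes the weights when they do not sum to 1.0, so [3, 1] behaves like [0.75, 0.25]. A minimal pure-Python sketch of that normalization follows. Note that normalize_weights is a hypothetical helper written for illustration, not part of the PySpark API, and its error handling is an assumption of this sketch.

```python
def normalize_weights(weights):
    """Scale non-negative weights so they sum to 1.0 (hypothetical helper)."""
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    total = float(sum(weights))
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Divide each weight by the total to get its fraction of the rows.
    return [w / total for w in weights]


print(normalize_weights([0.5, 0.5]))  # [0.5, 0.5]
print(normalize_weights([3, 1]))      # [0.75, 0.25]
```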
deptdf = spark.read.parquet("file:///path_to_files/department.parquet")
deptdf.show()
+-------+--------------------+-----------+
|dept_no|     department_name|   loc_name|
+-------+--------------------+-----------+
|    400|INFORMATION TECHN...|BHUBANESWAR|
|    300|               SALES|  BENGALURU|
|    100|            ACCOUNTS|     JAIPUR|
|    200|               R & D|  NEW DELHI|
+-------+--------------------+-----------+
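Without a seed, each run can return a different split, and the split sizes only approximate the weights.
Run 1: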
======
a, b = deptdf.randomSplit([0.5, 0.5])
a.show()
+-------+---------------+---------+
|dept_no|department_name| loc_name|
+-------+---------------+---------+
|    300|          SALES|BENGALURU|
|    100|       ACCOUNTS|   JAIPUR|
|    200|          R & D|NEW DELHI|
+-------+---------------+---------+
b.show()
+-------+--------------------+-----------+
|dept_no|     department_name|   loc_name|
+-------+--------------------+-----------+
|    400|INFORMATION TECHN...|BHUBANESWAR|
+-------+--------------------+-----------+
Run 2:
======
a, b = deptdf.randomSplit([0.5, 0.5])
a.show()
+-------+---------------+--------+
|dept_no|department_name|loc_name|
+-------+---------------+--------+
|    100|       ACCOUNTS|  JAIPUR|
+-------+---------------+--------+
b.show()
+-------+--------------------+-----------+
|dept_no|     department_name|   loc_name|
+-------+--------------------+-----------+
|    400|INFORMATION TECHN...|BHUBANESWAR|
|    300|               SALES|  BENGALURU|
|    200|               R & D|  NEW DELHI|
+-------+--------------------+-----------+
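The two unseeded runs return different splits, and neither divides the four rows exactly 2/2. One way to picture this is a toy model in which every row independently draws a random number and lands in the split whose cumulative weight range contains it. The sketch below is pure Python and is a simplified illustration, not Spark's actual implementation. The name toy_random_split is hypothetical.

```python
import random
from itertools import accumulate


def toy_random_split(rows, weights, seed=None):
    """Toy model of a weighted random split (not Spark's implementation)."""
    total = float(sum(weights))
    # Cumulative upper bounds, e.g. [0.5, 0.5] -> [0.5, 1.0]
    bounds = list(accumulate(w / total for w in weights))
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        x = rng.random()  # uniform draw in [0, 1)
        for i, upper in enumerate(bounds):
            # The last split also catches any floating-point shortfall.
            if x < upper or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


dept_nos = [400, 300, 100, 200]
for run in (1, 2):
    a, b = toy_random_split(dept_nos, [0.5, 0.5])
    print(f"Run {run}: a={a} b={b}")
```

Because each row is assigned independently, a 50/50 weighting on four rows can easily give a 3/1 or 1/3 split. The proportions only approach the weights as the number of rows grows.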
With a seed (1 in this example), every run returns the same split.
Run 1:
======
a, b = deptdf.randomSplit([0.5, 0.5], 1)
b.show()
+-------+--------------------+-----------+
|dept_no|     department_name|   loc_name|
+-------+--------------------+-----------+
|    400|INFORMATION TECHN...|BHUBANESWAR|
|    100|            ACCOUNTS|     JAIPUR|
+-------+--------------------+-----------+
a.show()
+-------+---------------+---------+
|dept_no|department_name| loc_name|
+-------+---------------+---------+
|    300|          SALES|BENGALURU|
|    200|          R & D|NEW DELHI|
+-------+---------------+---------+
Run 2:
======
a, b = deptdf.randomSplit([0.5, 0.5], 1)
b.show()
+-------+--------------------+-----------+
|dept_no|     department_name|   loc_name|
+-------+--------------------+-----------+
|    400|INFORMATION TECHN...|BHUBANESWAR|
|    100|            ACCOUNTS|     JAIPUR|
+-------+--------------------+-----------+
a.show()
+-------+---------------+---------+
|dept_no|department_name| loc_name|
+-------+---------------+---------+
|    300|          SALES|BENGALURU|
|    200|          R & D|NEW DELHI|
+-------+---------------+---------+
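Both seeded runs return identical splits because a fixed seed makes the sequence of random draws repeatable. The same effect can be seen with Python's standard random module, used here only as an analogy for Spark's seeded sampling. The helper name draws is hypothetical.

```python
import random


def draws(seed, n=4):
    """Return n uniform draws from a generator initialised with seed."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


# Same seed -> identical draws, so each row would land in the same split.
print(draws(1) == draws(1))  # True

# No seed -> the generator is seeded from system entropy, so the draws
# will almost certainly differ from one call to the next.
print(draws(None) == draws(None))
```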