Spark: Create RDDs

Spark's Resilient Distributed Dataset (RDD) is a fault-tolerant collection of elements that can be processed in parallel. An RDD can be processed in parallel because its elements are stored across multiple partitions, and those partitions can be operated on independently.


There are three ways to create an RDD in Spark; the first two use the SparkContext (sc):
  1. Parallelize an existing Scala collection using the 'parallelize' function, as shown in the first sketch after this list.
    sc.parallelize(l)
    
  2. Reference a dataset on external storage (such as HDFS, the local file system, S3, HBase, etc.) using functions like 'textFile' or 'sequenceFile', as shown in the second sketch after this list.
    • Syntax 1: Without specifying the number of partitions while reading the file
      sc.textFile(path_to_file_or_directory)
      
    • Syntax 2: Specifying the minimum number of partitions while reading the file
      sc.textFile(path_to_file_or_directory, number_of_partitions)
      
  3. Create an RDD from an existing DataFrame via its 'rdd' method, as shown in the third sketch after this list.
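
For the first approach, a minimal Scala sketch might look like the following; the app name, collection contents, and partition count are illustrative, not taken from any particular dataset:

  import org.apache.spark.{SparkConf, SparkContext}

  // Illustrative setup: a local SparkContext named sc (spark-shell already provides one)
  val conf = new SparkConf().setAppName("rdd-parallelize").setMaster("local[*]")
  val sc = new SparkContext(conf)

  // Turn an in-memory Scala collection into an RDD
  val nums = List(1, 2, 3, 4, 5)
  val numsRdd  = sc.parallelize(nums)        // let Spark choose the number of partitions
  val numsRdd4 = sc.parallelize(nums, 4)     // ask for 4 partitions explicitly

  println(numsRdd.count())                   // 5
  println(numsRdd4.getNumPartitions)         // 4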
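
A similar sketch for the second approach, reading an external dataset; the file path is a placeholder, and the second argument to 'textFile' is a minimum number of partitions:

  // Assumes the SparkContext sc from the previous sketch (or the one spark-shell provides).
  // The path is a placeholder; local, hdfs:// and s3a:// paths all work the same way.
  val lines  = sc.textFile("data/input.txt")      // default minimum partitions
  val lines8 = sc.textFile("data/input.txt", 8)   // request at least 8 partitions

  println(lines.getNumPartitions)
  println(lines8.take(2).mkString("\n"))          // peek at the first two lines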
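
And a sketch of the third approach, assuming a SparkSession; the DataFrame contents and column names are made up for illustration:

  import org.apache.spark.sql.{Row, SparkSession}
  import org.apache.spark.rdd.RDD

  // Illustrative setup: a SparkSession (reuses an existing SparkContext if one is running)
  val spark = SparkSession.builder().appName("df-to-rdd").master("local[*]").getOrCreate()
  import spark.implicits._

  // Build a small DataFrame, then expose its underlying RDD via the .rdd method
  val df = Seq(("alice", 30), ("bob", 25)).toDF("name", "age")
  val rowRdd: RDD[Row] = df.rdd

  rowRdd.collect().foreach(println)               // prints Row objects, e.g. [alice,30]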


RDD Creation Examples