Spark's Resilient Distributed Dataset(RDD) is a fault tolerant collection of elements which can be processed in parallel. RDD can be processed in parallel because elements(dataset) are stored into multiple partitions and these partitions can be operated on in parallel.
sc.parallelize(l)
sc.textFile(path_to_file_or_directory)
sc.textFile(path_to_file_or_directory, number_of_partitions)
val orderRDD =sc.textFile("retail_db/orders/part-00000"); orders: org.apache.spark.rdd.RDD[String] = retail_db/orders/part-00000 MapPartitionsRDD[60] at textFile at:24
val zipRDD = sc.textFile("file:///Users/Username/Downloads/sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz")Note: Spark will by default create 1 partition for zip file.
val readSeqFile = sc.sequenceFile[Int,String]("file:///Users/Username/Desktop/spark-output/sequenceFile",4)Note: Sequence file to be read must have been stored with key value pairs in order to use 'sequenceFile' function to read it.
var orderdf=spark.read.csv("file:///Users/Username/Desktop/data_files/order.txt") orderdf.rdd.map(rec => rec.mkString("|*|")) res3: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[17] at map at:26
val databaseRDD = sc.parallelize(List((2,"teradata"), (1, "vertica"),(2, "oracle"), (1, "hive"), (2, "hbase"), (1, "cassandra"), (2, "yugabyte")))
val databaseRDD = sc.makeRDD(List((2,"teradata"), (1, "vertica"),(2, "oracle"), (1, "hive"), (2, "hbase"), (1, "cassandra"), (2, "yugabyte")))
val scalaFile = scala.io.Source.fromFile("/data/retail_db/products/part-00000").getLines.toList val scalaFileRDD = sc.parallelize(productsRaw)
val plainList = List("Hi", "whats going on", "Lets learn spark", "As part of this tutorial", "we will learn different ways to create RDD") val plainListRDD = sc.parallelize(plainList);
val enggStudent = List( ("IT",("Pradeep",50000.0)), ("CS",("CP",50200.0)), ("ECE",("Deepak",62200.0)), ("Mech",("Narpat",65000.0)) ) val enggStudentRDD = sc.makeRDD(enggStudent)
val rangeRDD = sc.parallelize(1 to 10, 3)
val sequenceRDD = sc.parallelize(Seq( ("Seq1", 1), ("Seq2", 2), ("Seq3", 3) ) )