Spark: Create RDDs

Spark's Resilient Distributed Dataset (RDD) is a fault-tolerant collection of elements that can be processed in parallel. An RDD can be processed in parallel because its elements are stored across multiple partitions, and those partitions can be operated on independently.


There are three ways to create an RDD in Spark; the first two use the SparkContext (sc):
  1. Parallelize an existing Scala collection using the 'parallelize' function, as shown in the first sketch after this list.
    sc.parallelize(l)
    
  2. Reference a dataset on external storage (such as HDFS, the local file system, S3, HBase, etc.) using functions like 'textFile' or 'sequenceFile', as shown in the second sketch after this list.
    • Syntax 1: Without specifying the number of partitions while reading the file
      sc.textFile(path_to_file_or_directory)
      
    • Syntax 2: Specifying the minimum number of partitions while reading the file
      sc.textFile(path_to_file_or_directory, number_of_partitions)
      
  3. Create an RDD from an existing DataFrame via its 'rdd' method, as shown in the third sketch after this list.
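
For the first approach, a minimal Scala sketch might look like the following; the app name, collection contents, and partition count are illustrative, not taken from any particular dataset:

  import org.apache.spark.{SparkConf, SparkContext}

  // Illustrative setup: a local SparkContext named sc (spark-shell already provides one)
  val conf = new SparkConf().setAppName("rdd-parallelize").setMaster("local[*]")
  val sc = new SparkContext(conf)

  // Turn an in-memory Scala collection into an RDD
  val nums = List(1, 2, 3, 4, 5)
  val numsRdd  = sc.parallelize(nums)        // let Spark choose the number of partitions
  val numsRdd4 = sc.parallelize(nums, 4)     // ask for 4 partitions explicitly

  println(numsRdd.count())                   // 5
  println(numsRdd4.getNumPartitions)         // 4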
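
A similar sketch for the second approach, reading an external dataset; the file path is a placeholder, and the second argument to 'textFile' is a minimum number of partitions:

  // Assumes the SparkContext sc from the previous sketch (or the one spark-shell provides).
  // The path is a placeholder; local, hdfs:// and s3a:// paths all work the same way.
  val lines  = sc.textFile("data/input.txt")      // default minimum partitions
  val lines8 = sc.textFile("data/input.txt", 8)   // request at least 8 partitions

  println(lines.getNumPartitions)
  println(lines8.take(2).mkString("\n"))          // peek at the first two lines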
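
And a sketch of the third approach, assuming a SparkSession; the DataFrame contents and column names are made up for illustration:

  import org.apache.spark.sql.{Row, SparkSession}
  import org.apache.spark.rdd.RDD

  // Illustrative setup: a SparkSession (reuses an existing SparkContext if one is running)
  val spark = SparkSession.builder().appName("df-to-rdd").master("local[*]").getOrCreate()
  import spark.implicits._

  // Build a small DataFrame, then expose its underlying RDD via the .rdd method
  val df = Seq(("alice", 30), ("bob", 25)).toDF("name", "age")
  val rowRdd: RDD[Row] = df.rdd

  rowRdd.collect().foreach(println)               // prints Row objects, e.g. [alice,30]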


RDD Creation Examples