PySpark: Overview and Setup (Mac)
This tutorial gives a high-level overview of Spark and shows how to set up Spark/PySpark on a Mac.
Spark is an in-memory processing framework that supports almost every storage system, from HDFS to cloud storage.
Spark is much faster than MapReduce (Hadoop) for the following reasons:
- In-memory processing: Spark processes data in memory and keeps it there for further processing.
- Lazy evaluation: Execution does not start until an action is triggered (see the sketch after the note below).
Note: MapReduce (Hadoop) is slow because it writes each intermediate result to disk and then reads it back for further processing.
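As a quick illustration of lazy evaluation, the PySpark sketch below builds a tiny in-memory DataFrame (the data, column names, and file name are made up for the example): the filter() transformation only records a plan, and nothing executes until the count() action is called.
## lazy_eval_demo.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()

# Small in-memory DataFrame; the rows and columns are illustrative only.
df = spark.createDataFrame(
    [("job-1", "OK"), ("job-2", "FAILED"), ("job-3", "OK")],
    ["job_id", "status"],
)

# Transformation: Spark just records the plan, nothing runs yet.
failed = df.filter(df.status == "FAILED")

# Action: this triggers the actual execution, entirely in memory.
print(failed.count())  # -> 1

spark.stop()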
➠ Hadoop Setup: If you want to use Hive with Spark, complete the Hadoop setup first. Steps are available on the Hadoop Setup Page.
➠ Hive Setup: If you want to use Hive with Spark, complete the Hive setup first. Steps are available on the Hive Setup Page.
PySpark Setup (Mac)
➠ Spark Setup: Click here to download the Spark binary, or download the required version directly from the Apache website: https://spark.apache.org/downloads.html. Place and extract the Spark package in the $HOME/hadoop directory.
Also set SPARK_HOME and PATH in the .profile file in your home directory (~/.profile) as shown below.
##.profile
export SPARK_HOME=$HOME/hadoop/spark-3.1.3-bin-hadoop3.2
export PATH=$PATH:$SPARK_HOME/bin
Note: Hive databases and tables can be accessed from Spark by copying the hive-site.xml file from $HIVE_HOME/conf to $SPARK_HOME/conf.
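Once hive-site.xml is copied over, a SparkSession created with Hive support enabled can query Hive objects directly. A minimal sketch follows; the database and table names are hypothetical.
## hive_access_demo.py
from pyspark.sql import SparkSession

# enableHiveSupport() makes Spark use the Hive metastore configured in
# $SPARK_HOME/conf/hive-site.xml.
spark = (
    SparkSession.builder
    .appName("hive-access-demo")
    .enableHiveSupport()
    .getOrCreate()
)

# List databases registered in the Hive metastore.
spark.sql("SHOW DATABASES").show()

# Query a Hive table ("mydb.sales" is a hypothetical example).
spark.sql("SELECT * FROM mydb.sales LIMIT 10").show()

spark.stop()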
➠ PySpark Shell
$ pyspark
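Inside the shell, a SparkSession is already available as spark (and a SparkContext as sc), so commands like the ones below can be run directly; the small DataFrame is just an illustrative example.
>>> spark.range(5).show()        # DataFrame with one "id" column, values 0-4
>>> df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
>>> df.printSchema()
>>> df.show()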
➠ Spark Version
$ pyspark --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.3
      /_/

Using Scala version 2.12.10, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_141
Branch HEAD
Compiled by user ubuntu on 2021-06-17T04:52:32Z
Revision 65ac1e75dc468f53fc778cd2ce1ba3f21067aab8
Url https://github.com/apache/spark