PySpark: Overview and Setup (Mac)
This tutorial gives a high-level overview of Spark and shows how to set up Spark/PySpark on a Mac.
Spark is an in-memory processing framework that supports almost every storage system, from HDFS to cloud storage.
Spark is much faster than MapReduce (Hadoop) for the following reasons:
- In-memory processing: Spark processes data in memory and keeps it there for further processing.
- Lazy evaluation: Execution does not start until an action is triggered (see the sketch after the note below).
Note: MapReduce (Hadoop) is slow because it writes each intermediate result to disk and then reads it back for further processing.
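As a quick illustration of lazy evaluation, the PySpark sketch below builds a tiny in-memory DataFrame (the data, column names, and file name are made up for the example): the filter() transformation only records a plan, and nothing executes until the count() action is called.
## lazy_eval_demo.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()

# Small in-memory DataFrame; the rows and columns are illustrative only.
df = spark.createDataFrame(
    [("job-1", "OK"), ("job-2", "FAILED"), ("job-3", "OK")],
    ["job_id", "status"],
)

# Transformation: Spark just records the plan, nothing runs yet.
failed = df.filter(df.status == "FAILED")

# Action: this triggers the actual execution, entirely in memory.
print(failed.count())  # -> 1

spark.stop()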
➠ Hadoop Setup: If you want to use Hive with Spark, complete the Hadoop setup first. Steps are available on the Hadoop Setup Page.
➠ Hive Setup: If you want to use Hive with Spark, complete the Hive setup first. Steps are available on the Hive Setup Page.
PySpark Setup (Mac)
➠ Spark Setup: Click here to download the Spark binary, or download the required version directly from the Apache website: https://spark.apache.org/downloads.html. Place and extract the Spark package in the $HOME/hadoop directory.
Also set SPARK_HOME and PATH in the .profile file in your home directory (~/.profile) as shown below.
##.profile
export SPARK_HOME=$HOME/hadoop/spark-3.1.3-bin-hadoop3.2
export PATH=$PATH:$SPARK_HOME/bin
Note: Hive databases and tables can be accessed from Spark by copying the hive-site.xml file from $HIVE_HOME/conf to $SPARK_HOME/conf.
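Once hive-site.xml is copied over, a SparkSession created with Hive support enabled can query Hive objects directly. A minimal sketch follows; the database and table names are hypothetical.
## hive_access_demo.py
from pyspark.sql import SparkSession

# enableHiveSupport() makes Spark use the Hive metastore configured in
# $SPARK_HOME/conf/hive-site.xml.
spark = (
    SparkSession.builder
    .appName("hive-access-demo")
    .enableHiveSupport()
    .getOrCreate()
)

# List databases registered in the Hive metastore.
spark.sql("SHOW DATABASES").show()

# Query a Hive table ("mydb.sales" is a hypothetical example).
spark.sql("SELECT * FROM mydb.sales LIMIT 10").show()

spark.stop()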
➠ PySpark Shell
$ pyspark
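Inside the shell, a SparkSession is already available as spark (and a SparkContext as sc), so commands like the ones below can be run directly; the small DataFrame is just an illustrative example.
>>> spark.range(5).show()        # DataFrame with one "id" column, values 0-4
>>> df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
>>> df.printSchema()
>>> df.show()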
➠ Spark Version
$ pyspark --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.3
      /_/

Using Scala version 2.12.10, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_141
Branch HEAD
Compiled by user ubuntu on 2021-06-17T04:52:32Z
Revision 65ac1e75dc468f53fc778cd2ce1ba3f21067aab8
Url https://github.com/apache/spark