PySpark: Overview and Setup (Mac)
This tutorial gives a high-level overview of Spark and shows how to set up Spark/PySpark on a Mac.
Spark is an in-memory processing framework that works with almost any storage system, from HDFS to cloud storage.
Spark is much faster than MapReduce (Hadoop) for the following reasons:
- In-memory processing: It processes data in memory and keeps it there for further processing.
- Lazy evaluation: Execution does not start until an action is triggered, which lets Spark optimize the whole chain of transformations before running it.
Note: MapReduce (Hadoop) is slow because it stores each intermediate result on disk and then reads it back for further processing.
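The lazy-evaluation idea can be illustrated without Spark. The sketch below is a plain-Python analogy built on generators, not Spark's API: the names load, double and log are made up for illustration. Building the pipeline plays the role of Spark transformations, and consuming it plays the role of an action.

```python
# Plain-Python analogy for lazy evaluation (this is NOT Spark code).
log = []  # records when work actually happens


def load(rows):
    """Hypothetical 'read' step: yields rows one at a time."""
    for row in rows:
        log.append(f"read {row}")
        yield row


def double(rows):
    """Hypothetical 'transformation': doubles each row lazily."""
    for row in rows:
        log.append(f"double {row}")
        yield row * 2


# "Transformations": building the pipeline does no work yet.
pipeline = double(load([1, 2, 3]))
work_before_action = len(log)

# "Action": consuming the pipeline triggers the whole chain,
# and each row flows through both steps without being stored in between.
result = list(pipeline)

print(work_before_action)  # 0
print(result)              # [2, 4, 6]
```

Nothing is read or doubled until list() asks for the results. Spark delays work in a similar way until an action such as count() or collect() is called.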
➠ Hadoop Setup:
If you want to use Hive with Spark, complete the Hadoop setup first. The steps are available in Hadoop Setup.
➠ Hive Setup:
If you want to use Hive with Spark, complete the Hive setup first. The steps are available in Hive Setup.
➠ Spark Setup:
Download the required version of the Spark binary from the Apache website https://spark.apache.org/downloads.html. Place and extract the Spark package in the $HOME/hadoop directory.
Then, in the .profile file in your home directory (~/.profile), set SPARK_HOME to the extracted Spark directory and add $SPARK_HOME/bin to PATH.
Note: Spark can access Hive databases and tables once the hive-site.xml file has been copied from HIVE_HOME/conf to SPARK_HOME/conf.
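After editing ~/.profile and opening a new terminal, a quick check such as the following may help confirm that the variables took effect. This is only a sketch: check_spark_env is a hypothetical helper name, and it assumes the Spark launch scripts live in SPARK_HOME/bin.

```python
import os


def check_spark_env(env=None):
    """Return a list of problems with SPARK_HOME / PATH (empty if none found)."""
    env = os.environ if env is None else env
    problems = []

    spark_home = env.get("SPARK_HOME")
    if not spark_home:
        problems.append("SPARK_HOME is not set")
        return problems

    # Assumption: the pyspark launcher sits in SPARK_HOME/bin.
    spark_bin = os.path.join(spark_home, "bin")
    path_entries = env.get("PATH", "").split(os.pathsep)

    # Normalize so a trailing slash in PATH does not cause a false alarm.
    normalized = {os.path.normpath(p) for p in path_entries if p}
    if os.path.normpath(spark_bin) not in normalized:
        problems.append(f"{spark_bin} is not on PATH")
    return problems
```

Calling check_spark_env() with no arguments inspects the current environment. An empty list means both settings look consistent.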
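The copy step from the note can also be scripted. The sketch below uses only the Python standard library. copy_hive_site is a hypothetical helper name, and the two arguments are assumed to be the directories that HIVE_HOME and SPARK_HOME point to.

```python
import shutil
from pathlib import Path


def copy_hive_site(hive_home, spark_home):
    """Copy hive-site.xml from HIVE_HOME/conf to SPARK_HOME/conf.

    Returns the path of the copied file.
    """
    src = Path(hive_home) / "conf" / "hive-site.xml"
    dst_dir = Path(spark_home) / "conf"

    if not src.is_file():
        raise FileNotFoundError(f"{src} not found; complete the Hive setup first")

    # Create SPARK_HOME/conf if it does not exist yet.
    dst_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(src, dst_dir / "hive-site.xml"))
```

A typical call would be copy_hive_site(os.environ["HIVE_HOME"], os.environ["SPARK_HOME"]) after importing os, assuming both variables are set in the current shell.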
➠ PySpark Shell
➠ Spark Version
$ pyspark --version
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.3
Using Scala version 2.12.10, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_141
Compiled by user ubuntu on 2021-06-17T04:52:32Z
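The version can also be checked from Python. This is a minimal sketch that assumes the pyspark package is importable from the interpreter in use (for example inside the pyspark shell). installed_pyspark_version is a hypothetical helper name, and it returns None when the import fails.

```python
def installed_pyspark_version():
    """Return pyspark's version string, or None if pyspark cannot be imported."""
    try:
        import pyspark
    except ImportError:
        return None
    return pyspark.__version__


version = installed_pyspark_version()
print(version if version else "pyspark is not importable from this interpreter")
```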