Prerequisite
What?
- Apache Spark is a fast, general-purpose cluster computing system. It is used as a lightning-fast unified analytics engine for big data & Machine Learning applications.
- Apache Spark is an engine
- To process data in real time & in batch mode
- To respond in sub-seconds
- To perform in-memory processing
- According to the Spark documentation, it is an alternative to Hadoop MapReduce
- Up to 100 times faster than Hadoop MapReduce for in-memory processing
- About 10 times faster than Hadoop MapReduce when accessing data from disk
- Known for its high speed and ease of use
- It provides high-level APIs in Java, Scala, Python & R
- This article explains how to set up and use Apache Spark in a multi-node cluster environment.
Why?
- Apache Spark is used for distributed, in-memory processing and is popular due to the offerings below
Spark Core
- Allows consuming and processing data in batch mode
Spark Streaming
- Allows consuming and processing continuous data streams
Clients
- Interactive processing
Spark SQL
- Allows using SQL queries for structured data processing
MLlib
- A Machine Learning library that delivers high-quality algorithms
GraphX
- A library for graph processing
- Apache Spark supports multiple resource managers
Standalone
- A basic cluster manager that comes with the Spark compute engine. It provides basic functionalities like memory management, fault recovery, task scheduling and interaction with the cluster manager
Apache YARN
- The cluster manager for Hadoop
Apache Mesos
- Another general-purpose cluster manager
Kubernetes
- A general-purpose container orchestration platform
- Every developer needs a local environment to run and test Spark applications. This article explains the detailed steps for setting up a multi-node Apache Spark cluster.
How?
- When an application is submitted to Spark, it creates one driver process and multiple executor processes for the application across multiple nodes.
- The entire set of driver & executors is available to that application.
- The driver is responsible for analysing, distributing, scheduling, monitoring and maintaining all the information during the lifetime of the application.
- Each node can have multiple executors. Executors are responsible for executing the code assigned to them by the driver and report the status back to the driver (see the submit sketch below).
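For example, a minimal spark-submit sketch against the cluster built later in this article (my_app.py is a placeholder for your own script; the flags are standard spark-submit options for standalone mode):
# The driver starts where spark-submit is invoked; the master then launches
# executors on the worker nodes within the requested memory/core limits
spark-submit \
  --master spark://master.spark.com:7077 \
  --executor-memory 512m \
  --total-executor-cores 4 \
  my_app.py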
Hardware Required
- The steps in this article were tested with the below laptop configuration
RAM
: 12GB DDR4
Processor
: Intel Core i5, 4 cores
Graphic Processor
: NVIDIA GeForce 940MX
Software Required
- Windows 10
- VirtualBox
- Ubuntu 18.04.5 LTS ISO Image
- Apache Spark
Network Requirements
- Internet access to download packages
Implementation
Create Master and Worker Nodes
- Create machines with the below configuration in VirtualBox (refer to the Prerequisite section)
- Master - 2 vCPU, 3GB RAM, Ubuntu OS
- Node1 - 2 vCPU, 1GB RAM, Ubuntu OS
- Node2 - 2 vCPU, 1GB RAM, Ubuntu OS
Create Host-only network
- Apache Spark needs static IP addresses for the nodes to communicate with each other.
- VirtualBox has a Host-only network mode for communication between the host and guests.
- In simple words, nodes created with this network mode can communicate with each other, and the VirtualBox host machine (the laptop) can access all VMs connected to the host-only network.
- In the VirtualBox software, navigate to File -> Host Network Manager
- Click on Create -> Configure Adapter manually with
IPv4 Address
: 192.168.56.1
Network Mask
: 255.255.255.0
Assign Host-only network for Master and worker nodes
- Select the machine -> Settings
- Navigate to Network -> Adapter 1 and set as below
Assign NAT Network Adapter for Master and worker nodes
- Since internet access is needed on the nodes, a NAT network adapter is used
- Select Adapter 2 and configure as below
Verify network configuration
Start Master and Worker nodes from VirtualBox
Check network settings in each node
- Two Ethernet networks must be connected
- Click on Ethernet 1 Settings -> IPv4 -> Manual
- For Master/Driver
- Address : 192.168.56.101
- Network Mask : 255.255.255.0
- Gateway : 192.168.56.1
- For node1
- Address : 192.168.56.102
- Network Mask : 255.255.255.0
- Gateway : 192.168.56.1
- For node2
- Address : 192.168.56.103
- Network Mask : 255.255.255.0
- Gateway : 192.168.56.1
- Click on Ethernet 2 Settings -> IPv4 -> Automatic (DHCP) in all the nodes
Check network connectivity between Hosts and Nodes
Get all Host-only network IPs
- master - 192.168.56.101
- node1 - 192.168.56.102
- node2 - 192.168.56.103
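One way to verify connectivity is to ping the other host-only IPs from each node, for example from the master:
ping -c 2 192.168.56.102
ping -c 2 192.168.56.103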
Set hostname
- Open the hostname file on master, node1 and node2 and set their respective hostnames as below
sudo nano /etc/hostname
# Add below lines
* On master - master.spark.com
* On node1 - node1.spark.com
* On node2 - node2.spark.com
Add network information to the hosts file of master, node1 and node2
sudo nano /etc/hosts
# Add below lines
192.168.56.101 master.spark.com
192.168.56.102 node1.spark.com
192.168.56.103 node2.spark.com
- After adding these entries, reboot all the machines
reboot
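After the machines come back up, name resolution can be checked from any node, for example from the master:
ping -c 2 node1.spark.com
ping -c 2 node2.spark.com
hostname
# hostname should print the name set above, e.g. master.spark.com on the master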
Install java on all the nodes
sudo apt-get update
sudo apt-get install openjdk-11-jdk
java -version
Set up SSH only on the master
sudo apt-get install openssh-server openssh-client
ssh-keygen -t rsa -P ""
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
ssh-copy-id ubuntu@master.spark.com
ssh-copy-id ubuntu@node1.spark.com
ssh-copy-id ubuntu@node2.spark.com
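Passwordless login can then be confirmed from the master (the ubuntu user matches the one used in the ssh-copy-id commands above):
ssh ubuntu@node1.spark.com hostname
ssh ubuntu@node2.spark.com hostname
# Each command should print the remote hostname without asking for a password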
Download and install the latest Apache Spark on all the nodes
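The commands below assume the Spark 3.1.2 archive is already in the current directory; if not, it can be fetched first (URL assumed from the standard Apache archive layout):
wget https://archive.apache.org/dist/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz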
tar -xvf spark-3.1.2-bin-hadoop3.2.tgz
sudo mv spark-3.1.2-bin-hadoop3.2 /usr/local/spark
sudo nano ~/.bashrc
# Add below lines
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:/usr/local/spark/bin
source ~/.bashrc
Configure Master information on all the nodes
cd /usr/local/spark/conf
cp spark-env.sh.template spark-env.sh
sudo nano spark-env.sh
# Add below lines
export SPARK_MASTER_HOST=master.spark.com
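Optionally, worker resources can be capped in the same spark-env.sh to fit the small VMs used here (SPARK_WORKER_CORES and SPARK_WORKER_MEMORY are standard Spark standalone settings; the values below are only suggestions for the 2 vCPU / 1GB nodes):
export SPARK_WORKER_CORES=2
export SPARK_WORKER_MEMORY=512m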
Configure Worker information only on the Master
- In Spark 3.x the workers file replaces the slaves file used in older versions
sudo nano /usr/local/spark/conf/workers
# Add below lines
node1.spark.com
node2.spark.com
Disable firewall
sudo ufw disable
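If disabling the firewall entirely is not desired, an alternative sketch is to allow only the default Spark ports (7077 for the master, 8080/8081 for the web UIs) plus SSH; note, however, that executors also use random ephemeral ports unless they are pinned in the Spark configuration, which is why this article simply disables ufw:
sudo ufw allow 22
sudo ufw allow 7077
sudo ufw allow 8080
sudo ufw allow 8081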
Start Spark from the master
cd /usr/local/spark
./sbin/start-all.sh
Open Spark URL http://192.168.56.101:8080/
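Another quick check is jps (shipped with the JDK), which lists the running Java processes on each node:
jps
# Expected: a "Master" process on the master and a "Worker" process on node1/node2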
Configure Jupyter Notebook
pip install jupyter
sudo nano ~/.bashrc
# Add below lines
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
export PYSPARK_PYTHON=python3
export PATH=$SPARK_HOME/bin:$PATH:~/.local/bin
source ~/.bashrc
Run Jupyter Notebook
jupyter notebook --ip 0.0.0.0
Open Jupyter Notebook from URL http://192.168.56.101:8888/
- Start Coding
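With the exports above, another way to start coding is to launch pyspark pointed at the cluster; because PYSPARK_DRIVER_PYTHON is set to jupyter, this opens a notebook server acting as the PySpark driver, and the SparkSession is typically available as spark in new notebooks (a sketch, assuming the bashrc settings shown earlier):
pyspark --master spark://master.spark.com:7077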
View the application from the Spark Master URL
Great! iNNovationMerge hopes that you have understood how to set up a multi-node Apache Spark cluster on a laptop.