
Setting up a multi-node Apache Spark Cluster on a Laptop





Prerequisite

What?

  • Apache Spark is a fast, general-purpose cluster computing system. It is used as a lightning-fast unified analytics engine for big data and Machine Learning applications.
  • Apache Spark is an engine that can
    • Process data in real-time and batch mode
    • Respond in sub-second latencies
    • Perform in-memory processing
  • According to the Spark documentation, it is an alternative to Hadoop MapReduce
    • Up to 100 times faster than Hadoop MapReduce when data fits in memory
    • About 10 times faster when processing data from disk
    • Combines high speed with ease of use
  • It provides high-level APIs in Java, Scala, Python and R
  • This article explains how to set up and use Apache Spark in a multi-node cluster environment.

Why?

  • Apache Spark is used for distributed, in-memory processing and is popular for the components it offers
    • Spark Core - consumes and processes data in batch mode
    • Spark Streaming - consumes and processes continuous data streams
    • Clients - interactive processing through shells such as spark-shell and pyspark
    • Spark SQL - runs SQL queries for structured data processing
    • MLlib - Machine Learning library that delivers high-quality algorithms
    • GraphX - a library for graph processing
  • Apache Spark supports multiple resource managers (the sketch after this list shows how the choice appears when submitting an application)
    • Standalone - the basic cluster manager that ships with the Spark compute engine. It provides core functionality such as memory management, fault recovery, task scheduling and interaction with the cluster manager
    • Apache YARN - the cluster manager for Hadoop
    • Apache Mesos - another general-purpose cluster manager
    • Kubernetes - a general-purpose container orchestration platform
  • Every developer needs a local environment to run and test Spark applications. This article explains the detailed steps for setting up a multi-node Apache Spark cluster.
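  • As a rough sketch of how this choice surfaces to a developer, the same application can be pointed at any of these managers just by changing the --master URL passed to spark-submit. The file app.py below is a hypothetical placeholder for your own script, and the host names are illustrative.

spark-submit --master spark://master.spark.com:7077 app.py        # Standalone (used in this article)
spark-submit --master yarn app.py                                  # Apache YARN
spark-submit --master mesos://mesos-master:5050 app.py             # Apache Mesos
spark-submit --master k8s://https://<k8s-api-server>:6443 app.py   # Kubernetes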

How?

  • Block Diagram (Source: iNNovationMerge)
  • When an application is submitted to Spark, it creates one driver process and multiple executor processes for that application across the nodes.
  • The entire set of driver and executors is dedicated to that application.
  • The driver is responsible for analysing, distributing, scheduling and monitoring work, and maintains all application state during the lifetime of the application.
  • Each node can run multiple executors. Executors execute the code assigned to them by the driver and report the status back to the driver (the submission sketch after this list shows how these roles are parameterised).
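  • For illustration only, here is how these roles map onto a job submitted against the cluster built in this article. The memory and core values are arbitrary examples; pi.py is the example script bundled with the Spark distribution.

# the driver runs where spark-submit is launched (client mode);
# executors are started for the application on the worker nodes
spark-submit \
  --master spark://master.spark.com:7077 \
  --executor-memory 512m \
  --total-executor-cores 2 \
  /usr/local/spark/examples/src/main/python/pi.py 10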

Hardware Required

  • The steps in this article were tested with the below laptop configuration
  • RAM : 12GB DDR4
  • Processor : Intel Core i5, 4 cores
  • Graphic Processor : NVIDIA GeForce 940MX

Software Required

  • Oracle VM VirtualBox (to create the virtual machines)
  • Ubuntu OS (for the master and worker nodes)
  • OpenJDK 11
  • Apache Spark 3.1.2, pre-built for Hadoop 3.2
  • Python 3 with Jupyter Notebook

Network Requirements

  • Internet access to download packages

Implementation

Create Master and Worker Nodes

  • Create virtual machines with the below configuration in VirtualBox (refer to the Prerequisite section); a VBoxManage sizing sketch follows this list
  • Master - 2 vCPU, 3GB RAM, Ubuntu OS
  • Node1 - 2 vCPU, 1GB RAM, Ubuntu OS
  • Node2 - 2 vCPU, 1GB RAM, Ubuntu OS
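  • If you prefer the command line, roughly equivalent sizing can be done from the host's terminal with VBoxManage. This is a minimal sketch for the master only; attaching the disk and the Ubuntu ISO is still done through the normal VM creation flow.

# create and size the master VM (repeat with adjusted --memory for node1 and node2)
VBoxManage createvm --name master --ostype Ubuntu_64 --register
VBoxManage modifyvm master --cpus 2 --memory 3072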

Create Host-only network

  • Apache Spark needs static IP addresses to communicate between nodes.
  • VirtualBox has a Host-only network mode for communication between the host and the guests.
  • In simple words, nodes created with this network mode can communicate with each other, and the VirtualBox host machine (the laptop) can access all VMs connected to the host-only network.
  • In VirtualBox, navigate to File -> Host Network Manager
  • Host Network Manager (Source: iNNovationMerge)
  • Click on Create -> Configure Adapter manually with IPv4 Address : 192.168.56.1 and Network Mask : 255.255.255.0 (a VBoxManage equivalent is sketched after this list)
  • Configure Adapter manually (Source: iNNovationMerge)
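  • The same host-only interface can also be created from the host's terminal. The adapter name vboxnet0 is the typical default on Linux/macOS hosts and is an assumption here; it may differ on Windows.

# create a host-only interface and give it the 192.168.56.1/24 address
VBoxManage hostonlyif create
VBoxManage hostonlyif ipconfig vboxnet0 --ip 192.168.56.1 --netmask 255.255.255.0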

Assign Host-only network for Master and worker nodes

  • Select the machine -> Settings
  • Select machine (Source: iNNovationMerge)
  • Navigate to Network -> Adapter1 and set as below
  • Adapter 1 Settings (Source: iNNovationMerge)

Assign NAT Network Adapter for Master and worker nodes

  • Since the nodes need internet access, a NAT adapter is used
  • Select Adapter 2 and configure as below (a VBoxManage sketch covering both adapters follows this section)
  • Adapter 2 Settings (Source: iNNovationMerge)
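  • Both adapters can also be assigned per VM from the command line. This sketch assumes the VM is named master and the host-only interface is vboxnet0; repeat for node1 and node2.

# Adapter 1: host-only network, Adapter 2: NAT
VBoxManage modifyvm master --nic1 hostonly --hostonlyadapter1 vboxnet0
VBoxManage modifyvm master --nic2 nat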

Verify network configuration

  • verify Network (Source: iNNovationMerge)

Start Master and Worker nodes from VirtualBox

  • Start machine (Source: iNNovationMerge)

Check network settings in each node

  • Both Ethernet adapters must show as connected
    • Network settings (Source: iNNovationMerge)
  • Click on Ethernet 1 Settings -> IPv4 -> Manual and set the addresses below (an nmcli alternative is sketched after this list)
    • Ethernet 1 Settings (Source: iNNovationMerge)
    • For Master/Driver
      • Address : 192.168.56.101
      • Network Mask : 255.255.255.0
      • Gateway : 192.168.56.1
    • For node1
      • Address : 192.168.56.102
      • Network Mask : 255.255.255.0
      • Gateway : 192.168.56.1
    • For node2
      • Address : 192.168.56.103
      • Network Mask : 255.255.255.0
      • Gateway : 192.168.56.1
  • Click on Ethernet 2 Settings -> IPv4 -> Automatic(DHCP) in all the nodes
    • Ethernet 2 Settings (Source: iNNovationMerge)
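  • On Ubuntu desktop the same static address can be set with nmcli instead of the Settings dialog. The connection name "Wired connection 1" is an assumption; check yours with nmcli con show. Shown for the master, adjust the address per node.

nmcli con mod "Wired connection 1" ipv4.method manual \
  ipv4.addresses 192.168.56.101/24 ipv4.gateway 192.168.56.1
nmcli con up "Wired connection 1"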

Check network connectivity between Hosts and Nodes

  • Connectivity (Source: iNNovationMerge)
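  • A quick way to verify the same thing from a terminal is to ping each host-only address from the laptop and from each node, for example:

ping -c 3 192.168.56.101
ping -c 3 192.168.56.102
ping -c 3 192.168.56.103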

Get all Host-only network IPs

  • master - 192.168.56.101
  • node1 - 192.168.56.102
  • node2 - 192.168.56.103

Set hostname

  • Open the hostname file on master, node1 and node2 and set the respective hostname as below
  • Set Hostname (Source: iNNovationMerge)
sudo nano /etc/hostname

# Set the file's content to the machine's own name
# On master : master.spark.com
# On node1  : node1.spark.com
# On node2  : node2.spark.com
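  • Alternatively, hostnamectl applies the new name immediately without editing the file by hand, for example on the master:

sudo hostnamectl set-hostname master.spark.com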

Add network information to hosts file of master, node1 and node2

sudo nano /etc/hosts

# Add below lines
192.168.56.101	master.spark.com
192.168.56.102	node1.spark.com
192.168.56.103	node2.spark.com
  • After adding these entries, reboot all the machines
reboot

Install java on all the nodes

# OpenJDK 11 is available from the default Ubuntu repositories, so no third-party PPA is needed
sudo apt-get update

sudo apt-get install openjdk-11-jdk

java -version

Setup SSH only in master

  • SSH Setup (Source: iNNovationMerge)
sudo apt-get install openssh-server openssh-client

# generate a passphrase-less key pair on the master
ssh-keygen -t rsa -P ""

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

# copy the public key to every node so the master can log in without a password
ssh-copy-id ubuntu@master.spark.com

ssh-copy-id ubuntu@node1.spark.com

ssh-copy-id ubuntu@node2.spark.com
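  • To confirm passwordless SSH works before going further, each of these should print the remote hostname without asking for a password (ubuntu is the same account used by ssh-copy-id above):

ssh ubuntu@node1.spark.com hostname
ssh ubuntu@node2.spark.com hostname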

Download and install Apache Spark on all the nodes

# download the pre-built package (Spark 3.1.2 for Hadoop 3.2)
wget https://archive.apache.org/dist/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz

tar -xvf spark-3.1.2-bin-hadoop3.2.tgz

sudo mv spark-3.1.2-bin-hadoop3.2 /usr/local/spark

sudo nano ~/.bashrc

# Add below lines (SPARK_HOME is also used later by the Jupyter configuration)
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:/usr/local/spark/bin

source ~/.bashrc

Configure Master information on all the nodes

cd /usr/local/spark/conf

cp spark-env.sh.template spark-env.sh

sudo nano spark-env.sh

# Add below lines
export SPARK_MASTER_HOST=master.spark.com
Configure Slaves/Worker information only on Master

# In Spark 3.x the worker list lives in conf/workers (older releases used conf/slaves)
cd /usr/local/spark/conf
cp workers.template workers
sudo nano workers

# Add below lines
node1.spark.com
node2.spark.com

Disable firewall

sudo ufw disable
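  • Disabling the firewall is the simplest option for a local lab. If you would rather keep ufw enabled, one possible (untested here) alternative is to allow traffic only from the host-only subnet:

sudo ufw allow from 192.168.56.0/24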

Start spark from master

cd /usr/local/spark

./sbin/start-all.sh
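  • To confirm the daemons started, jps (bundled with the JDK) can be run on each machine:

jps   # expect a "Master" process on the master node and a "Worker" process on node1/node2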

Open Spark URL http://192.168.56.101:8080/

  • Spark URL (Source: iNNovationMerge)

Configure Jupyter Notebook

pip3 install jupyter

sudo nano ~/.bashrc

# Add below lines
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64   # default OpenJDK 11 path on Ubuntu; adjust if different
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
# if "import pyspark" fails on py4j, also prepend $SPARK_HOME/python/lib/py4j-*-src.zip to PYTHONPATH
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
export PYSPARK_PYTHON=python3
export PATH=$SPARK_HOME/bin:$PATH:~/.local/bin:$JAVA_HOME/bin

source ~/.bashrc

Run Jupyter Notebook

jupyter notebook --ip 0.0.0.0
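  • Because PYSPARK_DRIVER_PYTHON is set to jupyter, one possible alternative is to let the pyspark launcher start the notebook itself, which also records the master URL for any SparkSession created inside the notebook (the inline option override is a sketch, not the article's tested path):

PYSPARK_DRIVER_PYTHON_OPTS="notebook --ip 0.0.0.0" pyspark --master spark://master.spark.com:7077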

Open Jupyter Notebook from URL http://192.168.56.101:8888/

  • Start Coding
  • Jupyter Notebook (Source: iNNovationMerge)

View application from Spark Master URL

  • Application (Source: iNNovationMerge)

Great, iNNovationMerge hopes that you have understood how to set up a multi-node Apache Spark cluster on a laptop.

