Apache Spark is a distributed computing system. It consists of a master and one or more slaves, where the master distributes work among the slaves, allowing many computers to cooperate on a single task. One could guess that this is indeed a powerful tool for workloads that require heavy computation but can be split into smaller chunks of work that are pushed out to the slaves. Once our cluster is up and running, we can write programs to run on it in Python, Java, and Scala.
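To give a sense of what such a program looks like, below is a minimal sketch of a Scala application that connects to a standalone master and sums a range of numbers across the cluster. The object name and the master URL (`spark://localhost:7077`, the default standalone port on a local install) are illustrative assumptions and are not part of the installation steps that follow.

```scala
import org.apache.spark.sql.SparkSession

object SimpleSum {
  def main(args: Array[String]): Unit = {
    // Connect to a standalone master; the URL below assumes the master
    // runs locally on the default port 7077.
    val spark = SparkSession.builder()
      .appName("SimpleSum")
      .master("spark://localhost:7077")
      .getOrCreate()

    // Distribute the numbers 1..1000 across the cluster and add them up.
    val sum = spark.sparkContext.parallelize(1 to 1000).reduce(_ + _)
    println(s"Sum of 1..1000 = $sum")

    spark.stop()
  }
}
```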
In this tutorial we will work on a single machine running Red Hat Enterprise Linux 8, and will install the Spark master and slave on the same machine. Keep in mind, however, that the steps describing the slave setup can be applied to any number of computers, creating a real cluster that can process heavy workloads. We’ll also add the necessary systemd unit files for management, and run a simple example that ships with the Spark package against the cluster to ensure our system is operational.
In this tutorial you will learn:
- How to install Spark master and slave
- How to add systemd unit files
- How to verify successful master-slave connection
- How to run a simple example job on the cluster