Date: 2020-01-14
Apache Spark is a distributed computing system. It consists of a master and one or more slaves, where the master distributes the work among the slaves, giving us the ability to use many computers to work on one task. This makes it a powerful tool for tasks that need large computations to complete but can be split into smaller chunks that can be pushed to the slaves to work on. Once our cluster is up and running, we can write programs to run on it in Python, Java, and Scala.
In this tutorial we will work on a single machine running Red Hat Enterprise Linux 8, and will install the Spark master and slave on the same machine. Keep in mind that the steps describing the slave setup can be applied to any number of computers, thus creating a real cluster that can process heavy workloads. We'll also add the necessary unit files for management, and run a simple example shipped with the distributed package against the cluster to ensure our system is operational.
In this tutorial you will learn:
- How to install Spark master and slave
- How to add systemd unit files
- How to verify successful master-slave connection
- How to run a simple example job on the cluster
Software Requirements and Conventions Used
|Category |Requirements, Conventions or Software Version Used
|System |Red Hat Enterprise Linux 8
|Software |Apache Spark 2.4.0
|Other |Privileged access to your Linux system as root or via the sudo command.
|Conventions |# - requires given linux commands to be executed with root privileges either directly as a root user or by use of the sudo command; $ - requires given linux commands to be executed as a regular non-privileged user
How to install Spark on Red Hat 8 step by step instructions
Apache Spark runs on the JVM (Java Virtual Machine), so a working Java 8 installation is required for the applications to run. Aside from that, there are multiple shells shipped within the package; one of them is pyspark, a Python-based shell. To work with that, you'll also need Python 2 installed and set up.
- To get the URL of Spark's latest package, we need to visit the Spark downloads site. We need to choose the mirror closest to our location, and copy the URL provided by the download site. This also means that your URL may be different from the below example. We'll install the package under /opt/, so we enter the directory as root:
# cd /opt
And feed the acquired URL to wget to get the package:
# wget https://www-eu.apache.org/dist/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
- We'll unpack the tarball:
# tar -xvf spark-2.4.0-bin-hadoop2.7.tgz
- And create a symlink to make our paths easier to remember in the next steps:
# ln -s /opt/spark-2.4.0-bin-hadoop2.7 /opt/spark
- We create a non-privileged user that will run both applications, master and slave:
# useradd spark
And set it as owner of the whole /opt/spark directory, recursively:
# chown -R spark:spark /opt/spark*
- We create a systemd unit file /etc/systemd/system/spark-master.service for the master service with the following content:
[Unit]
Description=Apache Spark Master
After=network.target

[Service]
Type=forking
User=spark
Group=spark
ExecStart=/opt/spark/sbin/start-master.sh
ExecStop=/opt/spark/sbin/stop-master.sh

[Install]
WantedBy=multi-user.target
And also one for the slave service, /etc/systemd/system/spark-slave.service, with the below contents:
[Unit]
Description=Apache Spark Slave
After=network.target

[Service]
Type=forking
User=spark
Group=spark
ExecStart=/opt/spark/sbin/start-slave.sh spark://rhel8lab.linuxconfig.org:7077
ExecStop=/opt/spark/sbin/stop-slave.sh

[Install]
WantedBy=multi-user.target
Note the Spark URL on the ExecStart line. It is constructed as spark://<hostname-or-ip-address-of-the-master>:7077. In this case the lab machine that will run the master has the hostname rhel8lab.linuxconfig.org. Your master's name will be different. Every slave must be able to resolve this hostname, and reach the master on the specified port, which is port 7077 by default.
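The URL scheme described above can be sketched in a few lines of Python. This is only an illustration, not part of the Spark package: the helper names spark_master_url, split_master_url and can_reach are our own, and the reachability check is a plain TCP connection attempt that a slave host could run before its service is started.

```python
import socket
from urllib.parse import urlsplit

DEFAULT_MASTER_PORT = 7077  # default master port mentioned in the tutorial


def spark_master_url(host, port=DEFAULT_MASTER_PORT):
    """Build spark://<hostname-or-ip-address-of-the-master>:<port>."""
    return f"spark://{host}:{port}"


def split_master_url(url):
    """Return (host, port) from a spark:// URL, defaulting the port to 7077."""
    parts = urlsplit(url)
    if parts.scheme != "spark" or not parts.hostname:
        raise ValueError(f"not a spark:// master URL: {url!r}")
    return parts.hostname, parts.port or DEFAULT_MASTER_PORT


def can_reach(host, port, timeout=3.0):
    """True if the name resolves and a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers resolution failures, refusals and timeouts
        return False


print(spark_master_url("rhel8lab.linuxconfig.org"))
```

On a slave machine, can_reach(*split_master_url(url)) returning False would suggest a name resolution or firewall problem between the slave and the master.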
- With the service files in place, we need to ask systemd to re-read them:
# systemctl daemon-reload
- We can start our Spark master with systemd:
# systemctl start spark-master.service
- To verify our master is running and functional, we can use systemd status:
# systemctl status spark-master.service
spark-master.service - Apache Spark Master
   Loaded: loaded (/etc/systemd/system/spark-master.service; disabled; vendor preset: disabled)
   Active: active (running) since Fri 2019-01-11 16:30:03 CET; 53min ago
  Process: 3308 ExecStop=/opt/spark/sbin/stop-master.sh (code=exited, status=0/SUCCESS)
  Process: 3339 ExecStart=/opt/spark/sbin/start-master.sh (code=exited, status=0/SUCCESS)
 Main PID: 3359 (java)
    Tasks: 27 (limit: 12544)
   Memory: 219.3M
   CGroup: /system.slice/spark-master.service
           3359 /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.181.b13-9.el8.x86_64/jre/bin/java -cp /opt/spark/conf/:/opt/spark/jars/* -Xmx1g org.apache.spark.deploy.master.Master --host [...]

Jan 11 16:30:00 rhel8lab.linuxconfig.org systemd[1]: Starting Apache Spark Master...
Jan 11 16:30:00 rhel8lab.linuxconfig.org start-master.sh[3339]: starting org.apache.spark.deploy.master.Master, logging to /opt/spark/logs/spark-spark-org.apache.spark.deploy.master.Master-1[...]
The last line also indicates the main logfile of the master, which is in the logs directory under the Spark base directory, /opt/spark in our case. By looking into this file, we should see a line near the end similar to the below example:
2019-01-11 14:45:28 INFO Master:54 - I have been elected leader! New state: ALIVE
We should also find a line that tells us where the Master interface is listening:
2019-01-11 16:30:03 INFO Utils:54 - Successfully started service 'MasterUI' on port 8080
If we point a browser to the host machine's port 8080, we should see the status page of the master, with no workers attached at the moment.
- If we receive a "connection refused" error message in the browser, we probably need to open the port on the firewall:
# firewall-cmd --zone=public --add-port=8080/tcp --permanent
success
# firewall-cmd --reload
success
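The two checks we made by eye in the master's logfile can also be scripted. The sketch below assumes only the two log line formats quoted in this tutorial; the function names are hypothetical, and the sample lines are the ones shown earlier rather than a real logfile.

```python
import re

# Text fragments taken from the master log lines quoted in the tutorial.
LEADER_MARKER = "I have been elected leader! New state: ALIVE"
UI_PORT_PATTERN = re.compile(r"Successfully started service 'MasterUI' on port (\d+)")


def master_is_alive(log_lines):
    """True if any line reports that the master was elected leader."""
    return any(LEADER_MARKER in line for line in log_lines)


def master_ui_port(log_lines):
    """Return the port of the master web interface, or None if not logged."""
    for line in log_lines:
        match = UI_PORT_PATTERN.search(line)
        if match:
            return int(match.group(1))
    return None


sample_log = [
    "2019-01-11 16:30:03 INFO Utils:54 - Successfully started service 'MasterUI' on port 8080",
    "2019-01-11 14:45:28 INFO Master:54 - I have been elected leader! New state: ALIVE",
]

print(master_is_alive(sample_log), master_ui_port(sample_log))
```

To use it on a real installation, pass the lines of the master logfile under /opt/spark/logs instead of sample_log. The port it reports is the one to open on the firewall if the status page is unreachable.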
- Our master is running, so we'll attach a slave to it. We start the slave service:
# systemctl start spark-slave.service
- We can verify that our slave is running with systemd:
# systemctl status spark-slave.service
spark-slave.service - Apache Spark Slave
   Loaded: loaded (/etc/systemd/system/spark-slave.service; disabled; vendor preset: disabled)
   Active: active (running) since Fri 2019-01-11 16:31:41 CET; 1h 3min ago
  Process: 3515 ExecStop=/opt/spark/sbin/stop-slave.sh (code=exited, status=0/SUCCESS)
  Process: 3537 ExecStart=/opt/spark/sbin/start-slave.sh spark://rhel8lab.linuxconfig.org:7077 (code=exited, status=0/SUCCESS)
 Main PID: 3554 (java)
    Tasks: 26 (limit: 12544)
   Memory: 176.1M
   CGroup: /system.slice/spark-slave.service
           3554 /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.181.b13-9.el8.x86_64/jre/bin/java -cp /opt/spark/conf/:/opt/spark/jars/* -Xmx1g org.apache.spark.deploy.worker.Worker [...]

Jan 11 16:31:39 rhel8lab.linuxconfig.org systemd[1]: Starting Apache Spark Slave...
Jan 11 16:31:39 rhel8lab.linuxconfig.org start-slave.sh[3537]: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-spar[...]
This output also provides the path to the logfile of the slave (or worker), which will be in the same directory, with "worker" in its name. By checking this file, we should see something similar to the below output:
2019-01-11 14:52:23 INFO Worker:54 - Connecting to master rhel8lab.linuxconfig.org:7077...
2019-01-11 14:52:23 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@62059f4a{/metrics/json,null,AVAILABLE,@Spark}
2019-01-11 14:52:23 INFO TransportClientFactory:267 - Successfully created connection to rhel8lab.linuxconfig.org/10.0.2.15:7077 after 58 ms (0 ms spent in bootstraps)
2019-01-11 14:52:24 INFO Worker:54 - Successfully registered with master spark://rhel8lab.linuxconfig.org:7077
This indicates that the worker is successfully connected to the master. In this same logfile we'll find a line that tells us the URL the worker is listening on:
2019-01-11 14:52:23 INFO WorkerWebUI:54 - Bound WorkerWebUI to 0.0.0.0, and started at http://rhel8lab.linuxconfig.org:8081
We can point our browser to the worker's status page, where its master is listed.
In the master's logfile, a verifying line should appear:
2019-01-11 14:52:24 INFO Master:54 - Registering worker 10.0.2.15:40815 with 2 cores, 1024.0 MB RAM
If we reload the master's status page now, the worker should appear there as well, with a link to its status page.
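When more slaves are attached, reading the master's logfile by eye gets tedious. As a rough sketch, the registration lines can be picked out with a regular expression. The pattern below is written against the single "Registering worker" line quoted in this tutorial, so treat it as an assumption about the log format of this Spark version; parse_registration is a hypothetical helper name.

```python
import re

# Pattern modelled on the master log line shown in the tutorial.
REGISTRATION_PATTERN = re.compile(
    r"Registering worker (?P<host>\S+):(?P<port>\d+) "
    r"with (?P<cores>\d+) cores, (?P<memory>[\d.]+ \w+) RAM"
)


def parse_registration(line):
    """Return worker details from a registration log line, or None."""
    match = REGISTRATION_PATTERN.search(line)
    if match is None:
        return None
    return {
        "host": match.group("host"),
        "port": int(match.group("port")),
        "cores": int(match.group("cores")),
        "memory": match.group("memory"),
    }


sample_line = (
    "2019-01-11 14:52:24 INFO Master:54 - "
    "Registering worker 10.0.2.15:40815 with 2 cores, 1024.0 MB RAM"
)

print(parse_registration(sample_line))
```

Applying parse_registration to every line of the master logfile and keeping the non-None results gives a quick list of the workers that have registered.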
- To run a simple task on the cluster, we execute one of the examples shipped with the package we downloaded. Consider the following simple textfile /opt/spark/test.file:
line1 word1 word2 word3
line2 word1
line3 word1 word2 word3 word4
We will execute the wordcount.py example on it, which will count the occurrences of every word in the file. We can use the spark user; no root privileges are needed.
$ /opt/spark/bin/spark-submit /opt/spark/examples/src/main/python/wordcount.py /opt/spark/test.file
2019-01-11 15:56:57 INFO SparkContext:54 - Submitted application: PythonWordCount
2019-01-11 15:56:57 INFO SecurityManager:54 - Changing view acls to: spark
2019-01-11 15:56:57 INFO SecurityManager:54 - Changing modify acls to: spark
[...]
As the task executes, a long output is provided. Close to the end of the output, the result calculated by the cluster is shown:
2019-01-11 15:57:05 INFO DAGScheduler:54 - Job 0 finished: collect at /opt/spark/examples/src/main/python/wordcount.py:40, took 1.619928 s
line3: 1
line2: 1
line1: 1
word4: 1
word1: 3
word3: 2
word2: 2
[...]
With this we have seen our Apache Spark in action. Additional slave nodes can be installed and attached to scale the computing power of our cluster.
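As a cross-check of the counts reported by the job, the same tally can be reproduced on a single machine with plain Python. This is not the wordcount.py shipped with Spark and it does not use the cluster at all. It is a minimal sketch that assumes the test file consists of three lines starting with line1, line2 and line3, as the job output suggests.

```python
from collections import Counter

# Assumed contents of /opt/spark/test.file (three lines, space separated).
SAMPLE_TEXT = """line1 word1 word2 word3
line2 word1
line3 word1 word2 word3 word4
"""


def count_words(text):
    """Count space-separated words across all lines of the text."""
    counts = Counter()
    for line in text.splitlines():
        # Skip empty strings produced by repeated or trailing spaces.
        counts.update(word for word in line.split(" ") if word)
    return counts


for word, count in sorted(count_words(SAMPLE_TEXT).items()):
    print(f"{word}: {count}")
```

The numbers should match the job output above. Only the ordering differs, since this sketch sorts the words before printing.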
