How to install Hadoop on RHEL 8 / CentOS 8 Linux

Apache Hadoop is an open source framework used for distributed storage as well as distributed processing of big data on clusters of computers running on commodity hardware. Hadoop stores data in the Hadoop Distributed File System (HDFS) and processes it using MapReduce. YARN provides an API for requesting and allocating resources in the Hadoop cluster.

The Apache Hadoop framework is composed of the following modules:

  • Hadoop Common
  • Hadoop Distributed File System (HDFS)
  • YARN
  • MapReduce

This article explains how to install Hadoop Version 2 on RHEL 8 or CentOS 8. We will install HDFS (Namenode and Datanode), YARN and MapReduce on a single node cluster in Pseudo Distributed Mode, which simulates a distributed cluster on a single machine. Each Hadoop daemon, such as hdfs, yarn and mapreduce, will run as a separate Java process.

In this tutorial you will learn:

  • How to add users for the Hadoop environment
  • How to install and configure the Oracle JDK
  • How to configure passwordless SSH
  • How to install Hadoop and configure the necessary xml files
  • How to start the Hadoop Cluster
  • How to access NameNode and ResourceManager Web UI

HDFS Architecture

HDFS Architecture.

Software Requirements and Conventions Used

Software Requirements and Linux Command Line Conventions

Category      Requirements, Conventions or Software Version Used
System        RHEL 8 / CentOS 8
Software      Hadoop 2.8.5, Oracle JDK 1.8
Other         Privileged access to your Linux system as root or via the sudo command.
Conventions   # – requires given Linux commands to be executed with root privileges, either directly as the root user or by use of the sudo command
              $ – requires given Linux commands to be executed as a regular non-privileged user

Add users for Hadoop Environment

Create the new user (a matching hadoop group is created by default) using the following commands:

# useradd hadoop
# passwd hadoop
[root@hadoop ~]# useradd hadoop
[root@hadoop ~]# passwd hadoop
Changing password for user hadoop.
New password:
Retype new password:
passwd: all authentication tokens updated successfully.
[root@hadoop ~]# cat /etc/passwd | grep hadoop
hadoop:x:1000:1000::/home/hadoop:/bin/bash
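
The remaining user-level steps in this guide are executed as the newly created hadoop user, so switch to it before continuing:

# su - hadoop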

Install and configure the Oracle JDK

Download the official jdk-8u202-linux-x64.rpm package from Oracle and install it:

[root@hadoop ~]# rpm -ivh jdk-8u202-linux-x64.rpm
warning: jdk-8u202-linux-x64.rpm: Header V3 RSA/SHA256 Signature, key ID ec551f03: NOKEY
Verifying...                          ################################# [100%]
Preparing...                          ################################# [100%]
Updating / installing...
   1:jdk1.8-2000:1.8.0_202-fcs        ################################# [100%]
Unpacking JAR files...
        tools.jar...
        plugin.jar...
        javaws.jar...
        deploy.jar...
        rt.jar...
        jsse.jar...
        charsets.jar...
        localedata.jar...


After installation, verify that Java has been configured successfully by running the following commands:

[root@hadoop ~]# java -version
java version "1.8.0_202"
Java(TM) SE Runtime Environment (build 1.8.0_202-b08)
Java HotSpot(TM) 64-Bit Server VM (build 25.202-b08, mixed mode)

[root@hadoop ~]# update-alternatives --config java

There is 1 program that provides 'java'.

  Selection    Command
-----------------------------------------------
*+ 1           /usr/java/jdk1.8.0_202-amd64/jre/bin/java
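
As an alternative to the Oracle JDK, OpenJDK 1.8 from the standard RHEL 8 / CentOS 8 repositories should also work, although this guide was tested with the Oracle JDK only; note that JAVA_HOME would then point to a different path, typically under /usr/lib/jvm.

# dnf install java-1.8.0-openjdk-devel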

Configure passwordless SSH

Install the OpenSSH server and client, or, if they are already installed, the following command will list the relevant packages:

[root@hadoop ~]# rpm -qa | grep openssh*
openssh-server-7.8p1-3.el8.x86_64
openssl-libs-1.1.1-6.el8.x86_64
openssl-1.1.1-6.el8.x86_64
openssh-clients-7.8p1-3.el8.x86_64
openssh-7.8p1-3.el8.x86_64
openssl-pkcs11-0.4.8-2.el8.x86_64
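
If the openssh-server and openssh-clients packages are not present, they can be installed from the standard repositories:

# dnf install openssh-server openssh-clients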

Generate public and private key pairs with the following command. The terminal will prompt for a file name. Press ENTER to accept the default location and proceed. After that, copy the public key from id_rsa.pub to authorized_keys.

$ ssh-keygen -t rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 640 ~/.ssh/authorized_keys
[hadoop@hadoop ~]$ ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
Created directory '/home/hadoop/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:H+LLPkaJJDD7B0f0Je/NFJRP5/FUeJswMmZpJFXoelg hadoop@hadoop.sandbox.com
The key's randomart image is:
+---[RSA 2048]----+
|     .. ..++*o .o|
|  o   .. +.O.+o.+|
|   + .  . * +oo==|
|  . o o  . E  .oo|
|   . = .S.* o    |
|    . o.o= o     |
|     . .. o      |
|       .o.       |
|       o+.       |
+----[SHA256]-----+
[hadoop@hadoop ~]$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
[hadoop@hadoop ~]$ chmod 640 ~/.ssh/authorized_keys

Verify the passwordless ssh configuration with the command:

$ ssh <hostname>
[hadoop@hadoop ~]$ ssh hadoop.sandbox.com
Web console: https://hadoop.sandbox.com:9090/ or https://192.168.1.108:9090/

Last login: Sat Apr 13 12:09:55 2019
[hadoop@hadoop ~]$

Install Hadoop and configure related xml files

Download and extract Hadoop 2.8.5 from the official Apache archive site.

# wget https://archive.apache.org/dist/hadoop/common/hadoop-2.8.5/hadoop-2.8.5.tar.gz
# tar -xzvf hadoop-2.8.5.tar.gz
[root@rhel8-sandbox ~]# wget https://archive.apache.org/dist/hadoop/common/hadoop-2.8.5/hadoop-2.8.5.tar.gz
--2019-04-13 11:14:03--  https://archive.apache.org/dist/hadoop/common/hadoop-2.8.5/hadoop-2.8.5.tar.gz
Resolving archive.apache.org (archive.apache.org)... 163.172.17.199
Connecting to archive.apache.org (archive.apache.org)|163.172.17.199|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 246543928 (235M) [application/x-gzip]
Saving to: ‘hadoop-2.8.5.tar.gz’

hadoop-2.8.5.tar.gz                       100%[=====================================================================================>] 235.12M  1.47MB/s    in 2m 53s

2019-04-13 11:16:57 (1.36 MB/s) - ‘hadoop-2.8.5.tar.gz’ saved [246543928/246543928]
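
The archive was downloaded and extracted as root, while the rest of this guide keeps the Hadoop installation under the hadoop user's home directory. Assuming the layout used throughout this article, move the extracted directory there and hand ownership to the hadoop user:

# mv hadoop-2.8.5 /home/hadoop/
# chown -R hadoop:hadoop /home/hadoop/hadoop-2.8.5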

Setting up the environment variables

Edit the ~/.bashrc of the hadoop user and append the following Hadoop environment variables:

export HADOOP_HOME=/home/hadoop/hadoop-2.8.5
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"

Source the .bashrc in the current login session:

$ source ~/.bashrc
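
As a quick sanity check that the variables took effect in the current shell:

$ echo $HADOOP_HOME
/home/hadoop/hadoop-2.8.5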

Edit the hadoop-env.sh file, which is in etc/hadoop inside the Hadoop installation directory, make the following changes, and check whether you want to change any other configuration.

export JAVA_HOME=${JAVA_HOME:-"/usr/java/jdk1.8.0_202-amd64"}
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-"/home/hadoop/hadoop-2.8.5/etc/hadoop"}

Configuration Changes in core-site.xml file

Edit core-site.xml with vim or any editor of your choice. The file is under etc/hadoop inside the Hadoop installation directory; add the following entries:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop.sandbox.com:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoop/hadooptmpdata</value>
  </property>
</configuration>

In addition, create the directory under the hadoop user's home folder:

$ mkdir hadooptmpdata
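
Since the Hadoop binaries are already on the hadoop user's PATH from the earlier .bashrc step, a quick way to confirm that the new setting is being picked up is the getconf subcommand:

$ hdfs getconf -confKey fs.defaultFS
hdfs://hadoop.sandbox.com:9000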

Configuration Changes in hdfs-site.xml file

Edit hdfs-site.xml, which is present in the same location, i.e. etc/hadoop inside the Hadoop installation directory, and create the Namenode/Datanode directories under the hadoop user's home directory. Note that dfs.name.dir and dfs.data.dir are accepted in Hadoop 2 as deprecated aliases of dfs.namenode.name.dir and dfs.datanode.data.dir; both forms work, but the newer names avoid deprecation warnings in the logs.

$ mkdir -p hdfs/namenode
$ mkdir -p hdfs/datanode
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>file:///home/hadoop/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>file:///home/hadoop/hdfs/datanode</value>
  </property>
</configuration>

Configuration Changes in mapred-site.xml file

Copy mapred-site.xml from mapred-site.xml.template using the cp command, then edit the mapred-site.xml placed in etc/hadoop under the Hadoop installation directory with the following changes.

$ cp mapred-site.xml.template mapred-site.xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

Configuration Changes in yarn-site.xml file

Edit yarn-site.xml with the following entries:

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>

Starting the Hadoop Cluster

Format the Namenode before using it for the first time. As the hadoop user, run the command below to format the Namenode.

$ hdfs namenode -format
[hadoop@hadoop ~]$ hdfs namenode -format
19/04/13 11:54:10 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   user = hadoop
STARTUP_MSG:   host = hadoop.sandbox.com/192.168.1.108
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 2.8.5
19/04/13 11:54:17 INFO namenode.FSNamesystem: dfs.namenode.safemode.threshold-pct = 0.9990000128746033
19/04/13 11:54:17 INFO namenode.FSNamesystem: dfs.namenode.safemode.min.datanodes = 0
19/04/13 11:54:17 INFO namenode.FSNamesystem: dfs.namenode.safemode.extension     = 30000
19/04/13 11:54:18 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.window.num.buckets = 10
19/04/13 11:54:18 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.num.users = 10
19/04/13 11:54:18 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.windows.minutes = 1,5,25
19/04/13 11:54:18 INFO namenode.FSNamesystem: Retry cache on namenode is enabled
19/04/13 11:54:18 INFO namenode.FSNamesystem: Retry cache will use 0.03 of total heap and retry cache entry expiry time is 600000 millis
19/04/13 11:54:18 INFO util.GSet: Computing capacity for map NameNodeRetryCache
19/04/13 11:54:18 INFO util.GSet: VM type       = 64-bit
19/04/13 11:54:18 INFO util.GSet: 0.029999999329447746% max memory 966.7 MB = 297.0 KB
19/04/13 11:54:18 INFO util.GSet: capacity      = 2^15 = 32768 entries
19/04/13 11:54:18 INFO namenode.FSImage: Allocated new BlockPoolId: BP-415167234-192.168.1.108-1555142058167
19/04/13 11:54:18 INFO common.Storage: Storage directory /home/hadoop/hdfs/namenode has been successfully formatted.
19/04/13 11:54:18 INFO namenode.FSImageFormatProtobuf: Saving image file /home/hadoop/hdfs/namenode/current/fsimage.ckpt_0000000000000000000 using no compression
19/04/13 11:54:18 INFO namenode.FSImageFormatProtobuf: Image file /home/hadoop/hdfs/namenode/current/fsimage.ckpt_0000000000000000000 of size 323 bytes saved in 0 seconds.
19/04/13 11:54:18 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
19/04/13 11:54:18 INFO util.ExitUtil: Exiting with status 0
19/04/13 11:54:18 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at hadoop.sandbox.com/192.168.1.108
************************************************************/

Once the Namenode has been formatted, start HDFS using the start-dfs.sh script.

$ start-dfs.sh 
[hadoop@hadoop ~]$ start-dfs.sh
Starting namenodes on [hadoop.sandbox.com]
hadoop.sandbox.com: starting namenode, logging to /home/hadoop/hadoop-2.8.5/logs/hadoop-hadoop-namenode-hadoop.sandbox.com.out
hadoop.sandbox.com: starting datanode, logging to /home/hadoop/hadoop-2.8.5/logs/hadoop-hadoop-datanode-hadoop.sandbox.com.out
Starting secondary namenodes [0.0.0.0]
The authenticity of host '0.0.0.0 (0.0.0.0)' can't be established.
ECDSA key fingerprint is SHA256:e+NfCeK/kvnignWDHgFvIkHjBWwghIIjJkfjygR7NkI.
Are you sure you want to continue connecting (yes/no)? yes
0.0.0.0: Warning: Permanently added '0.0.0.0' (ECDSA) to the list of known hosts.
hadoop@0.0.0.0's password:
0.0.0.0: starting secondarynamenode, logging to /home/hadoop/hadoop-2.8.5/logs/hadoop-hadoop-secondarynamenode-hadoop.sandbox.com.out

To start the YARN services, execute the YARN start script, i.e. start-yarn.sh.

$ start-yarn.sh
[hadoop@hadoop ~]$ start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /home/hadoop/hadoop-2.8.5/logs/yarn-hadoop-resourcemanager-hadoop.sandbox.com.out
hadoop.sandbox.com: starting nodemanager, logging to /home/hadoop/hadoop-2.8.5/logs/yarn-hadoop-nodemanager-hadoop.sandbox.com.out

To verify that all the Hadoop services/daemons started successfully, use the jps command.

$ jps
2033 NameNode
2340 SecondaryNameNode
2566 ResourceManager
2983 Jps
2139 DataNode
2671 NodeManager
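
When you are finished working with the cluster, the daemons can be stopped with the matching stop scripts:

$ stop-yarn.sh
$ stop-dfs.sh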

To check the current Hadoop version, use either of the commands below:

$ hadoop version

or

$ hdfs version
[hadoop@hadoop ~]$ hadoop version
Hadoop 2.8.5
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r 0b8464d75227fcee2c6e7f2410377b3d53d3d5f8
Compiled by jdu on 2018-09-10T03:32Z
Compiled with protoc 2.5.0
From source with checksum 9942ca5c745417c14e318835f420733
This command was run using /home/hadoop/hadoop-2.8.5/share/hadoop/common/hadoop-common-2.8.5.jar

[hadoop@hadoop ~]$ hdfs version
Hadoop 2.8.5
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r 0b8464d75227fcee2c6e7f2410377b3d53d3d5f8
Compiled by jdu on 2018-09-10T03:32Z
Compiled with protoc 2.5.0
From source with checksum 9942ca5c745417c14e318835f420733
This command was run using /home/hadoop/hadoop-2.8.5/share/hadoop/common/hadoop-common-2.8.5.jar
[hadoop@hadoop ~]$


HDFS Command Line Interface

To access HDFS and create directories on top of DFS, you can use the HDFS CLI.

$ hdfs dfs -mkdir /testdata
$ hdfs dfs -mkdir /hadoopdata
$ hdfs dfs -ls /
[hadoop@hadoop ~]$ hdfs dfs -ls /
Found 2 items
drwxr-xr-x   - hadoop supergroup          0 2019-04-13 11:58 /hadoopdata
drwxr-xr-x   - hadoop supergroup          0 2019-04-13 11:59 /testdata
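
To copy a local file into HDFS and read it back, using /etc/hosts purely as an example file:

$ hdfs dfs -put /etc/hosts /testdata/
$ hdfs dfs -ls /testdata
$ hdfs dfs -cat /testdata/hosts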

Access the Namenode and YARN from Browser

You can access both the NameNode and the YARN Resource Manager Web UI via any browser, such as Google Chrome or Mozilla Firefox.

Namenode Web UI – http://<hadoop cluster hostname/IP address>:50070

Namenode Web User Interface

Namenode Web User Interface.
HDFS Detail Information

HDFS Detail Information.
HDFS Directory Browsing

HDFS Directory Browsing.

The YARN Resource Manager (RM) web interface will display all running jobs on the current Hadoop cluster.

Resource Manager Web UI – http://<hadoop cluster hostname/IP address>:8088

Resource Manager(YARN) Web User Interface

Resource Manager(YARN) Web User Interface.
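
If firewalld is active (the RHEL 8 / CentOS 8 default) and you are accessing the Web UIs from another machine, the ports may need to be opened first; this assumes you kept the default firewalld zone:

# firewall-cmd --permanent --add-port=50070/tcp --add-port=8088/tcp
# firewall-cmd --reload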

Conclusion

The world is changing the way it operates, and Big Data is playing a major role in this phase. Hadoop is a framework that makes our life easier while working on large sets of data. There are improvements on all fronts. The future is exciting.