Apache Hadoop is an open-source framework for distributed storage and distributed processing of big data on clusters of commodity hardware. Hadoop stores data in the Hadoop Distributed File System (HDFS) and processes it with MapReduce. YARN provides the API for requesting and allocating resources in the Hadoop cluster.
The Apache Hadoop framework is composed of the following modules:
- Hadoop Common
- Hadoop Distributed File System (HDFS)
- YARN
- MapReduce
This article explains how to install Hadoop version 2 on RHEL 8 or CentOS 8. We will install HDFS (NameNode and DataNode), YARN and MapReduce on a single-node cluster in pseudo-distributed mode, which simulates a distributed cluster on a single machine. Each Hadoop daemon (NameNode, DataNode, ResourceManager, NodeManager and so on) runs as a separate Java process.
In this tutorial you will learn:
- How to add users for Hadoop Environment
- How to install and configure the Oracle JDK
- How to configure passwordless SSH
- How to install Hadoop and configure the related XML files
- How to start the Hadoop Cluster
- How to access NameNode and ResourceManager Web UI
Software Requirements and Conventions Used
Category | Requirements, Conventions or Software Version Used |
---|---|
System | RHEL 8 / CentOS 8 |
Software | Hadoop 2.8.5, Oracle JDK 1.8 |
Other | Privileged access to your Linux system as root or via the sudo command. |
Conventions | # – requires given linux commands to be executed with root privileges either directly as a root user or by use of sudo command. $ – requires given linux commands to be executed as a regular non-privileged user |
Add users for Hadoop Environment
Create the new user and group using the command:
# useradd hadoop
# passwd hadoop
[root@hadoop ~]# useradd hadoop
[root@hadoop ~]# passwd hadoop
Changing password for user hadoop.
New password:
Retype new password:
passwd: all authentication tokens updated successfully.
[root@hadoop ~]# cat /etc/passwd | grep hadoop
hadoop:x:1000:1000::/home/hadoop:/bin/bash
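Running useradd a second time for an account that already exists fails, which matters if you script this step. The following is a minimal bash sketch; the helper names user_exists and ensure_user are hypothetical and not part of any Hadoop or system tooling.

```shell
# Hypothetical helpers: create the account only when it is missing.
# ensure_user must be run as root, like the useradd command above.
user_exists() {
    id -u "$1" >/dev/null 2>&1
}

ensure_user() {
    if user_exists "$1"; then
        echo "user $1 already exists"
    else
        useradd "$1" && echo "user $1 created"
    fi
}

# Usage (as root): ensure_user hadoop
```

The password still has to be set separately with passwd, as shown above.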
Install and configure the Oracle JDK
Download the official jdk-8u202-linux-x64.rpm package from Oracle and install it with rpm.
[root@hadoop ~]# rpm -ivh jdk-8u202-linux-x64.rpm
warning: jdk-8u202-linux-x64.rpm: Header V3 RSA/SHA256 Signature, key ID ec551f03: NOKEY
Verifying... ################################# [100%]
Preparing... ################################# [100%]
Updating / installing...
1:jdk1.8-2000:1.8.0_202-fcs ################################# [100%]
Unpacking JAR files...
tools.jar...
plugin.jar...
javaws.jar...
deploy.jar...
rt.jar...
jsse.jar...
charsets.jar...
localedata.jar...
After installation, verify that Java has been configured successfully by running the following commands:
[root@hadoop ~]# java -version
java version "1.8.0_202"
Java(TM) SE Runtime Environment (build 1.8.0_202-b08)
Java HotSpot(TM) 64-Bit Server VM (build 25.202-b08, mixed mode)
[root@hadoop ~]# update-alternatives --config java
There is 1 program that provides 'java'.
Selection Command
-----------------------------------------------
*+ 1 /usr/java/jdk1.8.0_202-amd64/jre/bin/java
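The path reported by update-alternatives points at the java binary, while Hadoop later needs the JDK root directory as JAVA_HOME. As a rough sketch, the JDK root can be derived from the binary path with plain parameter expansion; java_home_from_bin is a hypothetical helper name, and the sketch assumes the binary lives under bin/ or jre/bin/ as in the output above.

```shell
# Derive a JAVA_HOME candidate from the path of the java binary by
# stripping a trailing /bin/java and, if present, a trailing /jre.
java_home_from_bin() {
    local bin="$1"
    bin="${bin%/bin/java}"
    bin="${bin%/jre}"
    printf '%s\n' "$bin"
}

# Usage: resolve the alternatives symlink first, then strip the suffix.
#   java_home_from_bin "$(readlink -f "$(command -v java)")"
```

For the binary path shown above this yields /usr/java/jdk1.8.0_202-amd64, the value used for JAVA_HOME later in this article.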
Configure passwordless SSH
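Install the OpenSSH server and the OpenSSH client. If they are already installed, the following command lists the packages shown below. Because grep reads the unquoted pattern openssh* as a regular expression, the openssl packages also appear in the output.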
[root@hadoop ~]# rpm -qa | grep openssh*
openssh-server-7.8p1-3.el8.x86_64
openssl-libs-1.1.1-6.el8.x86_64
openssl-1.1.1-6.el8.x86_64
openssh-clients-7.8p1-3.el8.x86_64
openssh-7.8p1-3.el8.x86_64
openssl-pkcs11-0.4.8-2.el8.x86_64
Generate a public/private key pair with the following command. The terminal prompts for a file name; press ENTER to accept the default and proceed. After that, append the public key from id_rsa.pub to authorized_keys.
$ ssh-keygen -t rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 640 ~/.ssh/authorized_keys
[hadoop@hadoop ~]$ ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
Created directory '/home/hadoop/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:H+LLPkaJJDD7B0f0Je/NFJRP5/FUeJswMmZpJFXoelg hadoop@hadoop.sandbox.com
The key's randomart image is:
+---[RSA 2048]----+
| .. ..++*o .o|
| o .. +.O.+o.+|
| + . . * +oo==|
| . o o . E .oo|
| . = .S.* o |
| . o.o= o |
| . .. o |
| .o. |
| o+. |
+----[SHA256]-----+
[hadoop@hadoop ~]$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
[hadoop@hadoop ~]$ chmod 640 ~/.ssh/authorized_keys
Verify the passwordless SSH configuration by connecting to the host:
$ ssh hadoop.sandbox.com
[hadoop@hadoop ~]$ ssh hadoop.sandbox.com
Web console: https://hadoop.sandbox.com:9090/ or https://192.168.1.108:9090/
Last login: Sat Apr 13 12:09:55 2019
[hadoop@hadoop ~]$
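An interactive login can hide a problem, because ssh silently falls back to asking for a password when key authentication fails. A non-interactive check is stricter. The sketch below assumes bash and a standard OpenSSH client; can_ssh_without_password is a hypothetical helper name.

```shell
# Returns 0 only when login to the given host works without any prompt.
# BatchMode=yes makes ssh fail instead of asking for a password or for
# confirmation of an unknown host key.
can_ssh_without_password() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true 2>/dev/null
}

# Usage:
#   if can_ssh_without_password hadoop.sandbox.com; then
#       echo "passwordless SSH works"
#   else
#       echo "passwordless SSH is NOT working"
#   fi
```

Note that the check also fails for a host whose key has not been accepted yet, so connect once interactively before relying on it.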
Install Hadoop and configure related xml files
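Download Hadoop 2.8.5 from the official Apache archive and extract it. The environment variables set in the next step assume that the extracted directory ends up at /home/hadoop/hadoop-2.8.5, so extract the archive in the hadoop user's home directory or move it there afterwards.
# wget https://archive.apache.org/dist/hadoop/common/hadoop-2.8.5/hadoop-2.8.5.tar.gz
# tar -xzvf hadoop-2.8.5.tar.gz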
[root@rhel8-sandbox ~]# wget https://archive.apache.org/dist/hadoop/common/hadoop-2.8.5/hadoop-2.8.5.tar.gz
--2019-04-13 11:14:03-- https://archive.apache.org/dist/hadoop/common/hadoop-2.8.5/hadoop-2.8.5.tar.gz
Resolving archive.apache.org (archive.apache.org)... 163.172.17.199
Connecting to archive.apache.org (archive.apache.org)|163.172.17.199|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 246543928 (235M) [application/x-gzip]
Saving to: ‘hadoop-2.8.5.tar.gz’
hadoop-2.8.5.tar.gz 100%[=====================================================================================>] 235.12M 1.47MB/s in 2m 53s
2019-04-13 11:16:57 (1.36 MB/s) - ‘hadoop-2.8.5.tar.gz’ saved [246543928/246543928]
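The download above is done as root, while the rest of the article runs Hadoop as the hadoop user from /home/hadoop. One possible way to bridge the two, sketched here under the assumption that you extract as root, is to unpack directly into the target directory and hand the files over to the hadoop user. The helper name extract_release is hypothetical.

```shell
# Extract a release tarball into a target directory and, when running as
# root and an owner is given, change ownership of the target directory.
extract_release() {
    local archive="$1" dest="$2" owner="$3"
    mkdir -p "$dest" || return 1
    tar -xzf "$archive" -C "$dest" || return 1
    if [ "$(id -u)" -eq 0 ] && [ -n "$owner" ]; then
        chown -R "$owner" "$dest"
    fi
}

# Usage (as root):
#   extract_release hadoop-2.8.5.tar.gz /home/hadoop hadoop:hadoop
```

With the usage line above, the installation ends up at /home/hadoop/hadoop-2.8.5.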
Setting up the environment variables
Edit ~/.bashrc for the hadoop user and add the following Hadoop environment variables:
export HADOOP_HOME=/home/hadoop/hadoop-2.8.5
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
Source the .bashrc file so that the variables take effect in the current login session.
$ source ~/.bashrc
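A typo in one of these exports usually shows up later as a "command not found" error. As a quick sanity check, the sketch below confirms that HADOOP_HOME points at a directory and that its bin directory is on PATH. The function name check_hadoop_env is hypothetical.

```shell
# Verify that HADOOP_HOME is set, exists, and that $HADOOP_HOME/bin is on PATH.
check_hadoop_env() {
    [ -n "$HADOOP_HOME" ] || { echo "HADOOP_HOME is not set" >&2; return 1; }
    [ -d "$HADOOP_HOME" ] || { echo "not a directory: $HADOOP_HOME" >&2; return 1; }
    case ":$PATH:" in
        *":$HADOOP_HOME/bin:"*) ;;
        *) echo "$HADOOP_HOME/bin is not on PATH" >&2; return 1 ;;
    esac
    echo "Hadoop environment looks consistent"
}

# Usage: check_hadoop_env
```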
Edit the hadoop-env.sh file, which is in etc/hadoop inside the Hadoop installation directory (that is, $HADOOP_HOME/etc/hadoop, not the system /etc), and make the following changes. Review the other settings in the file in case you want to change them as well.
export JAVA_HOME=${JAVA_HOME:-"/usr/java/jdk1.8.0_202-amd64"}
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-"/home/hadoop/hadoop-2.8.5/etc/hadoop"}
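Both lines use the shell's ${VAR:-default} expansion, which is worth understanding because it decides which JDK Hadoop actually uses. A small illustration, assuming bash; resolve_java_home is a hypothetical name used only for this demonstration.

```shell
# ${VAR:-default} expands to $VAR when it is set and non-empty,
# otherwise to the default. It never modifies VAR itself.
resolve_java_home() {
    echo "${JAVA_HOME:-/usr/java/jdk1.8.0_202-amd64}"
}

# Usage:
#   unset JAVA_HOME;     resolve_java_home   # falls back to the default path
#   JAVA_HOME=/opt/jdk;  resolve_java_home   # the existing value wins
```

In other words, a JAVA_HOME that is already exported in the environment overrides the path in hadoop-env.sh. If you want to pin the JDK regardless of the environment, assign the path directly instead, for example export JAVA_HOME=/usr/java/jdk1.8.0_202-amd64.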
Configuration Changes in core-site.xml file
Edit core-site.xml with vim or any other editor. The file is in etc/hadoop inside the Hadoop installation directory. Add the following entries:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoop.sandbox.com:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hadoop/hadooptmpdata</value>
</property>
</configuration>
In addition, create the hadooptmpdata directory under the hadoop user's home directory:
$ mkdir hadooptmpdata
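After editing, it can help to read a value back from the file to catch misspelled property names. The following sketch uses awk and assumes the layout used in this article, with each name and value element on its own line; it is not a general XML parser, and get_property is a hypothetical helper name.

```shell
# Print the value of a named property from a Hadoop *-site.xml file.
# Assumes one <name> or <value> element per line, as in this article.
get_property() {
    awk -v name="$2" '
        /<name>/ { n = $0; gsub(/.*<name>|<\/name>.*/, "", n); current = n }
        /<value>/ && current == name {
            v = $0; gsub(/.*<value>|<\/value>.*/, "", v); print v; exit
        }
    ' "$1"
}

# Usage:
#   get_property "$HADOOP_HOME/etc/hadoop/core-site.xml" fs.defaultFS
```

Once Hadoop is on the PATH, hdfs getconf -confKey fs.defaultFS reports the value that Hadoop itself resolves, which is the more authoritative check.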
Configuration Changes in hdfs-site.xml file
Edit hdfs-site.xml, which is in the same location (etc/hadoop inside the Hadoop installation directory). First create the NameNode and DataNode directories under the hadoop user's home directory:
$ mkdir -p hdfs/namenode
$ mkdir -p hdfs/datanode
Then add the following entries to hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>file:///home/hadoop/hdfs/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>file:///home/hadoop/hdfs/datanode</value>
</property>
</configuration>
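The directory values are written as file:// URIs, while mkdir works with plain paths, so a mismatch between the two is easy to introduce. As an illustrative sketch, the helpers below turn such a URI into a path and check that the directory exists and is writable by the current user. Both function names are hypothetical.

```shell
# Strip the file:// scheme from a URI such as file:///home/hadoop/hdfs/namenode.
uri_to_path() {
    printf '%s\n' "${1#file://}"
}

# Succeeds when the directory behind the URI exists and is writable.
check_storage_dir() {
    local dir
    dir="$(uri_to_path "$1")"
    [ -d "$dir" ] && [ -w "$dir" ]
}

# Usage (as the hadoop user):
#   check_storage_dir file:///home/hadoop/hdfs/namenode || echo "namenode dir missing"
#   check_storage_dir file:///home/hadoop/hdfs/datanode || echo "datanode dir missing"
```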
Configuration Changes in mapred-site.xml file
Copy mapred-site.xml.template to mapred-site.xml using the cp command, and then edit mapred-site.xml, located in etc/hadoop under the Hadoop installation directory, with the following changes.
$ cp mapred-site.xml.template mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Configuration Changes in yarn-site.xml file
Edit yarn-site.xml with the following entries.
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
Starting the Hadoop Cluster
Format the NameNode before using it for the first time. Run the following command as the hadoop user:
$ hdfs namenode -format
[hadoop@hadoop ~]$ hdfs namenode -format
19/04/13 11:54:10 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: user = hadoop
STARTUP_MSG: host = hadoop.sandbox.com/192.168.1.108
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 2.8.5
19/04/13 11:54:17 INFO namenode.FSNamesystem: dfs.namenode.safemode.threshold-pct = 0.9990000128746033
19/04/13 11:54:17 INFO namenode.FSNamesystem: dfs.namenode.safemode.min.datanodes = 0
19/04/13 11:54:17 INFO namenode.FSNamesystem: dfs.namenode.safemode.extension = 30000
19/04/13 11:54:18 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.window.num.buckets = 10
19/04/13 11:54:18 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.num.users = 10
19/04/13 11:54:18 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.windows.minutes = 1,5,25
19/04/13 11:54:18 INFO namenode.FSNamesystem: Retry cache on namenode is enabled
19/04/13 11:54:18 INFO namenode.FSNamesystem: Retry cache will use 0.03 of total heap and retry cache entry expiry time is 600000 millis
19/04/13 11:54:18 INFO util.GSet: Computing capacity for map NameNodeRetryCache
19/04/13 11:54:18 INFO util.GSet: VM type = 64-bit
19/04/13 11:54:18 INFO util.GSet: 0.029999999329447746% max memory 966.7 MB = 297.0 KB
19/04/13 11:54:18 INFO util.GSet: capacity = 2^15 = 32768 entries
19/04/13 11:54:18 INFO namenode.FSImage: Allocated new BlockPoolId: BP-415167234-192.168.1.108-1555142058167
19/04/13 11:54:18 INFO common.Storage: Storage directory /home/hadoop/hdfs/namenode has been successfully formatted.
19/04/13 11:54:18 INFO namenode.FSImageFormatProtobuf: Saving image file /home/hadoop/hdfs/namenode/current/fsimage.ckpt_0000000000000000000 using no compression
19/04/13 11:54:18 INFO namenode.FSImageFormatProtobuf: Image file /home/hadoop/hdfs/namenode/current/fsimage.ckpt_0000000000000000000 of size 323 bytes saved in 0 seconds.
19/04/13 11:54:18 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
19/04/13 11:54:18 INFO util.ExitUtil: Exiting with status 0
19/04/13 11:54:18 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at hadoop.sandbox.com/192.168.1.108
************************************************************/
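Formatting is a one-time step: formatting a NameNode that is already in use discards the existing file system metadata. If you script the setup, a guard like the following can prevent an accidental second format. It is only a sketch; it relies on the fact that a formatted NameNode directory contains a current/VERSION file, and the function names are hypothetical.

```shell
# A formatted NameNode storage directory contains current/VERSION.
namenode_is_formatted() {
    [ -f "$1/current/VERSION" ]
}

# Format only when the storage directory has not been formatted yet.
format_once() {
    if namenode_is_formatted "$1"; then
        echo "NameNode directory $1 is already formatted; skipping"
    else
        hdfs namenode -format
    fi
}

# Usage (as the hadoop user):
#   format_once /home/hadoop/hdfs/namenode
```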
Once the NameNode has been formatted, start HDFS using the start-dfs.sh script.
$ start-dfs.sh
[hadoop@hadoop ~]$ start-dfs.sh
Starting namenodes on [hadoop.sandbox.com]
hadoop.sandbox.com: starting namenode, logging to /home/hadoop/hadoop-2.8.5/logs/hadoop-hadoop-namenode-hadoop.sandbox.com.out
hadoop.sandbox.com: starting datanode, logging to /home/hadoop/hadoop-2.8.5/logs/hadoop-hadoop-datanode-hadoop.sandbox.com.out
Starting secondary namenodes [0.0.0.0]
The authenticity of host '0.0.0.0 (0.0.0.0)' can't be established.
ECDSA key fingerprint is SHA256:e+NfCeK/kvnignWDHgFvIkHjBWwghIIjJkfjygR7NkI.
Are you sure you want to continue connecting (yes/no)? yes
0.0.0.0: Warning: Permanently added '0.0.0.0' (ECDSA) to the list of known hosts.
hadoop@0.0.0.0's password:
0.0.0.0: starting secondarynamenode, logging to /home/hadoop/hadoop-2.8.5/logs/hadoop-hadoop-secondarynamenode-hadoop.sandbox.com.out
To start the YARN services, run the YARN start script, start-yarn.sh:
$ start-yarn.sh
[hadoop@hadoop ~]$ start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /home/hadoop/hadoop-2.8.5/logs/yarn-hadoop-resourcemanager-hadoop.sandbox.com.out
hadoop.sandbox.com: starting nodemanager, logging to /home/hadoop/hadoop-2.8.5/logs/yarn-hadoop-nodemanager-hadoop.sandbox.com.out
To verify that all the Hadoop daemons have started successfully, use the jps command.
$ jps
2033 NameNode
2340 SecondaryNameNode
2566 ResourceManager
2983 Jps
2139 DataNode
2671 NodeManager
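Reading the jps list by eye works for a single node, but the check is easy to automate. The sketch below reads jps output on standard input and prints any of the five daemons from the listing above that are missing. The function name missing_daemons is hypothetical.

```shell
# Read `jps` output on stdin and print each expected daemon that is absent.
missing_daemons() {
    local output daemon
    output="$(cat)"
    for daemon in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
        # -w matches whole words, so "NameNode" does not match "SecondaryNameNode".
        printf '%s\n' "$output" | grep -qw "$daemon" || echo "$daemon"
    done
}

# Usage: jps | missing_daemons
```

No output means all five daemons are running. If a daemon is listed as missing, its log file under $HADOOP_HOME/logs is the first place to look.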
To check the installed Hadoop version, use either of the following commands:
$ hadoop version
or
$ hdfs version
[hadoop@hadoop ~]$ hadoop version
Hadoop 2.8.5
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r 0b8464d75227fcee2c6e7f2410377b3d53d3d5f8
Compiled by jdu on 2018-09-10T03:32Z
Compiled with protoc 2.5.0
From source with checksum 9942ca5c745417c14e318835f420733
This command was run using /home/hadoop/hadoop-2.8.5/share/hadoop/common/hadoop-common-2.8.5.jar
[hadoop@hadoop ~]$ hdfs version
Hadoop 2.8.5
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r 0b8464d75227fcee2c6e7f2410377b3d53d3d5f8
Compiled by jdu on 2018-09-10T03:32Z
Compiled with protoc 2.5.0
From source with checksum 9942ca5c745417c14e318835f420733
This command was run using /home/hadoop/hadoop-2.8.5/share/hadoop/common/hadoop-common-2.8.5.jar
[hadoop@hadoop ~]$
HDFS Command Line Interface
To access HDFS and create some directories on it, you can use the HDFS CLI.
$ hdfs dfs -mkdir /testdata
$ hdfs dfs -mkdir /hadoopdata
$ hdfs dfs -ls /
[hadoop@hadoop ~]$ hdfs dfs -ls /
Found 2 items
drwxr-xr-x - hadoop supergroup 0 2019-04-13 11:58 /hadoopdata
drwxr-xr-x - hadoop supergroup 0 2019-04-13 11:59 /testdata
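Creating directories only exercises the NameNode. Writing a file and reading it back also confirms that the DataNode stores and serves blocks. A minimal round-trip sketch, assuming the /testdata directory created above exists; the function name hdfs_roundtrip and the file name hello.txt are made up for this example.

```shell
# Upload a small file to HDFS and print it back.
hdfs_roundtrip() {
    command -v hdfs >/dev/null 2>&1 || { echo "hdfs not found on PATH" >&2; return 1; }
    local tmp status=0
    tmp="$(mktemp)" || return 1
    echo "hello hdfs" > "$tmp"
    # -f overwrites the target if it is left over from an earlier run.
    hdfs dfs -put -f "$tmp" /testdata/hello.txt \
        && hdfs dfs -cat /testdata/hello.txt \
        || status=1
    rm -f "$tmp"
    return "$status"
}

# Usage (as the hadoop user, with HDFS running): hdfs_roundtrip
```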
Access the Namenode and YARN from Browser
You can access the web UIs for both the NameNode and the YARN ResourceManager from any browser, such as Google Chrome or Mozilla Firefox.
Namenode Web UI – http://<hadoop cluster hostname/IP address>:50070
The YARN Resource Manager (RM) web interface displays all running jobs on the current Hadoop cluster.
Resource Manager Web UI – http://<hadoop cluster hostname/IP address>:8088
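On a server without a desktop browser, the same two ports can be probed from the command line. The sketch below assumes curl is installed; ui_url and ui_status are hypothetical helper names, and the ports are the ones listed above.

```shell
# Build the URL of a web UI from a host and a port.
ui_url() {
    printf 'http://%s:%s/\n' "$1" "$2"
}

# Print the HTTP status code returned by the UI. curl prints 000 when
# no connection could be made.
ui_status() {
    curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$(ui_url "$1" "$2")"
}

# Usage:
#   ui_status hadoop.sandbox.com 50070   # NameNode
#   ui_status hadoop.sandbox.com 8088    # ResourceManager
```

Any status other than 000 means that something is listening on the port. If the UI works locally but not from another machine, check the firewall rules on the Hadoop host.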
Conclusion
Big data is playing a major role in the way the world operates today. Hadoop is a framework that makes life easier when working with large data sets. There are improvements on all fronts, and the future is exciting.