Apache Hadoop is an open source framework for the distributed storage and distributed processing of big data on clusters of computers built from commodity hardware. Hadoop stores data in the Hadoop Distributed File System (HDFS) and processes it using MapReduce. YARN provides an API for requesting and allocating resources in the Hadoop cluster.

The Apache Hadoop framework is composed of the following modules:
  • Hadoop Common
  • Hadoop Distributed File System (HDFS)
  • YARN
  • MapReduce

This article explains how to install Hadoop version 2 on Ubuntu 18.04. We will install HDFS (NameNode and DataNode), YARN and MapReduce on a single-node cluster in pseudo-distributed mode, which simulates a distributed cluster on a single machine. Each Hadoop daemon (for HDFS, YARN, MapReduce, etc.) will run as a separate Java process.

In this tutorial you will learn:
  • How to add users for Hadoop Environment
  • How to install and configure the Oracle JDK
  • How to configure passwordless SSH
  • How to install Hadoop and configure necessary related xml files
  • How to start the Hadoop Cluster
  • How to access NameNode and ResourceManager Web UI
Namenode Web User Interface.

Software Requirements and Conventions Used

Software Requirements and Linux Command Line Conventions
Category      Requirements, Conventions or Software Version Used
System        Ubuntu 18.04
Software      Hadoop 2.8.5, Oracle JDK 1.8
Other         Privileged access to your Linux system as root or via the sudo command.
Conventions   # - requires the given Linux commands to be executed with root privileges, either directly as the root user or by use of the sudo command
              $ - requires the given Linux commands to be executed as a regular non-privileged user

Add users for Hadoop Environment



Create the new user and group using the command:

# adduser hadoop
Add New User for Hadoop.
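
If you run the setup more than once, it can help to check whether the account already exists before creating it. The following is a minimal sketch, assuming the account is named hadoop (the name implied by the /home/hadoop paths used later in this article). The helpers user_exists and plan_user_step are hypothetical names introduced only for illustration.

```shell
# Hypothetical helper: succeeds if the named account exists on this system.
user_exists() {
    id -u "$1" >/dev/null 2>&1
}

# Hypothetical helper: prints what a setup script would do next for the account.
plan_user_step() {
    if user_exists "$1"; then
        echo "skip: user $1 already exists"
    else
        echo "run as root: adduser $1"
    fi
}
```

For example, plan_user_step hadoop prints the skip message once the user has been created.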

Install and configure the Oracle JDK

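Download the Java archive and extract it under the /opt directory. Either change into /opt (with the archive already copied there) and extract it:

# cd /opt
# tar -xzvf jdk-8u192-linux-x64.tar.gz

or extract it from the download directory straight into /opt:

# tar -xzvf jdk-8u192-linux-x64.tar.gz -C /opt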

To set JDK 1.8 Update 192 as the default JVM, use the following commands:

# update-alternatives --install /usr/bin/java java /opt/jdk1.8.0_192/bin/java 100
# update-alternatives --install /usr/bin/javac javac /opt/jdk1.8.0_192/bin/javac 100

After installation, verify that Java has been configured successfully by running the following commands:

# update-alternatives --display java
# update-alternatives --display javac
OracleJDK Installation & Configuration.
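
Hadoop later needs the JDK installation directory (JAVA_HOME), not the path of the java binary itself. As a rough sketch, the directory can be derived from the binary path by stripping the trailing bin/java. The helper name jdk_home_from_java is hypothetical.

```shell
# Hypothetical helper: strips the trailing /bin/java from a java binary path
# to obtain the JDK installation directory.
jdk_home_from_java() {
    dirname "$(dirname "$1")"
}
```

A possible use is jdk_home_from_java "$(readlink -f "$(command -v java)")", which resolves the update-alternatives symlinks first. With the alternatives configured as above, this is expected to point at /opt/jdk1.8.0_192.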

Configure passwordless SSH



Install the OpenSSH server and OpenSSH client with the command:

# apt-get install openssh-server openssh-client

Generate a public/private key pair with the following command. The terminal will prompt for the file in which to save the key; press ENTER to accept the default, and press ENTER again at the passphrase prompts to leave the passphrase empty. After that, append the public key from id_rsa.pub to authorized_keys.

$ ssh-keygen -t rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Passwordless SSH Configuration.

Verify the passwordless SSH configuration with the command:

$ ssh localhost
Passwordless SSH Check.
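
If ssh localhost still asks for a password, one common cause is overly permissive file modes: with its default StrictModes setting, sshd ignores an authorized_keys file that other users can write to. The sketch below tightens the permissions. The helper name fix_ssh_perms is hypothetical, and it assumes the key files live in ~/.ssh unless another directory is passed.

```shell
# Hypothetical helper: tightens permissions on an SSH directory so that
# sshd (with its default StrictModes setting) will accept the key file.
#   $1 = SSH directory (defaults to ~/.ssh)
fix_ssh_perms() {
    ssh_dir="${1:-$HOME/.ssh}"
    chmod 700 "$ssh_dir" &&
    chmod 600 "$ssh_dir/authorized_keys"
}
```

After the host key has been accepted once, ssh -o BatchMode=yes localhost true can serve as a non-interactive check: in batch mode ssh fails instead of prompting for a password.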

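Download Hadoop 2.8.5 from the official Apache website and extract it, as the hadoop user, in that user's home directory so that it ends up in /home/hadoop/hadoop-2.8.5.

$ tar -xzvf hadoop-2.8.5.tar.gz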

Setting up the environment variables



Edit the ~/.bashrc file of the hadoop user and add the following Hadoop environment variables:

export HADOOP_HOME=/home/hadoop/hadoop-2.8.5
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"

Source the .bashrc file in the current login session.

$ source ~/.bashrc
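
To confirm that the new variables took effect, you can check that the Hadoop bin and sbin directories are now part of PATH. This is a small sketch rather than an official Hadoop tool. The helper names path_contains and check_hadoop_path are hypothetical.

```shell
# Hypothetical helper: succeeds if the given directory is an entry in PATH.
path_contains() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Hypothetical helper: reports whether the Hadoop bin and sbin directories
# are reachable through PATH. Relies on HADOOP_HOME being set.
check_hadoop_path() {
    for d in "$HADOOP_HOME/bin" "$HADOOP_HOME/sbin"; do
        if path_contains "$d"; then
            echo "ok: $d"
        else
            echo "missing: $d"
        fi
    done
}
```

Running check_hadoop_path in a fresh shell should print two ok lines if .bashrc was edited as described.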

Edit the hadoop-env.sh file, which is located in etc/hadoop inside the Hadoop installation directory, and make the following changes. Also check whether you want to change any other settings.

export JAVA_HOME=/opt/jdk1.8.0_192
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-"/home/hadoop/hadoop-2.8.5/etc/hadoop"}
Changes in hadoop-env.sh File.

Configuration Changes in core-site.xml file

Edit core-site.xml with vim or any other editor. The file is located in etc/hadoop inside the Hadoop installation directory. Add the following entries.

<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hadoop/hadooptmpdata</value>
</property>
</configuration>

In addition, create this directory under the hadoop user's home directory.

$ mkdir ~/hadooptmpdata
Configuration For core-site.xml File.
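
If you prefer to script this step instead of editing the file by hand, the same two properties can be written with a here-document. This is only a sketch. The helper name write_core_site is hypothetical, and it overwrites the target file, so keep a backup of any existing configuration.

```shell
# Hypothetical helper: writes a minimal core-site.xml (overwrites the file).
#   $1 = output file, $2 = fs.defaultFS URI, $3 = hadoop.tmp.dir path
write_core_site() {
    cat > "$1" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>$2</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>$3</value>
  </property>
</configuration>
EOF
}
```

With the values from this article, the call would be write_core_site "$HADOOP_HOME/etc/hadoop/core-site.xml" hdfs://localhost:9000 /home/hadoop/hadooptmpdata.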

Configuration Changes in hdfs-site.xml file



Edit hdfs-site.xml, which is located in the same place, i.e. etc/hadoop inside the Hadoop installation directory, and create the NameNode and DataNode directories under the hadoop user's home directory.

$ mkdir -p ~/hdfs/namenode
$ mkdir -p ~/hdfs/datanode

<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>file:///home/hadoop/hdfs/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>file:///home/hadoop/hdfs/datanode</value>
</property>
</configuration>
Configuration For hdfs-site.xml File.
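
In Hadoop configuration files, each property element carries exactly one name and one value. A quick way to catch a mismatched file is to compare the tag counts. The sketch below is not a real XML parser: it assumes one tag per line, as in the listings in this article, and the helper name check_property_pairs is hypothetical.

```shell
# Hypothetical helper: succeeds only if the file has as many <name> and
# <value> lines as <property> lines. Assumes one tag per line; this is a
# line count, not XML validation.
check_property_pairs() {
    props=$(grep -c '<property>' "$1")
    names=$(grep -c '<name>' "$1")
    values=$(grep -c '<value>' "$1")
    [ "$props" -eq "$names" ] && [ "$props" -eq "$values" ]
}
```

For example, check_property_pairs "$HADOOP_HOME/etc/hadoop/hdfs-site.xml" && echo "looks consistent" prints the message only when the counts agree.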

Configuration Changes in mapred-site.xml file

Copy mapred-site.xml.template to mapred-site.xml using the cp command, then edit mapred-site.xml, located in etc/hadoop under the Hadoop installation directory, with the following changes.

$ cp mapred-site.xml.template mapred-site.xml
Creating the new mapred-site.xml File.
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Configuration For mapred-site.xml File.

Configuration Changes in yarn-site.xml file



Edit yarn-site.xml with the following entries.

<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
Configuration For yarn-site.xml File.

Starting the Hadoop Cluster

Format the NameNode before using it for the first time. As the hadoop user, run the command below to format the NameNode.

$ hdfs namenode -format
Format the Namenode.


Once the NameNode has been formatted, start HDFS using the start-dfs.sh script.

Starting the DFS Startup Script to start HDFS.

To start the YARN services, execute the YARN start script, i.e. start-yarn.sh.

Starting the YARN Startup Script to start YARN.
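
The two start scripts named above can be combined so that YARN is started only if HDFS came up. A minimal sketch, assuming the scripts are reachable through PATH (they live in $HADOOP_HOME/sbin, along with the matching stop-dfs.sh and stop-yarn.sh). The wrapper names start_cluster and stop_cluster are hypothetical.

```shell
# Hypothetical wrapper: starts HDFS first and YARN only if HDFS started.
start_cluster() {
    start-dfs.sh && start-yarn.sh
}

# Hypothetical wrapper: stops the services in the reverse order.
# Both stop scripts are always attempted.
stop_cluster() {
    stop-yarn.sh
    stop-dfs.sh
}
```

Run start_cluster as the hadoop user after the NameNode has been formatted.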

To verify that all the Hadoop services/daemons have started successfully, you can use the jps command.

$ /opt/jdk1.8.0_192/bin/jps
20035 SecondaryNameNode
19782 DataNode
21671 Jps
20343 NodeManager
19625 NameNode
20187 ResourceManager
Hadoop Daemons Output from JPS Command.
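
Rather than reading the jps listing by eye, you can filter it for the five daemons shown above. This is a sketch only. The helper name missing_daemons is hypothetical, and it simply looks for the daemon names as whole words in whatever text it receives on standard input.

```shell
# Hypothetical helper: reads jps output on stdin and prints every expected
# daemon that is not listed. Prints nothing when all five are present.
missing_daemons() {
    jps_output=$(cat)
    for daemon in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
        # -w matches whole words, so "NameNode" does not match "SecondaryNameNode"
        if ! printf '%s\n' "$jps_output" | grep -qw "$daemon"; then
            echo "$daemon"
        fi
    done
}
```

Typical use is /opt/jdk1.8.0_192/bin/jps | missing_daemons. Empty output means every expected daemon was found.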

To check the current Hadoop version, you can use either of the commands below:

$ hadoop version
or
$ hdfs version
Check Hadoop Version.

HDFS Command Line Interface



To access HDFS and create some directories on top of DFS, you can use the HDFS CLI.

$ hdfs dfs -mkdir /test
$ hdfs dfs -mkdir /hadooponubuntu
$ hdfs dfs -ls /
HDFS Directory Creation using HDFS CLI.
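
Beyond creating directories, a simple round trip (upload a file, then read it back) confirms that the DataNode is actually storing data. The following is a sketch under the assumption that the hdfs command is on PATH and the cluster is running. The function name hdfs_round_trip and the file name hello.txt are hypothetical.

```shell
# Hypothetical smoke test: uploads a small local file to HDFS and prints it
# back. Assumes the hdfs command is on PATH and the cluster is running.
#   $1 = HDFS directory to use (created if missing)
hdfs_round_trip() {
    local_file=$(mktemp) || return 1
    echo "hello hdfs" > "$local_file"
    hdfs dfs -mkdir -p "$1" &&
    hdfs dfs -put -f "$local_file" "$1/hello.txt" &&
    hdfs dfs -cat "$1/hello.txt"
    status=$?
    rm -f "$local_file"
    return $status
}
```

For example, hdfs_round_trip /test should print the line that was written to the temporary file if all three HDFS commands succeed.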

Access the Namenode and YARN from Browser

You can access the web UIs of both the NameNode and the YARN Resource Manager from any browser, such as Google Chrome or Mozilla Firefox.

Namenode Web UI - http://<hadoop cluster hostname/IP address>:50070

Namenode Web User Interface.
HDFS Details from Namenode Web User Interface.


HDFS Directory Browsing via Namenode Web User Interface.

The YARN Resource Manager (RM) web interface displays all running jobs on the current Hadoop cluster.

Resource Manager Web UI - http://<hadoop cluster hostname/IP address>:8088

Resource Manager Web User Interface.
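
On a server without a browser, the two web interfaces can also be probed from the command line. This sketch assumes curl is installed (it may need to be installed separately on Ubuntu) and uses the ports 50070 and 8088 listed above. The helper names web_ui_urls and http_status are hypothetical.

```shell
# Hypothetical helper: prints the NameNode and ResourceManager web UI URLs
# for a host, using the ports listed in this article.
#   $1 = hostname or IP address (defaults to localhost)
web_ui_urls() {
    host="${1:-localhost}"
    echo "http://$host:50070"
    echo "http://$host:8088"
}

# Hypothetical helper: prints the HTTP status code curl receives for a URL.
http_status() {
    curl -s -o /dev/null -w '%{http_code}' "$1"
}
```

A loop such as for url in $(web_ui_urls localhost); do echo "$url $(http_status "$url")"; done then shows one status code per interface.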

Conclusion

The world is changing the way it operates, and big data is playing a major role in this phase. Hadoop is a framework that makes our life easier when working with large sets of data. There are improvements on all fronts. The future is exciting.
