Apache Hadoop is an open source framework for distributed storage and distributed processing of big data on clusters of commodity hardware. Hadoop stores data in the Hadoop Distributed File System (HDFS), and this data is processed using MapReduce. YARN provides an API for requesting and allocating resources in the Hadoop cluster.
The Apache Hadoop framework is composed of the following modules:
- Hadoop Common
- Hadoop Distributed File System (HDFS)
- YARN
- MapReduce
This article explains how to install Hadoop Version 2 on Ubuntu 18.04. We will install HDFS (Namenode and Datanode), YARN, and MapReduce on a single-node cluster in Pseudo-Distributed Mode, which is a distributed simulation on a single machine. Each Hadoop daemon (HDFS, YARN, MapReduce, etc.) will run as a separate Java process.
In this tutorial you will learn:
- How to add users for Hadoop Environment
- How to install and configure the Oracle JDK
- How to configure passwordless SSH
- How to install Hadoop and configure necessary related xml files
- How to start the Hadoop Cluster
- How to access NameNode and ResourceManager Web UI
Software Requirements and Conventions Used
| Category | Requirements, Conventions or Software Version Used |
|---|---|
| System | Ubuntu 18.04 |
| Software | Hadoop 2.8.5, Oracle JDK 1.8 |
| Other | Privileged access to your Linux system as root or via the sudo command. |
| Conventions | # – requires given Linux commands to be executed with root privileges, either directly as the root user or by use of the sudo command; $ – requires given Linux commands to be executed as a regular non-privileged user |
Add users for Hadoop Environment
Create the new user and group using the command below (on Ubuntu, adduser also creates a matching group and home directory; here we call the user hadoop):
# adduser hadoop
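The remaining non-privileged steps in this tutorial are run as this user, so switch to it before continuing (assuming you named the user hadoop):
# su - hadoop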
Install and configure the Oracle JDK
Download and extract the Java archive under the /opt
directory.
# cd /opt
# tar -xzvf jdk-8u192-linux-x64.tar.gz
or
$ tar -xzvf jdk-8u192-linux-x64.tar.gz -C /opt
To set JDK 1.8 Update 192 as the default JVM, we will use the following commands:
# update-alternatives --install /usr/bin/java java /opt/jdk1.8.0_192/bin/java 100
# update-alternatives --install /usr/bin/javac javac /opt/jdk1.8.0_192/bin/javac 100
After installation, verify that Java has been successfully configured by running the following commands:
# update-alternatives --display java
# update-alternatives --display javac
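As an extra sanity check, you can also ask the java binary directly for its version; for JDK 8u192 it should report 1.8.0_192:
$ java -version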
Configure passwordless SSH
Install the OpenSSH server and OpenSSH client with the command:
# apt-get install openssh-server openssh-client
Generate public and private key pairs with the following command. The terminal will prompt for a file name; press ENTER and proceed. After that, copy the public key from id_rsa.pub to authorized_keys.
$ ssh-keygen -t rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Verify the passwordless SSH configuration with the command:
$ ssh localhost
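If ssh still asks for a password, the usual cause is overly permissive key file permissions; a quick fix, assuming the default ~/.ssh layout, is:
$ chmod 700 ~/.ssh
$ chmod 600 ~/.ssh/authorized_keys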
Install Hadoop and configure related xml files
Download and extract Hadoop 2.8.5 from the official Apache website.
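If you prefer the command line, the release can also be fetched with wget from the Apache archive (the URL below assumes the standard archive layout). Run the download and extraction as the hadoop user in its home directory, so that the extracted tree matches the HADOOP_HOME set below:
$ wget https://archive.apache.org/dist/hadoop/common/hadoop-2.8.5/hadoop-2.8.5.tar.gz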
$ tar -xzvf hadoop-2.8.5.tar.gz
Setting up the environment variables
Edit the .bashrc of the Hadoop user and set up the following Hadoop environment variables:
export HADOOP_HOME=/home/hadoop/hadoop-2.8.5
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
Source the .bashrc in the current login session.
$ source ~/.bashrc
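A quick check that the variables are live in the current shell:
$ echo $HADOOP_HOME
/home/hadoop/hadoop-2.8.5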
Edit the hadoop-env.sh file, which is in /etc/hadoop inside the Hadoop installation directory, and make the following changes (also review whether you want to change any other configuration there).
export JAVA_HOME=/opt/jdk1.8.0_192
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-"/home/hadoop/hadoop-2.8.5/etc/hadoop"}
Configuration Changes in core-site.xml file
Edit core-site.xml with vim or any editor of your choice. The file is under /etc/hadoop inside the hadoop home directory. Add the following entries:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hadoop/hadooptmpdata</value>
</property>
</configuration>
In addition, create the hadooptmpdata directory under the hadoop user's home folder.
$ mkdir hadooptmpdata
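Later, once the cluster is running, you can confirm that this setting was picked up using hdfs getconf, part of the standard HDFS CLI:
$ hdfs getconf -confKey fs.defaultFS
hdfs://localhost:9000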
Configuration Changes in hdfs-site.xml file
Edit hdfs-site.xml, which is present in the same location, i.e. /etc/hadoop inside the hadoop installation directory, and create the Namenode/Datanode directories under the hadoop user's home directory.
$ mkdir -p hdfs/namenode
$ mkdir -p hdfs/datanode
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>file:///home/hadoop/hdfs/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>file:///home/hadoop/hdfs/datanode</value>
</property>
</configuration>
Configuration Changes in mapred-site.xml file
Copy mapred-site.xml from mapred-site.xml.template using the cp command, and then edit the mapred-site.xml placed in /etc/hadoop under the hadoop installation directory with the following changes.
$ cp mapred-site.xml.template mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Configuration Changes in yarn-site.xml file
Edit yarn-site.xml with the following entries.
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
Starting the Hadoop Cluster
Format the Namenode before using it for the first time. As the HDFS user, run the command below to format the Namenode.
$ hdfs namenode -format
Once the Namenode has been formatted, start HDFS using the start-dfs.sh script. To start the YARN services, execute the YARN start script, i.e. start-yarn.sh.
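Both scripts live in $HADOOP_HOME/sbin, which the .bashrc changes above already put on the PATH, so they can be run directly:
$ start-dfs.sh
$ start-yarn.sh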
To verify that all the Hadoop services/daemons started successfully, you can use the jps command.
$ /opt/jdk1.8.0_192/bin/jps
20035 SecondaryNameNode
19782 DataNode
21671 Jps
20343 NodeManager
19625 NameNode
20187 ResourceManager
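With the daemons running, dfsadmin can print a quick health report of the filesystem, including live Datanodes and capacity:
$ hdfs dfsadmin -report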
To check the current Hadoop version, you can use the command below:
$ hadoop version
or
$ hdfs version
HDFS Command Line Interface
To access HDFS and create some directories on top of DFS, you can use the HDFS CLI.
$ hdfs dfs -mkdir /test
$ hdfs dfs -mkdir /hadooponubuntu
$ hdfs dfs -ls /
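As a further illustration (the file used here is just an example), you can copy a local file into one of the new directories and read it back:
$ hdfs dfs -put /etc/hosts /test/
$ hdfs dfs -cat /test/hosts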
Access the Namenode and YARN from Browser
You can access both the NameNode and YARN Resource Manager web UIs via any browser, such as Google Chrome or Mozilla Firefox.
Namenode Web UI – http://<hadoop cluster hostname/IP address>:50070
The YARN Resource Manager (RM) web interface will display all running jobs on current Hadoop Cluster.
Resource Manager Web UI – http://<hadoop cluster hostname/IP address>:8088
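If the machine is headless, a simple sanity check from the shell is to request each page with curl and confirm an HTTP 200 response (assuming the default ports above):
$ curl -sI http://localhost:50070 | head -n 1
$ curl -sI http://localhost:8088 | head -n 1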
Conclusion
The world is changing the way it operates, and Big Data is playing a major role in this phase. Hadoop is a framework that makes our life easy while working on large sets of data. There are improvements on all fronts. The future is exciting.