Apache Hadoop comprises multiple open source software packages that work together for distributed storage and distributed processing of big data. There are four main components to Hadoop:
- Hadoop Common – the various software libraries that Hadoop depends on to run
- Hadoop Distributed File System (HDFS) – a file system that allows for efficient distribution and storage of big data across a cluster of computers
- Hadoop MapReduce – used for processing the data
- Hadoop YARN – a resource management framework that handles the allocation of computing resources across the entire cluster
In this tutorial, we will go over the steps to install Hadoop version 3 on Ubuntu 20.04. This will involve installing HDFS (NameNode and DataNode), YARN, and MapReduce on a single node cluster configured in Pseudo Distributed Mode, which simulates a distributed deployment on a single machine. Each Hadoop component (HDFS, YARN, MapReduce) will run on our node as a separate Java process.
In this tutorial you will learn:
- How to add a user for the Hadoop environment
- How to install the Java prerequisite
- How to configure passwordless SSH
- How to install Hadoop and configure necessary related XML files
- How to start the Hadoop Cluster
- How to access NameNode and ResourceManager Web UI
| Category | Requirements, Conventions or Software Version Used |
| --- | --- |
| System | Installed Ubuntu 20.04 or upgraded Ubuntu 20.04 Focal Fossa |
| Software | Apache Hadoop, Java |
| Other | Privileged access to your Linux system as root or via the sudo command. |
| Conventions | # – requires given linux commands to be executed with root privileges either directly as a root user or by use of the sudo command; $ – requires given linux commands to be executed as a regular non-privileged user |
Create user for Hadoop environment
Hadoop should have its own dedicated user account on your system. To create one, open a terminal and type the following command. You’ll also be prompted to create a password for the account.
$ sudo adduser hadoop
Install the Java prerequisite
Hadoop is based on Java, so you’ll need to install it on your system before being able to use Hadoop. At the time of this writing, the current Hadoop version 3.1.3 requires Java 8, so that’s what we will be installing on our system.
Use the following two commands to fetch the latest package lists in apt and install Java 8:
$ sudo apt update
$ sudo apt install openjdk-8-jdk openjdk-8-jre
Configure passwordless SSH
Hadoop relies on SSH to access its nodes. It will connect to remote machines through SSH as well as your local machine if you have Hadoop running on it. So, even though we are only setting up Hadoop on our local machine in this tutorial, we still need to have SSH installed. We also have to configure passwordless SSH so that Hadoop can silently establish connections in the background.
- We’ll need both the OpenSSH server and OpenSSH client packages. Install them with this command:
$ sudo apt install openssh-server openssh-client
- Before continuing further, it’s best to be logged into the hadoop user account we created earlier. To change users in your current terminal, use the following command:
$ su hadoop
- With those packages installed, it’s time to generate public and private key pairs with the following command. Note that the terminal will prompt you several times, but all you’ll need to do is keep hitting ENTER to accept the default values:
$ ssh-keygen -t rsa
- Next, copy the newly generated public RSA key to the list of authorized SSH keys:
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
- You can make sure that the configuration was successful by SSHing into localhost. If you are able to do it without being prompted for a password, you’re good to go.
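For example:
$ ssh localhost
If the login succeeds without a password prompt, type exit to return to your previous session.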
Install Hadoop and configure related XML files
Head over to Apache’s website to download Hadoop. You may also use this command if you want to download the Hadoop version 3.1.3 binary directly:
$ wget https://downloads.apache.org/hadoop/common/hadoop-3.1.3/hadoop-3.1.3.tar.gz
Extract the download to the hadoop user’s home directory with this command:
$ tar -xzvf hadoop-3.1.3.tar.gz -C /home/hadoop
Setting up the environment variables
The following export commands will configure the required Hadoop environment variables on our system. You can copy and paste all of these to the end of your ~/.bashrc file (you may need to change the first line if you have a different version of Hadoop):
export HADOOP_HOME=/home/hadoop/hadoop-3.1.3
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
Then, apply the environment variables to the current login session by sourcing the .bashrc file:
$ source ~/.bashrc
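A quick way to confirm that the variables took effect is to echo one of them:
$ echo $HADOOP_HOME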
Next, we will make some changes to the hadoop-env.sh file, which can be found in the Hadoop installation directory under /etc/hadoop. Use nano or your favorite text editor to open it:
$ nano ~/hadoop-3.1.3/etc/hadoop/hadoop-env.sh
Change the JAVA_HOME variable to where Java is installed. On our system (and probably yours too, if you are running Ubuntu 20.04 and have followed along with us so far), we change that line to:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
That will be the only change we need to make here. You can save your changes to the file and close it.
Configuration changes in core-site.xml file
The next change we need to make is inside the core-site.xml file. Open it with this command:
$ nano ~/hadoop-3.1.3/etc/hadoop/core-site.xml
Enter the following configuration, which instructs HDFS to run on localhost port 9000 and sets up a directory for temporary data.
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hadoop/hadooptmpdata</value>
</property>
</configuration>
Save your changes and close this file. Then, create the directory in which temporary data will be stored:
$ mkdir ~/hadooptmpdata
Configuration changes in hdfs-site.xml file
Create two new directories for Hadoop to store the NameNode and DataNode information.
$ mkdir -p ~/hdfs/namenode ~/hdfs/datanode
Then, edit the following file to tell Hadoop where to find those directories:
$ nano ~/hadoop-3.1.3/etc/hadoop/hdfs-site.xml
Make the following changes to the hdfs-site.xml file, before saving and closing it:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>file:///home/hadoop/hdfs/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>file:///home/hadoop/hdfs/datanode</value>
</property>
</configuration>
Configuration changes in mapred-site.xml file
Open the MapReduce XML configuration file with the following command:
$ nano ~/hadoop-3.1.3/etc/hadoop/mapred-site.xml
And make the following changes before saving and closing the file:
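For a single node cluster like ours, the minimal configuration typically just tells MapReduce to use YARN as its execution framework:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>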
Configuration changes in yarn-site.xml file
Open the YARN configuration file with the following command:
$ nano ~/hadoop-3.1.3/etc/hadoop/yarn-site.xml
Add the following entries in this file, before saving the changes and closing it:
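For a single node cluster, the key entry is typically the auxiliary shuffle service that MapReduce jobs require from the NodeManager:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>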
Starting the Hadoop cluster
Before using the cluster for the first time, we need to format the namenode. You can do that with the following command:
$ hdfs namenode -format
Your terminal will spit out a lot of information. As long as you don’t see any error messages, you can assume it worked.
Next, start HDFS by using the start-dfs.sh script:
$ start-dfs.sh
Now, start the YARN services via the start-yarn.sh script:
$ start-yarn.sh
To verify that all the Hadoop services/daemons started successfully, you can use the jps command. This will show all the Java processes currently running on your system.
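$ jps
On a healthy pseudo-distributed node, the output should typically include NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager, and Jps itself (process IDs will vary).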
Now we can check the current Hadoop version with either of the following commands:
$ hadoop version
$ hdfs version
HDFS Command Line Interface
The HDFS command line is used to access HDFS and to create directories or issue other commands to manipulate files and directories. Use the following command syntax to create some directories and list them:
$ hdfs dfs -mkdir /test
$ hdfs dfs -mkdir /hadooponubuntu
$ hdfs dfs -ls /
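You can also copy files into HDFS and read them back. For example (the file name here is just for illustration):
$ echo "hello hadoop" > ~/example.txt
$ hdfs dfs -put ~/example.txt /test
$ hdfs dfs -cat /test/example.txt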
Access the NameNode and YARN from the browser
You can access both the Web UI for NameNode and YARN Resource Manager via any browser of your choice, such as Mozilla Firefox or Google Chrome.
For the NameNode Web UI, navigate to http://localhost:9870 (the default NameNode HTTP port in Hadoop 3).
To access the YARN Resource Manager web interface, which will display all currently running jobs on the Hadoop cluster, navigate to http://localhost:8088.
In this article, we saw how to install Hadoop on a single node cluster in Ubuntu 20.04 Focal Fossa. Hadoop provides a powerful solution for dealing with big data, enabling us to utilize clusters for the storage and processing of our data. It makes life easier when working with large data sets, thanks to its flexible configuration and convenient web interfaces.