Ubuntu 20.04 Hadoop

Apache Hadoop is made up of multiple open source software packages that work together for distributed storage and distributed processing of big data. There are four main components to Hadoop:

  • Hadoop Common – the various software libraries that Hadoop depends on to run
  • Hadoop Distributed File System (HDFS) – a file system that allows for efficient distribution and storage of big data across a cluster of computers
  • Hadoop MapReduce – the framework used for parallel processing of the data
  • Hadoop YARN – the resource management layer that allocates computing resources across the entire cluster

In this tutorial, we will go over the steps to install Hadoop version 3 on Ubuntu 20.04. This will involve installing HDFS (Namenode and Datanode), YARN, and MapReduce on a single node cluster configured in Pseudo Distributed Mode, which simulates a distributed deployment on a single machine. Each component of Hadoop (HDFS, YARN, MapReduce) will run on our node as a separate Java process.

In this tutorial you will learn:

  • How to add a user for the Hadoop environment
  • How to install the Java prerequisite
  • How to configure passwordless SSH
  • How to install Hadoop and configure necessary related XML files
  • How to start the Hadoop Cluster
  • How to access NameNode and ResourceManager Web UI

Apache Hadoop on Ubuntu 20.04 Focal Fossa
Software Requirements and Linux Command Line Conventions

Category      Requirements, Conventions or Software Version Used
System        Installed or upgraded Ubuntu 20.04 Focal Fossa
Software      Apache Hadoop, Java
Other         Privileged access to your Linux system as root or via the sudo command.
Conventions   # – requires given Linux commands to be executed with root privileges, either directly as the root user or by use of the sudo command
              $ – requires given Linux commands to be executed as a regular non-privileged user

Create user for Hadoop environment



Hadoop should have its own dedicated user account on your system. To create one, open a terminal and type the following command. You’ll also be prompted to create a password for the account.

$ sudo adduser hadoop
Create new Hadoop user
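
If you would like to double check that the account exists before moving on, the id command will show its user and group IDs:

$ id hadoop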

Install the Java prerequisite

Hadoop is written in Java, so you’ll need Java installed on your system before you can use Hadoop. At the time of this writing, the current Hadoop version 3.1.3 requires Java 8, so that’s what we will be installing on our system.

Use the following two commands to fetch the latest package lists in apt and install Java 8:

$ sudo apt update
$ sudo apt install openjdk-8-jdk openjdk-8-jre
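
Once the installation finishes, you can optionally confirm that Java 8 is now available by checking the reported version:

$ java -version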

Configure passwordless SSH



Hadoop relies on SSH to access its nodes. It will connect to remote machines through SSH, as well as to your local machine if you have Hadoop running on it. So, even though we are only setting up Hadoop on our local machine in this tutorial, we still need to have SSH installed. We also have to configure passwordless SSH so that Hadoop can silently establish connections in the background.

  1. We’ll need both the OpenSSH Server and OpenSSH Client package. Install them with this command:
    $ sudo apt install openssh-server openssh-client
    
  2. Before continuing further, it’s best to be logged into the hadoop user account we created earlier. To change users in your current terminal, use the following command:
    $ su hadoop
    
  3. With those packages installed, it’s time to generate public and private key pairs with the following command. Note that the terminal will prompt you several times, but all you’ll need to do is keep hitting ENTER to proceed.
    $ ssh-keygen -t rsa
    
    Generating RSA keys for passwordless SSH
  4. Next, copy the newly generated RSA key in id_rsa.pub over to authorized_keys:
    $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    


  5. You can make sure that the configuration was successful by SSHing into localhost, as shown in the example after this list. If you are able to log in without being prompted for a password, you’re good to go.
    SSHing into the system without being prompted for password means it worked
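
To run that test yourself, SSH into localhost; you should get a shell without any password prompt, and exit returns you to your previous session. If you are still asked for a password, overly open permissions on the key files are a common cause, and the optional chmod commands below tighten them:

$ ssh localhost
$ exit
$ chmod 700 ~/.ssh
$ chmod 600 ~/.ssh/authorized_keys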

Install Hadoop and configure related XML files

Head over to Apache’s website to download Hadoop. You may also use this command if you want to download the Hadoop version 3.1.3 binary directly:

$ wget https://downloads.apache.org/hadoop/common/hadoop-3.1.3/hadoop-3.1.3.tar.gz

Extract the download to the hadoop user’s home directory with this command:

$ tar -xzvf hadoop-3.1.3.tar.gz -C /home/hadoop
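
You can optionally confirm that the archive unpacked where we expect by listing the new directory:

$ ls /home/hadoop/hadoop-3.1.3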

Setting up the environment variables

The following export commands configure the required Hadoop environment variables on our system. Append all of these lines to the end of the hadoop user’s ~/.bashrc file (you may need to change the first line if you have a different version of Hadoop):

export HADOOP_HOME=/home/hadoop/hadoop-3.1.3
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"

Then, source the .bashrc file to apply the changes in the current login session:

$ source ~/.bashrc
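
To quickly verify that the variables are active, echo one of them and check that the Hadoop binaries are now on your PATH:

$ echo $HADOOP_HOME
$ which hadoop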

Next, we will make some changes to the hadoop-env.sh file, which can be found in the etc/hadoop subdirectory of the Hadoop installation. Use nano or your favorite text editor to open it:

$ nano ~/hadoop-3.1.3/etc/hadoop/hadoop-env.sh


Change the JAVA_HOME variable to where Java is installed. On our system (and probably yours too, if you are running Ubuntu 20.04 and have followed along with us so far), we change that line to:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
Change the JAVA_HOME environment variable

That will be the only change we need to make in here. You can save your changes to the file and close it.
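
If Java ended up somewhere else on your system, the installed JDK directories on Ubuntu normally live under /usr/lib/jvm, so listing that directory shows which path to use for JAVA_HOME:

$ ls /usr/lib/jvm/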

Configuration changes in core-site.xml file

The next change we need to make is inside the core-site.xml file. Open it with this command:

$ nano ~/hadoop-3.1.3/etc/hadoop/core-site.xml

Enter the following configuration, which instructs HDFS to run on localhost port 9000 and sets up a directory for temporary data.



<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/hadoop/hadooptmpdata</value>
    </property>
</configuration>

core-site.xml configuration file changes


Save your changes and close this file. Then, create the directory in which temporary data will be stored:

$ mkdir ~/hadooptmpdata

Configuration changes in hdfs-site.xml file

Create two new directories for Hadoop to store the Namenode and Datanode information.

$ mkdir -p ~/hdfs/namenode ~/hdfs/datanode

Then, edit the following file to tell Hadoop where to find those directories:

$ nano ~/hadoop-3.1.3/etc/hadoop/hdfs-site.xml

Make the following changes to the hdfs-site.xml file, before saving and closing it:



<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.name.dir</name>
        <value>file:///home/hadoop/hdfs/namenode</value>
    </property>
    <property>
        <name>dfs.data.dir</name>
        <value>file:///home/hadoop/hdfs/datanode</value>
    </property>
</configuration>

hdfs-site.xml configuration file changes

Configuration changes in mapred-site.xml file

Open the MapReduce XML configuration file with the following command:

$ nano ~/hadoop-3.1.3/etc/hadoop/mapred-site.xml

And make the following changes before saving and closing the file:



<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>



mapred-site.xml configuration file changes

Configuration changes in yarn-site.xml file

Open the YARN configuration file with the following command:

$ nano ~/hadoop-3.1.3/etc/hadoop/yarn-site.xml

Add the following entries in this file, before saving the changes and closing it:



<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>

yarn-site.xml configuration file changes

Starting the Hadoop cluster

Before using the cluster for the first time, we need to format the namenode. You can do that with the following command:

$ hdfs namenode -format
Formatting the HDFS NameNode


Your terminal will spit out a lot of information. As long as you don’t see any error messages, you can assume it worked.

Next, start HDFS by using the start-dfs.sh script:

$ start-dfs.sh
Run the start-dfs.sh script

Now, start the YARN services via the start-yarn.sh script:

$ start-yarn.sh
Run the start-yarn.sh script

To verify that all the Hadoop services/daemons started successfully, you can use the jps command, which lists the Java processes currently running for your user.

$ jps


Execute jps to see all Java dependent processes and verify Hadoop components are running
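
If everything started correctly, the listing should include the five Hadoop daemons alongside jps itself. The process IDs below are placeholders and will differ on your system:

12005 NameNode
12182 DataNode
12407 SecondaryNameNode
12781 ResourceManager
12930 NodeManager
13174 Jps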

Now we can check the current Hadoop version with either of the following commands:

$ hadoop version

or

$ hdfs version
Verifying Hadoop installation and current version
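
When you are finished working with the cluster, the matching stop scripts shut the daemons down again:

$ stop-yarn.sh
$ stop-dfs.sh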

HDFS Command Line Interface

The HDFS command line interface is used to access HDFS, create directories, and issue other commands to manipulate files and directories. Use the following command syntax to create some directories and list them:

$ hdfs dfs -mkdir /test
$ hdfs dfs -mkdir /hadooponubuntu
$ hdfs dfs -ls /
Interacting with the HDFS command line
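
As a further optional check, you can copy a small local file into one of the new directories and read it back; /etc/hostname is used here simply as a convenient test file:

$ hdfs dfs -put /etc/hostname /test
$ hdfs dfs -ls /test
$ hdfs dfs -cat /test/hostname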

Access the Namenode and YARN from the browser



You can access both the Web UI for NameNode and YARN Resource Manager via any browser of your choice, such as Mozilla Firefox or Google Chrome.

For the NameNode Web UI, navigate to http://HADOOP-HOSTNAME-OR-IP:9870 (Hadoop 3 moved the NameNode web interface from the old port 50070 to 9870)

NameNode web interface for Hadoop

To access the YARN Resource Manager web interface, which will display all currently running jobs on the Hadoop cluster, navigate to http://HADOOP-HOSTNAME-OR-IP:8088

YARN Resource Manager web interface for Hadoop
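
If you would like to see a job actually show up in the Resource Manager interface, you can optionally submit one of the example jobs bundled with Hadoop. The jar path below assumes the version 3.1.3 layout used throughout this tutorial:

$ yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar pi 2 4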

Conclusion

In this article, we saw how to install Hadoop on a single node cluster in Ubuntu 20.04 Focal Fossa. Hadoop provides a practical solution for dealing with big data, enabling us to use clusters for the storage and processing of our data. Its flexible configuration and convenient web interfaces make life easier when working with large sets of data.