Apache Hadoop is composed of multiple open source software packages that work together to provide distributed storage and distributed processing of big data. Hadoop has four main components:
- Hadoop Common – the various software libraries that Hadoop depends on to run
- Hadoop Distributed File System (HDFS) – a file system that allows for efficient distribution and storage of big data across a cluster of computers
- Hadoop MapReduce – the programming model and processing engine used to process the data in parallel
- Hadoop YARN – a resource management framework that schedules jobs and allocates computing resources across the entire cluster
In this tutorial, we will go over the steps to install Hadoop version 3 on Ubuntu 20.04. This will involve installing HDFS (NameNode and DataNode), YARN, and MapReduce on a single node cluster configured in Pseudo Distributed Mode, which simulates a distributed environment on a single machine. Each component of Hadoop (HDFS, YARN, MapReduce) will run on our node as a separate Java process.
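As a rough preview of what this looks like in practice: once Hadoop is installed and the daemons are started (covered later in this tutorial), the `jps` utility that ships with the JDK lists each component running as its own Java process. The process names and PIDs below are illustrative only; the exact set of daemons depends on your configuration.

```bash
# List the running Hadoop daemons - each one is a separate JVM on the single node
$ jps
4021 NameNode
4187 DataNode
4409 SecondaryNameNode
4702 ResourceManager
4892 NodeManager
5133 Jps
```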
In this tutorial you will learn:
- How to add users for the Hadoop environment
- How to install the Java prerequisite
- How to configure passwordless SSH
- How to install Hadoop and configure necessary related XML files
- How to start the Hadoop Cluster
- How to access NameNode and ResourceManager Web UI
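Regarding that last point, the quick check below sketches one way to confirm the web interfaces are reachable from the command line. It assumes the default Hadoop 3 ports (9870 for the NameNode UI and 8088 for the ResourceManager UI); if you change these in your configuration, adjust the URLs accordingly.

```bash
# Print the HTTP status code returned by each web UI (200 means it is up)
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:9870   # NameNode web UI
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8088   # ResourceManager web UI
```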