Hadoop for DBAs (3/13): A Simple 3-Node Hadoop Cluster

After “Building Hadoop for Oracle Linux 7 (2/13)”, the natural next step is to deploy Hadoop on Oracle Linux 7. This third article of the series shows how to create a 3-node Hadoop cluster. Once the cluster is up and running, it digs into the processes, logs and consoles. The last step is to execute a sample MapReduce job.

The setup is very basic. Once the 3 servers are installed, it should not take more than a few minutes to deploy, configure and start Hadoop. The final infrastructure does not implement any of the security features one would configure on most production environments; in particular, it keeps the more advanced authentication and authorization mechanisms for later…

Architecture

You’ll find a diagram of the final configuration below. The 3 servers are named yellow, pink and green. If you want to test it for yourself, you can use 3 VirtualBox VMs, each configured with 2GB of RAM and Oracle Linux 7.

Hadoop Cluster

To speed up the configuration, the cluster does not integrate with any Kerberos domain controller. There is no connection key and no data encryption. SELinux is disabled. All the network ports are open. Service high availability and backups are not used. It is not properly sized for memory, CPU or network. The underlying storage technology is not even mentioned. This is probably the simplest Hadoop cluster you will ever meet: not suitable for production at all!

System Prerequisites

The system and network infrastructure require a few configuration steps before you proceed with Hadoop:

  • Install the 3 servers with Oracle Linux 7. Make sure you install Java, openssh and rsync
  • Deactivate SELinux and open the firewall
  • IP connectivity should be operational as well as name resolution
  • A system user should be present to install Hadoop as a non-root user. This article assumes it is named hadoop. Create an SSH equivalency between all the servers for that user
  • A file system must be created to hold the Hadoop Distributed File System (HDFS)

Sections below provide some details about these configurations.

Yellow, pink and green

Install Oracle Linux 7 on the 3 servers. You should install the required RPMs including openssh, rsync and Java SE 8 as described in the previous article.

SELinux and Firewalld

Change the firewall from the public zone to the trusted zone and open that zone for any traffic:
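With firewalld running, the commands could look like the following, run as root on each server; the eth0 interface name is an assumption, adjust it to your VMs:

# Move the network interface to the trusted zone, which accepts all traffic
# (eth0 is an assumed interface name)
firewall-cmd --permanent --zone=trusted --change-interface=eth0
firewall-cmd --reload
firewall-cmd --get-active-zones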

Edit /etc/selinux/config and change its content to disable SELinux. Once done, reboot the server:
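A minimal sketch of the change, to be applied as root on yellow, pink and green:

# Disable SELinux in /etc/selinux/config, then reboot for it to take effect
sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config
reboot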

IP Configuration and Name Resolution

Configure the network on the 3 servers and register all the IP Addresses and names in /etc/hosts for every machine. If you prefer to use a DNS server, make sure names can be resolved from everywhere.
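For example, /etc/hosts could contain the following on all 3 machines; the 192.168.56.x addresses are only an illustration:

# Assumed host-only network addresses; use your own
192.168.56.41   yellow
192.168.56.42   pink
192.168.56.43   green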

The Hadoop user

Usually, separate users are created for each installed component, i.e. one for HDFS and one for YARN. To keep it simple, you can install everything as a single hadoop user:
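For instance, as root on each of the 3 servers:

# Create the hadoop group and user
groupadd hadoop
useradd -g hadoop -m hadoop
passwd hadoop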

SSH equivalencies

The script below shows how to create SSH equivalencies for the hadoop user. Run it from every server to create the SSH key pairs and authorize connections on the remote hosts:
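A possible version of that script, run as hadoop on each server; it assumes ssh-copy-id is available and that the hadoop user has the same password everywhere:

# Generate a password-less key pair for the hadoop user
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
# Authorize the key on every host, including the local one
for host in yellow pink green; do
  ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@$host
done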

Test equivalencies; the script below tests the 9 SSH combinations. It should not prompt for any password:
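One way to do it from a single server is the nested loop below; it should print 9 host names without any prompt:

# Test every source/target combination for the hadoop user
for src in yellow pink green; do
  for dst in yellow pink green; do
    ssh hadoop@$src "ssh -o StrictHostKeyChecking=no hadoop@$dst hostname"
  done
done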

Hadoop Installation and Configuration

Installing and configuring Hadoop is straightforward:

  • Untar the distribution,
  • Set a few properties including server names, local storage for HDFS, Name Node address and Resource Manager host name.

Hadoop Installation

Untar Hadoop in /usr/local/hadoop. Before you proceed, make sure you’ve copied the distribution as /home/hadoop/hadoop-2.4.1.tar.gz on every server:
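For example, as root on every server:

# Extract the distribution into /usr/local/hadoop and hand it over to hadoop
mkdir -p /usr/local/hadoop
tar -xzf /home/hadoop/hadoop-2.4.1.tar.gz -C /usr/local/hadoop
chown -R hadoop:hadoop /usr/local/hadoop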

HDFS underlying storage

Unlike regular filesystems, HDFS itself relies on underlying local filesystems. With Oracle Linux 7, you may want to use XFS. However, to keep it simple, this configuration only relies on 2 directories in the root filesystem, one for the data and one for the metadata. On yellow, the server holding the HDFS Name Node, run:
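The paths below are only an assumption, /hdfs/data for the Data Node blocks and /hdfs/namenode for the Name Node metadata; as root on yellow:

# Assumed locations for HDFS data and metadata
mkdir -p /hdfs/data /hdfs/namenode
chown -R hadoop:hadoop /hdfs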

On pink and green, you won’t need the metadata directory for now:
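With the same assumed path, as root on pink and green:

# Data directory only on the 2 other servers
mkdir -p /hdfs/data
chown -R hadoop:hadoop /hdfs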

Cluster Properties

The number of Hadoop parameters is quite significant. This configuration sets the very few that are mandatory for this simple cluster:

  • HDFS underlying directories
  • The Name Node address and port
  • The Resource Manager server name
  • The list of slave servers

Note:
You can separate the Name Node and the Resource Manager from the other slave servers. However, in order to have a large-enough configuration, Data Nodes and Node Managers run on every server.

By default, the configuration directory is located in etc/hadoop under the Hadoop home. Most files are XML files and the few changes that are needed are described below.

HDFS directories

Local directory names for HDFS data and metadata are stored in hdfs-site.xml. dfs.datanode.data.dir and dfs.namenode.name.dir are the properties to change for your configuration. Below is an example of that file for the sample cluster:
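A minimal version of hdfs-site.xml, reusing the /hdfs/data and /hdfs/namenode directories assumed earlier, could look like this:

<?xml version="1.0"?>
<configuration>
  <!-- Local directories for HDFS blocks and Name Node metadata
       (the /hdfs/* paths are the assumed ones created earlier) -->
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///hdfs/data</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///hdfs/namenode</value>
  </property>
</configuration>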

Name Node Access Point

Name Node address and port are defined by the fs.defaultFS property in core-site.xml. Below is an example of the file for the sample cluster:
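A minimal core-site.xml pointing at yellow could look like the following; the 8020 port is an assumption, any free port would do:

<?xml version="1.0"?>
<configuration>
  <!-- Name Node access point: yellow, port 8020 (assumed) -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://yellow:8020</value>
  </property>
</configuration>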

Resource Manager Server

The Resource Manager host name is defined by the yarn.resourcemanager.hostname property in yarn-site.xml. Below is an example of the file for the sample cluster:
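For this sample cluster, where pink hosts the Resource Manager, yarn-site.xml could simply contain:

<?xml version="1.0"?>
<configuration>
  <!-- pink hosts the YARN Resource Manager -->
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>pink</value>
  </property>
</configuration>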

Slave servers

Slave names are kept in the slaves file. This file is not mandatory but allows you to start all the slaves with a single command from the master node. In this sample cluster, the slaves file contains:
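Since Data Nodes and Node Managers run on every server, the file simply lists the 3 machines:

yellow
pink
green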

Configuration File Copies

Once the settings are stored on one of the nodes, push them to the other ones. Assuming yellow was used to change the files, copy them to the 2 other nodes:
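For instance, from yellow, assuming the installation path used earlier:

# Push the configuration directory to the 2 other nodes
for host in pink green; do
  scp /usr/local/hadoop/hadoop-2.4.1/etc/hadoop/* \
      hadoop@$host:/usr/local/hadoop/hadoop-2.4.1/etc/hadoop/
done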

Note:
This is not the only way to perform the replication; one can also rely on the HADOOP_MASTER variable and rsync to synchronize the configuration at startup time.

Formatting HDFS

Last, format the HDFS filesystem from the Name Node, yellow in this case:
bin/hdfs namenode -format color

HDFS and YARN startup

You can start the HDFS Name Node and the Data Nodes from the Name Node. Run the 3 lines below from yellow:
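A sketch of those commands, assuming the installation path used earlier; hadoop-daemon.sh starts the Name Node locally while hadoop-daemons.sh starts a Data Node on every server listed in the slaves file:

cd /usr/local/hadoop/hadoop-2.4.1
sbin/hadoop-daemon.sh --script hdfs start namenode
sbin/hadoop-daemons.sh --script hdfs start datanode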

You can start the YARN Resource Manager, the Job History server and all the Node Managers from the Resource Manager node. Run the 4 commands below from pink:
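Again a sketch, assuming the same installation path; yarn-daemons.sh starts a Node Manager on every slave and mr-jobhistory-daemon.sh starts the Job History server:

cd /usr/local/hadoop/hadoop-2.4.1
sbin/yarn-daemon.sh start resourcemanager
sbin/yarn-daemons.sh start nodemanager
sbin/mr-jobhistory-daemon.sh start historyserver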

Hadoop Post-Installation Checks

A first level of checks consists of making sure the Hadoop Java processes are running as expected:
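One way to do it, assuming jps is in the hadoop user's PATH on every server:

# Expect the NameNode on yellow, the ResourceManager and JobHistoryServer on
# pink, plus a DataNode and a NodeManager on each of the 3 servers
for host in yellow pink green; do
  echo "--- $host ---"
  ssh hadoop@$host jps
done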

To go deeper, check the log files in the log directories and connect to the web consoles. By default, the Name Node console listens on port 50070, here on yellow, and its “Data Nodes” menu presents the running Data Nodes. Every Data Node also has a console, listening on port 50075. The YARN Resource Manager console listens on port 8088 by default and its “Nodes” menu shows the Node Managers that are running. The Job History console can be accessed on port 19888, and the Node Manager consoles listen on port 8042.

Sample MapReduce Job

To check the cluster, execute a MapReduce job. To begin, create a /user/hadoop directory in HDFS:
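From the Hadoop home on any of the servers, as hadoop:

bin/hdfs dfs -mkdir -p /user/hadoop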

Then copy the configuration directory into an HDFS input directory. That directory is, by default, stored under /user/hadoop:
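Still from the Hadoop home, the relative input path below ends up under /user/hadoop:

bin/hdfs dfs -put etc/hadoop input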

Run the “grep” example that comes with the Hadoop distribution. That program is a MapReduce job that searches a directory for a regular expression and counts the number of matching expressions. The result is stored in another HDFS directory:
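The regular expression below is the one used in the Hadoop documentation; feel free to change it:

bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.1.jar \
    grep input output 'dfs[a-z.]+'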

Check that the output directory has been created and that the results are stored in the part-r-00000 file:
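For example:

bin/hdfs dfs -ls output
bin/hdfs dfs -cat output/part-r-00000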

To finish with the demo, delete all HDFS directories:
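For instance:

bin/hdfs dfs -rm -r input output
bin/hdfs dfs -rm -r /user/hadoop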

HDFS and YARN shutdown

To stop YARN, stop the Resource Manager and the Node Managers on all the slave servers. You only need to connect to Pink as Hadoop:
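A sketch of the shutdown sequence, run from the Hadoop home on pink, including the Job History server started earlier:

sbin/mr-jobhistory-daemon.sh stop historyserver
sbin/yarn-daemons.sh stop nodemanager
sbin/yarn-daemon.sh stop resourcemanager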

To stop HDFS, stop the Name Node and the Data Nodes on all the slave servers. You only need to connect to Yellow as Hadoop:
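And from the Hadoop home on yellow:

sbin/hadoop-daemons.sh --script hdfs stop datanode
sbin/hadoop-daemon.sh --script hdfs stop namenode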

In this article, you’ve installed, configured and tested a 3-node Hadoop cluster. The next articles will explore the different components, from HDFS to Spark…

References:

To learn more about Hadoop installation, refer to
[1] Hadoop MapReduce Next Generation – Cluster Setup

Gregory Guillou
