Hadoop for DBAs (4/13): Introduction to HDFS


HDFS is the Hadoop Distributed File System; it provides High Availability through replication and automatic failover. This 4th article of the Hadoop for DBAs series digs into HDFS concepts. It is probably the most fun part for DBAs to learn, and also the closest to what you are used to. There are dozens of features to be interested in and learn; this article focuses on a few of the concepts and on what happens under the hood. It also demonstrates HDFS High Availability for DataNodes and the NameNode. Its 3 sections are independent and demonstrate:

  • A First Look at HDFS Underlying Structures;
  • HDFS Replication and DataNode Failure;
  • Quorum Journal Manager and NameNode Failure.

HDFS is a file system and, in some ways, something you are already very used to. It provides parameters to bypass caches or to force data into caches. You can influence block distribution and geo-distribution, as well as the level of replication. It is built to feed processes running in parallel. It can be accessed in many different ways, including the CLI, Java, C or a REST API. It provides advanced features like snapshots and rolling upgrades. You can move data dynamically and also extend or shrink your cluster. It is impressive!

On the other hand, you probably have never worked with anything quite like HDFS. It could be because there are not that many distributed file systems anyway. Plus, it is written in Java to store blocks of files across thousands of servers. It is built for large files written by one process at a time. It still only supports appending data to existing files, and it relies on the local file system of every node. So, it is not something you would have needed anyway, unless you want to deal with at least a few terabytes of data and don’t want to invest in proprietary hardware and software. But enough of the overview, let us look inside… And before you go ahead, make sure you have installed and started HDFS as described in Hadoop for DBAs (3/13): A Simple 3-Node Hadoop Cluster.

A First Look at HDFS Underlying Structures

As you’ve figured out by installing and starting it, HDFS relies on 2 main components:

  • The NameNode manages metadata, including directories and files, file-to-block mappings, ongoing and some past operations, as well as the status of DataNodes
  • DataNodes store, as you can expect, data blocks, i.e. chunks of files

The number of operations performed by the NameNode is kept minimal so that HDFS can scale out without making the infrastructure more complex by adding namespaces. As a result, clients read and write data directly from and to DataNodes. The diagram below shows how a client interacts with HDFS when storing a file:

  1. The client coordinates with the NameNode to get and store metadata. Once done, it sends the file blocks to a DataNode
  2. The DataNode replicates the data to other DataNodes, which send feedback to the NameNode so that it can keep its metadata up to date

[Diagram: hdfs put command]
To check the details of those operations, clean up your HDFS file system and restart the NameNode; you will then find an fsimage file in the metadata directory. From the yellow server, run:
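
A minimal sketch of those steps, assuming the layout from part 3 of this series: commands run from the Hadoop installation directory on yellow, and the metadata directory is wherever dfs.namenode.name.dir points.

    # Remove everything stored in HDFS so far
    hdfs dfs -rm -r -skipTrash '/*'
    # Restart the NameNode; on startup it merges the edits into a new fsimage file
    sbin/hadoop-daemon.sh stop namenode
    sbin/hadoop-daemon.sh start namenode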

Once the NameNode has restarted, you should find an fsimage file that would be used in case of a crash. That file contains an image of the HDFS files and directories. You can use the Offline Image Viewer, or oiv, to check its content:
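
For instance, assuming /path/to/metadata stands for the metadata directory chosen in part 3 (adapt the path to your own setup):

    ls /path/to/metadata/current/            # look for files named fsimage_<txid>
    # Pick the most recent fsimage file (not its .md5 companion) and decode it to XML
    IMG=$(ls /path/to/metadata/current/fsimage_* | grep -v '\.md5$' | tail -1)
    hdfs oiv -p XML -i "$IMG" -o /tmp/fsimage.xml
    less /tmp/fsimage.xml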

To figure out what will happen when you put a file in HDFS, run hdfs dfs -put from pink:
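
A minimal sketch; the sample file, its size and the target path are only examples:

    # Create a ~200 MB sample file so that it spans more than one HDFS block
    dd if=/dev/urandom of=/tmp/sample.bin bs=1M count=200
    # Copy it into HDFS and list the result
    hdfs dfs -put /tmp/sample.bin /sample.bin
    hdfs dfs -ls /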

Once the file is in HDFS, you can check how it is stored with hdfs fsck:
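
For example, with the sample file used above:

    # Show the blocks of the file, their size and the DataNodes holding each replica
    hdfs fsck /sample.bin -files -blocks -locations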

Stop the NameNode again and check the content of the metadata directory. The edits files keep track of the operations managed by the NameNode. You can use the Offline Edits Viewer to understand what is managed by the NameNode and how/when operations are performed on DataNodes:
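
A sketch of those steps, with the same assumption about the metadata directory path as before:

    sbin/hadoop-daemon.sh stop namenode
    ls /path/to/metadata/current/            # look for files named edits_<firsttxid>-<lasttxid>
    # Decode one of the finalized edits files to XML with the Offline Edits Viewer
    EDITS=$(ls /path/to/metadata/current/edits_* | grep -v inprogress | tail -1)
    hdfs oev -p xml -i "$EDITS" -o /tmp/edits.xml
    less /tmp/edits.xml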

As you can guess from the output above, the default block size is 128 MB; only the last block of the file is smaller. If you look for those blocks in the underlying file system, you should find them named “blk_<blockid>” like below:
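
For instance, on any of the DataNodes, with /path/to/data standing for whatever dfs.datanode.data.dir points to:

    # Block files and their .meta checksum companions live under the DataNode data directory
    find /path/to/data -name 'blk_*' -ls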

HDFS Replication and DataNode Loss

As you can see above, the default replication factor for every file block is 3. Because we have a 3-node cluster, there is no spare node to re-replicate to, so that setting would not help to demonstrate that blocks are automatically re-replicated in the event of a server crash. To begin this section, change the level of replication to 2, as sketched after the list below:

  • Stop HDFS NameNode and DataNodes
  • Add a dfs.replication property in etc/hadoop/hdfs-site.xml and set it to 2
  • Remove the content of the data and metadata directories
  • Format the new file system
  • Start the HDFS NameNode and DataNodes again
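
A consolidated sketch of those five steps; the metadata and data directory paths, as before, stand for the ones chosen in part 3:

    # 1. Stop the daemons (NameNode on yellow, DataNode on every node)
    sbin/hadoop-daemon.sh stop namenode
    sbin/hadoop-daemon.sh stop datanode

    # 2. In etc/hadoop/hdfs-site.xml on every node, add inside <configuration>:
    #      <property>
    #        <name>dfs.replication</name>
    #        <value>2</value>
    #      </property>

    # 3. Remove the content of the data and metadata directories on every node
    rm -rf /path/to/metadata/* /path/to/data/*

    # 4. Format the new file system (on yellow)
    hdfs namenode -format

    # 5. Start the daemons again (NameNode on yellow, DataNode on every node)
    sbin/hadoop-daemon.sh start namenode
    sbin/hadoop-daemon.sh start datanode

    # Put the sample file back and check the number of replicas per block
    hdfs dfs -put /tmp/sample.bin /sample.bin
    hdfs fsck /sample.bin -files -blocks -locations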

As you can see above, there are only 2 copies of each block. Now stop one DataNode that contains some blocks; in this example, we will stop yellow (192.168.56.3):
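
For example:

    # Check which DataNodes hold replicas, then stop the DataNode on yellow (192.168.56.3)
    hdfs fsck /sample.bin -files -blocks -locations
    sbin/hadoop-daemon.sh stop datanode      # run on yellow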

As you can see from hdfs dfsadmin -report, the failed DataNode is not marked as dead right away. By default, it takes 10 minutes and 30 seconds to be detected, i.e. 2 × dfs.namenode.heartbeat.recheck-interval (5 minutes) plus 10 × dfs.heartbeat.interval (3 seconds):
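
Run the report from any node:

    # The stopped DataNode is still listed as live; watch its "Last contact" timestamp age
    hdfs dfsadmin -report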

During that period of time, if you try to get some files, the client might report some exceptions as you can see below; however, it should still be able to get the file from the remaining block copies:
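
For example, reading the sample file during the detection window:

    # The client may log warnings about the unreachable DataNode,
    # but the copy still succeeds thanks to the second replica
    hdfs dfs -get /sample.bin /tmp/sample.check
    ls -l /tmp/sample.check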

After a while, the missing DataNode should be reported as dead:
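
Check the report again:

    # Look at the summary and per-node sections for the dead DataNode
    hdfs dfsadmin -report | grep -i -A 2 dead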

File blocks are then re-replicated to the remaining DataNodes, so that you don’t get any exception when reading the file:
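
You can verify it with fsck and by reading the file again:

    # Each block should again show 2 live replicas, now spread over the remaining nodes,
    # and a full read no longer triggers exceptions
    hdfs fsck /sample.bin -files -blocks -locations
    hdfs dfs -cat /sample.bin > /dev/null && echo "read OK"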

If you restart the failed DataNode, the NameNode will report it back alive right away. Each block will temporarily have 3 copies, until the NameNode removes the excess replicas to match the replication factor of 2:
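
For example:

    # On yellow, start the DataNode again and check the cluster state
    sbin/hadoop-daemon.sh start datanode
    hdfs dfsadmin -report
    hdfs fsck /sample.bin -files -blocks -locations   # excess replicas are removed after a while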

NameNode Failure and QJM

In the previous configuration, the NameNode remains a single point of failure, especially if disks are local to the servers and not secured by any kind of RAID/LVM. Keep in mind that HDFS metadata is only accessible from the NameNode server. To secure that configuration, HDFS provides a way to replicate metadata to a standby NameNode. For now, and at least up to release 2.5.0, there can only be 2 NameNodes per configuration: one active and one standby. In case of a server loss, or to manage server maintenance, you can fail over between the active and standby NameNodes. That replication infrastructure is named the Quorum Journal Manager and it works as described below:

  1. The active NameNode sends edits to the JournalNodes so that the metadata can be maintained by the other NameNode
  2. The standby NameNode pulls edits from the JournalNodes and maintains a copy of the metadata
  3. DataNodes report not only to the active NameNode but to all NameNodes in the configuration, so that a standby node can be made active by a simple failover command

[Diagram: Quorum Journal Manager]

NameNode Standby Configuration

In order to configure the Quorum Journal Manager for manual failover, proceed as below; a consolidated command sketch follows the list:

  • Stop the NameNode and DataNodes

  • Create the Journal Directory on every node

  • Create the metadata directory on green

  • Make sure fuser is installed on every one of your servers

  • Add the following properties to etc/hadoop/hdfs-site.xml

Note:
For details about those parameters, refer to HDFS High Availability Using the Quorum Journal Manager.

  • Change core-site.xml fs.defaultFS property to match the cluster name

  • Distribute those 2 files to all the nodes of the cluster

  • Start the journalnode and datanode processes

  • Format the JournalNodes from yellow

  • Start the NameNode on yellow

  • Configure the NameNode on green to synchronize with the primary NameNode

  • Start the NameNode on green

  • Check the status on both NameNodes:

  • Activate one of the NameNodes

  • Test HDFS; the client should be able to access HDFS
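
The sketch below strings those steps together. It is a minimal example only: the nameservice name (colorcluster), the NameNode logical names (nn1 on yellow, nn2 on green), the ports, the SSH key and every directory path are assumptions to adapt to your own setup and to the HDFS High Availability Using the Quorum Journal Manager documentation.

    # 1. Stop the NameNode (yellow) and the DataNodes (every node)
    sbin/hadoop-daemon.sh stop namenode
    sbin/hadoop-daemon.sh stop datanode

    # 2. Create the journal directory on every node, and the metadata directory on green
    mkdir -p /path/to/journal            # dfs.journalnode.edits.dir
    mkdir -p /path/to/metadata           # on green only

    # 3. Make sure fuser is available on every server (it is used by the sshfence fencing
    #    method; on most distributions it ships with the psmisc package)
    which fuser

    # 4. Properties to add to etc/hadoop/hdfs-site.xml (property names are standard,
    #    values are examples):
    #      dfs.nameservices                                = colorcluster
    #      dfs.ha.namenodes.colorcluster                   = nn1,nn2
    #      dfs.namenode.rpc-address.colorcluster.nn1       = yellow:8020
    #      dfs.namenode.rpc-address.colorcluster.nn2       = green:8020
    #      dfs.namenode.http-address.colorcluster.nn1      = yellow:50070
    #      dfs.namenode.http-address.colorcluster.nn2      = green:50070
    #      dfs.namenode.shared.edits.dir                   = qjournal://yellow:8485;pink:8485;green:8485/colorcluster
    #      dfs.journalnode.edits.dir                       = /path/to/journal
    #      dfs.client.failover.proxy.provider.colorcluster = org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
    #      dfs.ha.fencing.methods                          = sshfence
    #      dfs.ha.fencing.ssh.private-key-files            = /home/hadoop/.ssh/id_rsa

    # 5. Change fs.defaultFS in etc/hadoop/core-site.xml to point at the nameservice:
    #      fs.defaultFS = hdfs://colorcluster

    # 6. Distribute both files to all the nodes, for instance:
    for host in pink green; do
        scp etc/hadoop/hdfs-site.xml etc/hadoop/core-site.xml $host:$PWD/etc/hadoop/
    done

    # 7. Start the JournalNode and DataNode processes on every node
    sbin/hadoop-daemon.sh start journalnode
    sbin/hadoop-daemon.sh start datanode

    # 8. On yellow: initialize the JournalNodes from the existing metadata, then start the NameNode
    hdfs namenode -initializeSharedEdits
    sbin/hadoop-daemon.sh start namenode

    # 9. On green: pull a copy of the metadata from the primary NameNode, then start the second one
    hdfs namenode -bootstrapStandby
    sbin/hadoop-daemon.sh start namenode

    # 10. Check the status of both NameNodes and activate one of them
    hdfs haadmin -getServiceState nn1
    hdfs haadmin -getServiceState nn2
    hdfs haadmin -transitionToActive nn1

    # 11. Test HDFS from any client
    hdfs dfs -ls /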

Manual Failover

If you lose the active NameNode on yellow, you won’t be able to access HDFS anymore:
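
For example, with the names used in the sketch above:

    # Simulate the loss of the active NameNode on yellow
    sbin/hadoop-daemon.sh stop namenode      # run on yellow
    # From a client, operations now keep retrying and eventually fail
    hdfs dfs -ls /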

However, you can easily fail over to the standby node with a single command:
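
Still using the nn1/nn2 logical names from the sketch above:

    # Fence nn1 (yellow) and make nn2 (green) the active NameNode
    hdfs haadmin -failover nn1 nn2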

The client should then be able to use HDFS again.

Startup and Shutdown Procedure

Once you’ve configured QJM, starting and stopping the whole HDFS cluster differ from before. In order to stop the whole cluster, connect to yellow and run:
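
For example:

    # stop-dfs.sh reads the configuration and stops both NameNodes, the DataNodes
    # and the JournalNodes across the cluster
    sbin/stop-dfs.sh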

To restart HDFS, run:
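
For example:

    sbin/start-dfs.sh
    # With manual failover, both NameNodes come back in standby state; activate one of them
    hdfs haadmin -transitionToActive nn1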

Conclusion

This 4th part of the series shows how HDFS works. It also shows how DataNode replication behaves, and how to set up and test the Quorum Journal Manager to handle a manual failover of the NameNode. There are many other aspects of HDFS you may want to dig into, like the use of ZooKeeper to automate NameNode failovers or the configuration of racks for replication. Next week you’ll start developing MapReduce jobs with this colorcluster. Stay tuned!

Reference
To know more about HDFS, refer to HDFS Users Guide.
