
Wednesday 9 October 2013

HADOOP FS SHELL COMMANDS


Hadoop file system (fs) shell commands are used to perform various file operations such as copying files, changing file permissions, viewing file contents, changing file ownership, creating directories, and so on.

The syntax of the fs shell commands is
hadoop fs <args>

All the fs shell commands take a path URI as an argument. The format of a URI is scheme://authority/path. The scheme and authority are optional. For HDFS the scheme is hdfs, and for the local file system the scheme is file. If you do not specify a scheme, the default scheme is taken from the configuration file. You can also specify directories in hdfs along with the URI, as hdfs://namenodehost/dir1/dir2 or simply /dir1/dir2.
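For example (the paths below are illustrative), the following two commands are equivalent when the default file system in the configuration points to hdfs://namenodehost:
hadoop fs -ls hdfs://namenodehost/user/hadoop
hadoop fs -ls /user/hadoop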

The hadoop fs commands are very similar to the unix commands. Let's see each of the fs shell commands in detail with examples:


Hadoop fs Shell Commands


hadoop fs ls: 

The hadoop ls command is used to list out the directories and files. An example is shown below: 
> hadoop fs -ls /user/hadoop/employees
Found 1 items
-rw-r--r--   2 hadoop hadoop 2 2012-06-28 23:37 /user/hadoop/employees/000000_0

The above command lists out the files in the employees directory. 
> hadoop fs -ls /user/hadoop/dir
Found 1 items
drwxr-xr-x   - hadoop hadoop  0 2013-09-10 09:47 /user/hadoop/dir/products

The output of the hadoop fs -ls command is very similar to the output of the unix ls command. The only difference is in the second field: for a file, the second field indicates the number of replicas, and for a directory, the second field is empty.

hadoop fs lsr: 

The hadoop lsr command recursively displays the directories, subdirectories and files in the specified directory. The usage example is shown below:
> hadoop fs -lsr /user/hadoop/dir
Found 2 items
drwxr-xr-x   - hadoop hadoop  0 2013-09-10 09:47 /user/hadoop/dir/products
-rw-r--r--   2 hadoop hadoop    1971684 2013-09-10 09:47 /user/hadoop/dir/products/products.dat

The hadoop fs lsr command is similar to the ls -R command in unix. 

hadoop fs cat: 

Hadoop cat command is used to print the contents of the file on the terminal (stdout). The usage example of hadoop cat command is shown below: 
> hadoop fs -cat /user/hadoop/dir/products/products.dat

cloudera book by amazon
cloudera tutorial by ebay

hadoop fs chgrp: 

The hadoop chgrp shell command is used to change the group association of files. Optionally, you can use the -R option to apply the change recursively through the directory structure. The usage of hadoop fs -chgrp is shown below:
hadoop fs -chgrp [-R] <NewGroupName> <file or directory name>
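For example (the group name and path below are illustrative), the following command recursively changes the group of everything under the hadoopdemo directory:
hadoop fs -chgrp -R analysts /user/hadoop/hadoopdemo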

hadoop fs chmod: 

The hadoop chmod command is used to change the permissions of files. The -R option can be used to recursively change the permissions of a directory structure. The usage is shown below: 
hadoop fs -chmod [-R] <mode | octal mode> <file or directory name>
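For example (the paths below are illustrative), the first command adds read permission for other users on a single file, and the second recursively applies octal mode 755 to a directory:
hadoop fs -chmod o+r /user/hadoop/hadoopdemo/sales
hadoop fs -chmod -R 755 /user/hadoop/hadoopdemo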

hadoop fs chown: 

The hadoop chown command is used to change the ownership of files. The -R option can be used to recursively change the owner of a directory structure. The usage is shown below: 
hadoop fs -chown [-R] <NewOwnerName>[:NewGroupName] <file or directory name>
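For example (the owner and group names below are illustrative; note that changing ownership normally requires superuser privileges):
hadoop fs -chown -R hadoopadmin:hadoop /user/hadoop/hadoopdemo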

hadoop fs mkdir: 

The hadoop mkdir command is for creating directories in the hdfs. You can use the -p option for creating parent directories. This is similar to the unix mkdir command. The usage example is shown below: 
> hadoop fs -mkdir /user/hadoop/hadoopdemo

The above command creates the hadoopdemo directory in the /user/hadoop directory. 
> hadoop fs -mkdir -p /user/hadoop/dir1/dir2/demo

The above command creates the dir1/dir2/demo directory in /user/hadoop directory. 

hadoop fs copyFromLocal: 

The hadoop copyFromLocal command is used to copy a file from the local file system to HDFS. The syntax and usage example are shown below:
Syntax:
hadoop fs -copyFromLocal <localsrc> URI

Example:

Check the data in the local file:
> ls sales
2000,iphone
2001, htc

Now copy this file to hdfs:

> hadoop fs -copyFromLocal sales /user/hadoop/hadoopdemo

View the contents of the hdfs file.

> hadoop fs -cat /user/hadoop/hadoopdemo/sales
2000,iphone
2001, htc

hadoop fs copyToLocal: 

The hadoop copyToLocal command is used to copy a file from the hdfs to the local file system. The syntax and usage example is shown below: 
Syntax
hadoop fs -copyToLocal [-ignorecrc] [-crc] URI <localdst>

Example:

hadoop fs -copyToLocal /user/hadoop/hadoopdemo/sales salesdemo

The -ignorecrc option is used to copy the files that fail the crc check. The -crc option is for copying the files along with their CRC. 

hadoop fs cp: 

The hadoop cp command is for copying the source into the target. The cp command can also be used to copy multiple files into the target. In this case the target should be a directory. The syntax is shown below: 
hadoop fs -cp /user/hadoop/SrcFile /user/hadoop/TgtFile
hadoop fs -cp /user/hadoop/file1 /user/hadoop/file2 hdfs://namenodehost/user/hadoop/TgtDirectory

hadoop fs -put: 

The hadoop put command is used to copy one or more sources to the destination file system. The put command can also read input from stdin. The different syntaxes for the put command are shown below:
Syntax1: copy single file to hdfs

hadoop fs -put localfile /user/hadoop/hadoopdemo

Syntax2: copy multiple files to hdfs

hadoop fs -put localfile1 localfile2 /user/hadoop/hadoopdemo

Syntax3: Read input file name from stdin
hadoop fs -put - hdfs://namenodehost/user/hadoop/hadoopdemo
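For example (the target file name is illustrative), you can pipe the output of another command directly into a file in hdfs:
echo "2002,nokia" | hadoop fs -put - /user/hadoop/hadoopdemo/sales_extra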

hadoop fs get: 

Hadoop get command copies the files from hdfs to the local file system. The syntax of the get command is shown below: 
hadoop fs -get /user/hadoop/hadoopdemo/hdfsFileName localFileName

hadoop fs getmerge: 

hadoop getmerge command concatenates the files in the source directory into the destination file. The syntax of the getmerge shell command is shown below: 
hadoop fs -getmerge <src> <localdst> [addnl]

The optional addnl argument adds a newline character at the end of each file.
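For example (the local file name is illustrative), the following merges all files under the products directory used earlier into a single local file:
hadoop fs -getmerge /user/hadoop/dir/products products_merged.dat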

hadoop fs moveFromLocal: 

The hadoop moveFromLocal command moves a file from the local file system to an hdfs directory and removes the original local copy. The usage example is shown below:
> hadoop fs -moveFromLocal products /user/hadoop/hadoopdemo

hadoop fs mv: 

The hadoop mv command moves files from an hdfs source to an hdfs destination. It can also be used to move multiple source files into a target directory; in this case the target should be a directory. The syntax is shown below:
hadoop fs -mv /user/hadoop/SrcFile /user/hadoop/TgtFile
hadoop fs -mv /user/hadoop/file1 /user/hadoop/file2 hdfs://namenodehost/user/hadoop/TgtDirectory

hadoop fs du: 

The du command displays the aggregate length of the files contained in a directory, or the length of a file in case it is just a file. The syntax and usage are shown below:
hadoop fs -du hdfs://namenodehost/user/hadoop

hadoop fs dus: 

The hadoop dus command prints a summary of the file lengths:
> hadoop fs -dus hdfs://namenodehost/user/hadoop
hdfs://namenodehost/user/hadoop 21792568333

hadoop fs expunge: 

Used to empty the trash. The usage of expunge is shown below: 
hadoop fs -expunge

hadoop fs rm: 

Removes the specified list of files and empty directories. An example is shown below: 
hadoop fs -rm /user/hadoop/file

hadoop fs -rmr: 

Recursively deletes the files and subdirectories. The usage of rmr is shown below:
hadoop fs -rmr /user/hadoop/dir

hadoop fs setrep: 

The hadoop setrep command is used to change the replication factor of a file. Use the -R option to change the replication factor recursively; the -w option in the example below waits until the replication is complete.
hadoop fs -setrep -w 4 -R /user/hadoop/dir

hadoop fs stat: 

The hadoop stat command returns the stat information on a path. The syntax of stat is shown below:
hadoop fs -stat URI

> hadoop fs -stat /user/hadoop/
2013-09-24 07:53:04

hadoop fs tail: 

The hadoop tail command prints the last kilobyte of the file. The -f option can be used in the same way as in unix.
> hadoop fs -tail /user/hadoop/sales.dat

12345 abc
2456 xyz

hadoop fs test: 

The hadoop test command is used for file test operations. The syntax is shown below:
hadoop fs -test -[ezd] URI

Here "e" checks for the existence of a file, "z" checks whether the file is zero length, and "d" checks whether the path is a directory. The test command returns 0 if the check is true and 1 otherwise.
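For example, the exit status below is 0 because the sales file was created in the copyFromLocal example above (adjust the path to a file that exists on your cluster):
> hadoop fs -test -e /user/hadoop/hadoopdemo/sales
> echo $?
0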

hadoop fs text: 

The hadoop text command displays the source file in text format. The allowed source file formats are zip and TextRecordInputStream. The syntax is shown below: 
hadoop fs -text <src>
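For example (the file name below is illustrative), viewing a sequence file as plain text:
hadoop fs -text /user/hadoop/hadoopdemo/products.seq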

hadoop fs touchz: 

The hadoop touchz command creates a zero byte file. This is similar to the touch command in unix. The syntax is shown below: 
hadoop fs -touchz /user/hadoop/filename

Thursday 31 July 2014

Big Data Basics - Part 3 - Overview of Hadoop


Problem

I have read the previous tips on Introduction to Big Data and Architecture of Big Data and I would like to know more about Hadoop.  What are the core components of the Big Data ecosystem? Check out this tip to learn more.

Solution

Before we look into the architecture of Hadoop, let us understand what Hadoop is and a brief history of Hadoop.

What is Hadoop?

Hadoop is an open source framework, from the Apache foundation, capable of processing large amounts of heterogeneous data sets in a distributed fashion across clusters of commodity computers and hardware using a simplified programming model. Hadoop provides a reliable shared storage and analysis system.

The Hadoop framework is based closely on the following principle:
In pioneer days they used oxen for heavy pulling, and when one ox couldn't budge a log, they didn't try to grow a larger ox. We shouldn't be trying for bigger computers, but for more systems of computers. ~Grace Hopper

History of Hadoop

Hadoop was created by Doug Cutting and Mike Cafarella. Hadoop originated from an open source web search engine called "Apache Nutch", which is part of another Apache project called "Apache Lucene", a widely used open source text search library.

The name Hadoop is a made-up name and is not an acronym. According to Hadoop's creator Doug Cutting, the name came about as follows.

"The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria. Kids are good at generating such. Googol is a kid's term."

Architecture of Hadoop

Below is a high-level architecture of a multi-node Hadoop cluster.

Architecture of Multi-Node Hadoop Cluster

Here are a few highlights of the Hadoop architecture:
  • Hadoop works in a master-worker / master-slave fashion.
  • Hadoop has two core components: HDFS and MapReduce.
  • HDFS (Hadoop Distributed File System) offers highly reliable, distributed storage and ensures reliability, even on commodity hardware, by replicating the data across multiple nodes. Unlike a regular file system, when data is pushed to HDFS it is automatically split into multiple blocks (a configurable parameter) and stored/replicated across various datanodes. This ensures high availability and fault tolerance.
  • MapReduce offers an analysis system which can perform complex computations on large datasets. This component is responsible for performing all the computations; it works by breaking a large, complex computation into multiple tasks, assigning them to individual worker/slave nodes, and taking care of the coordination and consolidation of the results.
  • The master contains the Namenode and Job Tracker components.
    • Namenode holds the information about all the other nodes in the Hadoop Cluster, files present in the cluster, constituent blocks of files and their locations in the cluster, and other information useful for the operation of the Hadoop Cluster.
    • Job Tracker keeps track of the individual tasks/jobs assigned to each of the nodes and coordinates the exchange of information and results.
  • Each worker/slave node contains the Task Tracker and Datanode components.
    • Task Tracker is responsible for running the task / computation assigned to it.
    • Datanode is responsible for holding the data.
  • The computers in the cluster can be located anywhere; there is no dependency on the location of the physical server.

Characteristics of Hadoop

Here are the prominent characteristics of Hadoop:
  • Hadoop provides a reliable shared storage (HDFS) and analysis system (MapReduce).
  • Hadoop is highly scalable and unlike the relational databases, Hadoop scales linearly. Due to linear scale, a Hadoop Cluster can contain tens, hundreds, or even thousands of servers.
  • Hadoop is very cost effective as it can work with commodity hardware and does not require expensive high-end hardware.
  • Hadoop is highly flexible and can process both structured as well as unstructured data.
  • Hadoop has built-in fault tolerance. Data is replicated across multiple nodes (the replication factor is configurable; see the sketch after this list) and if a node goes down, the required data can be read from another node which has a copy of that data. Hadoop also ensures that the replication factor is maintained if a node goes down, by re-replicating the data to other available nodes.
  • Hadoop works on the principle of write once and read multiple times.
  • Hadoop is optimized for large and very large data sets. For instance, a small amount of data such as 10 MB generally takes more time to process in Hadoop than in traditional systems.
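As a minimal illustrative sketch of the configurable replication factor mentioned above (the value 3 is only an example; the property is set in hdfs-site.xml):
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
The replication factor of existing files can also be changed with the hadoop fs -setrep command covered in the fs shell commands post above.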

When to Use Hadoop (Hadoop Use Cases)

Hadoop can be used in various scenarios including some of the following:
  • Analytics
  • Search
  • Data Retention
  • Log file processing
  • Analysis of Text, Image, Audio, & Video content
  • Recommendation systems like in E-Commerce Websites

When Not to Use Hadoop

There are a few scenarios in which Hadoop is not the right fit. The following are some of them:
  • Low-latency or near real-time data access.
  • If you have a large number of small files to be processed. This is due to the way Hadoop works. Namenode holds the file system metadata in memory and as the number of files increases, the amount of memory required to hold the metadata increases.
  • Scenarios requiring multiple writers, arbitrary writes, or modifications within existing files.
For more information on the Hadoop framework and the features of the latest Hadoop release, visit the Apache website: http://hadoop.apache.org.
There are a few other important projects in the Hadoop ecosystem; these projects help in operating/managing Hadoop, interacting with Hadoop, integrating Hadoop with other systems, and Hadoop development. We will take a look at these items in subsequent tips.
Next Steps
  • Explore more about Big Data and Hadoop
  • In the next and subsequent tips, we will see what HDFS, MapReduce, and other aspects of the Big Data world are. So stay tuned!

 

Tuesday 21 October 2014

Ganglia configuration for a small Hadoop cluster and some troubleshooting

Ganglia is an open-source, scalable and distributed monitoring system for large clusters. It collects, aggregates and provides time-series views of tens of machine-related metrics such as CPU, memory, storage, network usage. You can see Ganglia in action at UC Berkeley Grid.
Ganglia is also a popular solution for monitoring Hadoop and HBase clusters, since Hadoop (and HBase) has built-in support for publishing its metrics to Ganglia. With Ganglia you may easily see the number of bytes written by a particular HDFS datanode over time, the block cache hit ratio for a given HBase region server, the total number of requests to the HBase cluster, time spent in garbage collection and many, many others.

Basic Ganglia overview

Ganglia consists of three components:
  • Ganglia monitoring daemon (gmond) – a daemon which needs to run on every single node that is monitored. It collects local monitoring metrics and announces them, and (if configured) receives and aggregates metrics sent to it from other gmonds (and even from itself).
  • Ganglia meta daemon (gmetad) – a daemon that periodically polls one or more data sources (a data source can be a gmond or another gmetad) to receive and aggregate the current metrics. The aggregated results are stored in a database and can be exported as XML to other clients – for example, the web frontend.
  • Ganglia PHP web frontend – it retrieves the combined metrics from the meta daemon and displays them in the form of nice, dynamic HTML pages containing various real-time graphs.
If you want to learn more about gmond, gmetad and the web frontend, a very good description is available on Ganglia's Wikipedia page. Hopefully the following picture (showing an example configuration) helps to illustrate the idea:


In this post I will focus on the configuration of Ganglia. If you are using Debian, please refer to the following tutorial to install it (it takes just a couple of commands). We use Ganglia 3.1.7 in this post.

Ganglia for a small Hadoop cluster

While Ganglia is scalable, distributed and can monitor hundreds and even thousands of nodes, small clusters can also benefit from it (as well as developers and administrators, since Ganglia is a great empirical way to learn Hadoop and HBase internals). In this post I would like to describe how we configured Ganglia on a five-node cluster (1 master and 4 slaves) that runs Hadoop and HBase. I believe that a 5-node cluster (or a similar size) is a typical configuration that many companies and organizations start using Hadoop with.
Please note that Ganglia is flexible enough to be configured in many ways. Here, I will simply describe what final effect I wanted to achieve and how it was done.
Our monitoring requirements can be specified as follows:
  • easily get metrics from every single node
  • easily get aggregated metrics for all slave nodes (so that we will know how many resources are used by MapReduce jobs and HBase operations)
  • easily get aggregated metrics for all master nodes (so far we have only one master, but when the cluster grows, we will move some master daemons (e.g. JobTracker, Secondary Namenode) to separate machines)
  • easily get aggregated metrics for all nodes (to get the overall state of the cluster)
It means that I want Ganglia to see the cluster as two “logical” subclusters e.g. “masters” and “slaves”. Basically, I wish to see pages like this one:

Possible Ganglia’s configuration

Here is an illustrative picture which shows what a simple Ganglia configuration for a 5-node Hadoop cluster that meets all our requirements may look like. So let's configure it this way!
Please note that we would like to keep as many default settings as possible. By default:
  • gmond communicates on UDP port 8649 (specified in udp_send_channel and udp_recv_channel sections in gmond.conf)
  • gmetad downloads metrics on TCP port 8649 (specified in tcp_accept_channel section in gmond.conf, and in data_source entry in gmetad.conf)
However, one setting will be changed. We set the communication method between gmonds to be unicast UDP messages (instead of multicast UDP messages). Unicast has the following advantages over multicast: it is better for a larger cluster (say, a cluster with more than a hundred nodes) and it is supported in the Amazon EC2 environment (unlike multicast).

Ganglia monitoring daemon (gmond) configuration

According to the picture above:
  • Every node runs a gmond.
  • Slaves subcluster configuration

  • Each gmond on slave1, slave2, slave3 and slave4 nodes defines udp_send_channel to send metrics to slave1 (port 8649)
  • gmond on slave1 defines udp_recv_channel (port 8649) to listen to incoming metrics and tcp_accept_channel (port 8649) to announce them. This means this gmond is the “lead-node” for this subcluster and collects all metrics sent via UDP (port 8649) by all gmonds from slave nodes (even from itself), which can be polled later via TCP (port 8649) by gmetad.
  • Masters subcluster configuration

  • gmond on master node defines udp_send_channel to send data to master (port 8649), udp_recv_channel (port 8649) and tcp_accept_channel (port 8649). This means it becomes the “lead node” for this one-node cluster and collects all metrics from itself and exposes them to gmetad.
The configuration should be specified in gmond.conf file (you may find it in /etc/ganglia/).
gmond.conf for slave1 (only the most important settings included):
cluster {
  name = "hadoop-slaves"
  ...
}
udp_send_channel {
  host = slave1.node.IP.address
  port = 8649
}
udp_recv_channel {
  port = 8649
}
tcp_accept_channel {
  port = 8649
}
gmond.conf for slave2, slave3, slave4 (actually, the same gmond.conf file as for slave1 can be used as well):
cluster {
  name = "hadoop-slaves"
  ...
}
udp_send_channel {
  host = slave1.node.IP.address
  port = 8649
}
udp_recv_channel {}
tcp_accept_channel {}
The gmond.conf file for the master node should be similar to slave1’s gmond.conf file – just replace slave1’s IP address with the master’s IP and set the cluster name to “hadoop-masters”.
You can read more about gmond‘s configuration sections and attributes here.

Ganglia meta daemon (gmetad)

gmetad configuration is even simpler:
  • Master runs gmetad
  • gmetad defines two data sources:
data_source "hadoop-masters" master.node.IP.address
data_source "hadoop-slaves" slave1.node.IP.address
The configuration should be specified in gmetad.conf file (you may find it in /etc/ganglia/).
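A minimal gmetad.conf for this setup might look like the following sketch (the gridname value is illustrative; it matches the grid name that appears in the gmetad output later in this post):
gridname "Hadoop_And_HBase"
data_source "hadoop-masters" master.node.IP.address
data_source "hadoop-slaves" slave1.node.IP.address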

Hadoop and HBase integration with Ganglia

Hadoop and HBase use GangliaContext class to send the metrics collected by each daemon (such as datanode, tasktracker, jobtracker, HMaster etc) to gmonds.
Once you have set up Ganglia successfully, you may want to edit /etc/hadoop/conf/hadoop-metrics.properties and /etc/hbase/conf/hadoop-metrics.properties to announce Hadoop- and HBase-related metrics to Ganglia. Since we use CDH 4.0.1, which is compatible with Ganglia releases 3.1.x, we use the newly introduced GangliaContext31 class (instead of the older GangliaContext class) in the properties files.

Metrics configuration for slaves

# /etc/hadoop/conf/hadoop-metrics.properties
...
dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
dfs.period=10
dfs.servers=hadoop-slave1.IP.address:8649
...
mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
mapred.period=10
mapred.servers=hadoop-slave1.IP.address:8649
...

Metrics configuration for master

It should be the same as for the slaves – just use hadoop-master.IP.address:8649 (instead of hadoop-slave1.IP.address:8649), for example:
# /etc/hbase/conf/hadoop-metrics.properties
...
hbase.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
hbase.period=10
hbase.servers=hadoop-master.IP.address:8649
...
Remember to edit both properties files (/etc/hadoop/conf/hadoop-metrics.properties for Hadoop and /etc/hbase/conf/hadoop-metrics.properties for HBase) on all nodes and then restart Hadoop and HBase clusters. No further configuration is necessary.

Some more details

Actually, I was surprised that Hadoop’s daemons really send data somewhere, instead of just being polled for it. What does it mean? It means, for example, that every single slave node runs several processes (e.g. gmond, datanode, tasktracker and regionserver) that collect metrics and send them to the gmond running on the slave1 node. If we stop the gmonds on slave2, slave3 and slave4, but still run Hadoop’s daemons, we will still get metrics related to Hadoop (but will not get metrics about memory or CPU usage, as those were to be sent by the stopped gmonds). Please look at the slave2 node in the picture below to see (more or less) how it works (tt, dd and rs denote tasktracker, datanode and regionserver respectively, while slave4 was removed in order to increase readability).

Single points of failure

This configuration works well until nodes start to fail. And we know that they will! And we know that, unfortunately, our configuration has at least two single points of failure (SPoF):
  • gmond on slave1 (if this node fails, all monitoring statistics about all slave nodes will be unavailable)
  • gmetad and the web frontend on master (if this node fails, the full monitoring system will be unavailable. It means that we not only lose the most important Hadoop node (actually, it should be called SUPER-master since it has so many master daemons installed ;), but we also lose the valuable source of monitoring information that may help us detect the cause of failure by looking at graphs and metrics for this node that were generated just a moment before the failure)

Avoiding Ganglia’s SPoF on slave1 node

Fortunately, you may specify as many udp_send_channels as you like to send metrics redundantly to other gmonds (assuming that these gmonds specify udp_recv_channels to listen to incoming metrics).
In our case, we may select slave2 to be an additional lead node (together with slave1) to collect metrics redundantly (and announce them to gmetad):
  • update gmond.conf on all slave nodes and define an additional udp_send_channel section to send metrics to slave2 (port 8649)
  • update gmond.conf on slave2 to define udp_recv_channel (port 8649) to listen to incoming metrics and tcp_accept_channel (port 8649) to announce them (the same settings should already be set in gmond.conf on slave1)
  • update hadoop-metrics.properties file for Hadoop and HBase daemons running on slave nodes to send their metrics to both slave1 and slave2 e.g.:
# /etc/hbase/conf/hadoop-metrics.properties
...
hbase.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
hbase.period=10
hbase.servers=hadoop-slave1.IP.address:8649,hadoop-slave2.IP.address:8649
  • finally update data_source “hadoop-slaves” in gmetad.conf to poll data from two redundant gmonds (if gmetad cannot pull the data from slave1.node.IP.address, it will continue trying slave2.node.IP.address):
data_source "hadoop-slaves" slave1.node.IP.address slave2.node.IP.address
Perhaps the picture below is not the most fortunate (so many arrows), but it is meant to show that if slave1 fails, then gmetad will be able to take metrics from the gmond on the slave2 node (since all slave nodes send metrics redundantly to the gmonds running on slave1 and slave2).

Avoiding Ganglia’s SPoF on master node

The main idea here is not to collocate gmetad (and the web frontend) with the Hadoop master daemons, so that we do not lose monitoring statistics if the master node fails (or simply becomes unavailable). One idea is, for example, to move gmetad (and the web frontend) to slave3 (or slave4), or simply to introduce a redundant gmetad running on slave3 (or slave4). The former idea seems to be quite OK, while the latter makes things quite complicated for such a small cluster.
I guess that an even better idea is to introduce an additional node (an “edge” node, if possible) that runs gmetad and the web frontend (it may also have the base Hadoop and HBase packages installed, but it does not run any Hadoop daemons – it acts as a client machine only, to launch MapReduce jobs and access HBase). Actually, an “edge” node is a commonly used practice to provide the interface between users and the cluster (e.g. it runs Pig, Hive and Oozie).

Troubleshooting and tips that may help

Since debugging various aspects of the configuration was the longest part of setting up Ganglia, I share some tips here. Note that this does not cover all possible troubleshooting; it is rather based on the problems that we encountered and finally managed to solve.

Start small

Although the configuration of Ganglia is not so complex, it is good to start with only two nodes and, if it works, grow it to a larger cluster. But before you install any Ganglia daemon…

Try to send “Hello” from node1 to node2

Make sure that you can talk to port 8649 on the given target host using the UDP protocol. netcat is a simple tool that helps you verify it. Open port 8649 on node1 (called the “lead node” later) for inbound UDP connections, and then send some text to it from node2.
# listen (-l option) for inbound UDP (-u option) connections on port 8649 
# and print received data
akawa@hadoop-slave1:~$ nc -u -l -p 8649
# create a UDP (-u option) connection to hadoop-slave1:8649 
# and send text from stdin to that node:
akawa@hadoop-slave2:~$ nc -u hadoop-slave1 8649
Hello My Lead Node
# look at slave1's console to see if the text was successfully delivered
akawa@hadoop-slave1:~$
Hello My Lead Node
If it does not work, please double check whether your iptables rules (iptables, or ip6tables if you use IPv6) open port 8649 for both UDP and TCP connections.
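For example (the exact rules depend on your firewall setup; this is only a sketch), rules similar to the following should accept inbound Ganglia traffic on port 8649:
# allow inbound UDP (gmond metrics) and TCP (gmetad polling) on port 8649
sudo iptables -A INPUT -p udp --dport 8649 -j ACCEPT
sudo iptables -A INPUT -p tcp --dport 8649 -j ACCEPT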

Let gmond send some data to another gmond

Install gmond on two nodes and verify that one can send its metrics to the other using a UDP connection on port 8649. You may use the following settings in the gmond.conf file for both nodes:
cluster {
  name = "hadoop-slaves"
}
udp_send_channel {
  host = the.lead.node.IP.address
  port = 8649
}
udp_recv_channel {
  port = 8649
}
tcp_accept_channel {}
After running gmonds (sudo /etc/init.d/ganglia-monitor start), you can use lsof to check if the connection was established:
akawa@hadoop-slave1:~$ sudo lsof -i :8649
COMMAND   PID    USER   FD   TYPE    DEVICE SIZE/OFF NODE NAME
gmond   48746 ganglia    4u  IPv4 201166172      0t0  UDP *:8649 
gmond   48746 ganglia    5u  IPv4 201166173      0t0  TCP *:8649 (LISTEN)
gmond   48746 ganglia    6u  IPv4 201166175      0t0  UDP hadoop-slave1:35702->hadoop-slave1:8649
akawa@hadoop-slave2:~$ sudo lsof -i :8649
COMMAND   PID    USER   FD   TYPE    DEVICE SIZE/OFF NODE NAME
gmond   31025 ganglia    6u  IPv4 383110679      0t0  UDP hadoop-slave2:60789->hadoop-slave1:8649
To see if any data is actually sent to the lead node, use tcpdump to dump network traffic on port 8649:
akawa@hadoop-slave1:~$ sudo tcpdump -i eth-pub udp port 8649
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth-pub, link-type EN10MB (Ethernet), capture size 65535 bytes
18:08:02.236625 IP hadoop-slave2.60789 > hadoop-slave1.8649: UDP, length 224
18:08:02.236652 IP hadoop-slave2.60789 > hadoop-slave1.8649: UDP, length 52
18:08:02.236661 IP hadoop-slave2.60789 > hadoop-slave1.8649: UDP, length 236

Use debug option

tcpdump shows that some data is transferred, but it does not tell you what kind of data is sent ;)
Fortunately, running gmond or gmetad in debugging mode gives us more information (it does not run as a daemon in debugging mode, so simply stop it using Ctrl+C):
akawa@hadoop-slave1:~$ sudo /etc/init.d/ganglia-monitor stop
akawa@hadoop-slave1:~$ sudo /usr/sbin/gmond -d 2
 
loaded module: core_metrics
loaded module: cpu_module
...
udp_recv_channel mcast_join=NULL mcast_if=NULL port=-1 bind=NULL
tcp_accept_channel bind=NULL port=-1
udp_send_channel mcast_join=NULL mcast_if=NULL host=hadoop-slave1.IP.address port=8649
 
 metric 'cpu_user' being collected now
 metric 'cpu_user' has value_threshold 1.000000
        ...............
 metric 'swap_free' being collected now
 metric 'swap_free' has value_threshold 1024.000000
 metric 'bytes_out' being collected now
 ********** bytes_out:  21741.789062
        ....
Counting device /dev/mapper/lvm0-rootfs (96.66 %)
Counting device /dev/mapper/360a980006467435a6c5a687069326462 (35.31 %)
For all disks: 8064.911 GB total, 5209.690 GB free for users.
 metric 'disk_total' has value_threshold 1.000000
 metric 'disk_free' being collected now
        .....
 sent message 'cpu_num' of length 52 with 0 errors
 sending metadata for metric: cpu_speed
We see that various metrics are collected and sent to host=hadoop-slave1.IP.address port=8649. Unfortunately, it does not tell us whether they are delivered successfully, since they were sent over UDP…

Do not mix IPv4 and IPv6

Let’s have a look at a real situation that we encountered on our cluster (and which was the root cause of a mysterious and annoying Ganglia misconfiguration). First, look at the lsof results.
akawa@hadoop-slave1:~$ sudo  lsof -i :8649
COMMAND   PID    USER   FD   TYPE    DEVICE SIZE/OFF NODE NAME
gmond   38431 ganglia    4u  IPv4 197424417      0t0  UDP *:8649 
gmond   38431 ganglia    5u  IPv4 197424418      0t0  TCP *:8649 (LISTEN)
gmond   38431 ganglia    6u  IPv4 197424422      0t0  UDP hadoop-slave1:58304->hadoop-slave1:8649
akawa@hadoop-slave2:~$ sudo lsof -i :8649
COMMAND   PID    USER   FD   TYPE    DEVICE SIZE/OFF NODE NAME
gmond   23552 ganglia    6u  IPv6 382340910      0t0  UDP hadoop-slave2:36999->hadoop-slave1:8649
Here hadoop-slave2 sends metrics to hadoop-slave1 on the right port, and hadoop-slave1 listens on the right port as well. Everything is almost the same as in the snippets from the previous section, except for one important detail – hadoop-slave2 sends over IPv6, but hadoop-slave1 reads over IPv4!
The initial guess was to update ip6tables (apart from iptables) rules to open port 8649 for both UDP and TCP connections over IPv6. But it did not work.
It worked when we changed the hostname “hadoop-slave1.vls” to its IP address in the gmond.conf files (yes, earlier I used hostnames instead of IP addresses in every file).
Make sure that your IP address is correctly resolved to a hostname, or vice versa.
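For example (the hostname is illustrative), you can quickly check the resolution on each node:
# print the IP address(es) the local machine resolves its own hostname to
hostname -i
# check how a given hostname resolves (consults /etc/hosts and DNS)
getent hosts hadoop-slave1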

Get cluster summary with gstat

If you managed to send metrics from slave2 to slave1, it means your cluster is working. In Ganglia’s nomenclature, a cluster is a set of hosts that share the same cluster name attribute in the gmond.conf file, e.g. “hadoop-slaves”. There is a useful tool provided by Ganglia called gstat that prints the list of hosts that are represented by a gmond running on a given node.
akawa@hadoop-slave1:~$ gstat --all
CLUSTER INFORMATION
       Name: hadoop-slaves
      Hosts: 2
Gexec Hosts: 0
 Dead Hosts: 0
  Localtime: Tue Aug 21 22:46:21 2012
 
CLUSTER HOSTS
Hostname                     LOAD                       CPU              Gexec
 CPUs (Procs/Total) [     1,     5, 15min] [  User,  Nice, System, Idle, Wio]
hadoop-slave2
   48 (    0/  707) [  0.01,  0.07,  0.09] [   0.1,   0.0,   0.1,  99.8,   0.0] OFF
hadoop-slave1
   48 (    0/  731) [  0.01,  0.06,  0.07] [   0.0,   0.0,   0.1,  99.9,   0.0] OFF

Check where gmetad polls metrics from

Run the following command on the host that runs gmetad to check what clusters and hosts it is polling metrics from (you can grep the output to display only the useful lines):
akawa@hadoop-master:~$ nc localhost 8651 | grep hadoop
 
<GRID NAME="Hadoop_And_HBase" AUTHORITY="http://hadoop-master/ganglia/" LOCALTIME="1345642845">
<CLUSTER NAME="hadoop-masters" LOCALTIME="1345642831" OWNER="ICM" LATLONG="unspecified" URL="http://ceon.pl">
<HOST NAME="hadoop-master" IP="hadoop-master.IP.address" REPORTED="1345642831" TN="14" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1345632023">
<CLUSTER NAME="hadoop-slaves" LOCALTIME="1345642835" OWNER="ICM" LATLONG="unspecified" URL="http://ceon.pl">
<HOST NAME="hadoop-slave4" IP="..." REPORTED="1345642829" TN="16" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1345478489">
<HOST NAME="hadoop-slave2" IP="..." REPORTED="1345642828" TN="16" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1345581519">
<HOST NAME="hadoop-slave3" IP="..." REPORTED="1345642829" TN="15" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1345478489">
<HOST NAME="hadoop-slave1" IP="..." REPORTED="1345642833" TN="11" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1345572002">

Other issues (Last update: 2013-08-12)

Other issues that I saw while using Ganglia are as follows:

Alternatives

Since the monitoring of clusters is quite a broad topic, there are many tools that help you with this task. In the case of Hadoop clusters, apart from Ganglia, you can find a number of other interesting alternatives. I will just briefly mention a couple of them.

Cloudera Manager 4 (Enterprise)

Apart from greatly simplifying the process of installation and configuration of a Hadoop cluster, Cloudera Manager provides a couple of useful features to monitor and visualize dozens of Hadoop service performance metrics and information related to hosts, including CPU, memory, disk usage and network I/O. Additionally, it alerts you when you approach critical thresholds (Ganglia itself does not provide alerts, but can be integrated with alerting systems such as Nagios and Hyperic).
You may learn more about the key features of Cloudera Manager here.

Cacti, Zabbix, Nagios, Hyperic

Please visit the Cacti website to learn more about this tool. There is also a very interesting blog post about Hadoop Graphing with Cacti.
Zabbix, Nagios and Hyperic are tools you may also want to look at.

