Sunday, July 27, 2014

Working with Flume and Analyzing Flume data with HIVE

FLUME INSTALLATION:
==================

Download the latest Flume tarball from the Apache Flume download page:


Update HADOOP_CLASSPATH in $HADOOP_HOME/conf/hadoop-env.sh to include the flume-ng-core-1.5.0.jar file:
root@centos6:~ #grep -i HADOOP_CLASSPATH /home/bigdata/hadoop-1.2.1/conf/hadoop-env.sh 
# export HADOOP_CLASSPATH=
export HADOOP_CLASSPATH=/home/bigdata/apache-hive-0.13.1-bin/lib/*:/home/bigdata/apache-flume-1.5.0-bin/lib/flume-ng-core-1.5.0.jar:/home/bigdata/hadoop-1.2.1/hadoop-core-1.2.1.jar:/home/bigdata/hadoop-1.2.1/HadoopProj                         
root@centos6:~ #

Update /etc/profile with:
export FLUME_HOME=/home/bigdata/apache-flume-1.5.0-bin
export PATH=$HIVE_HOME/bin:$HADOOP_HOME/bin:$HBASE_HOME/bin:$FLUME_HOME/bin:$JAVA_HOME/bin:$PATH

SETUP CONFIGURATION FILE:
========================
Create a configuration file under $FLUME_HOME/conf.
The following conf file uses an exec source to watch a file; other source types exist (network sources such as netcat and avro, etc.). For the list of source types, visit https://cwiki.apache.org/confluence/display/FLUME/Getting+Started :

root@centos6:~/apache-flume-1.5.0-bin #pwd
/home/bigdata/apache-flume-1.5.0-bin
root@databliz-centos6:~/apache-flume-1.5.0-bin #cat conf/flume-conf.conf 
agent1.sources = s1
agent1.channels = c1
agent1.sinks = k1

# Define source and type of event (in this case it is exec)
agent1.sources.s1.type = exec
agent1.sources.s1.command = tail -f /tmp/esplog.log

# Define channel and type (in this case channel is stored in memory, other types are: file, database etc)
agent1.channels.c1.type = memory
agent1.channels.c1.capacity = 1000
agent1.channels.c1.transactionCapacity = 100

# Define sink and type (in this case hdfs, i.e. the events are written to HDFS)
# Default fileType outputs Writable (LongWritable) contents.
agent1.sinks.k1.type = hdfs
agent1.sinks.k1.hdfs.path = hdfs://<IP>:<Port>/user/flume/esplog.log
agent1.sinks.k1.hdfs.fileType=DataStream

# Bind source and sink to a channel
agent1.sources.s1.channels = c1
agent1.sinks.k1.channel = c1

root@centos6:~/apache-flume-1.5.0-bin #

RUN FLUME:
==========
root@centos6:~/apache-flume-1.5.0-bin #bin/flume-ng agent --conf ./conf --conf-file conf/flume-conf.conf --name agent1 -Dflume.root.logger=INFO,console

We can also use the debug option: -Dflume.root.logger=DEBUG,console

Each time a new entry is added to /tmp/esplog.log, Flume generates a file under hdfs://<IP>:<Port>/user/flume/esplog.log/
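To exercise the pipeline, append complete lines to the tailed file. A minimal Python sketch (any process that appends newline-terminated lines works the same way; the helper name is our own):

```python
# Append a log line to the file that Flume's exec source is tailing.
# tail -f only picks up complete lines, so terminate each record with
# a newline and flush immediately.
def append_log_line(path, line):
    with open(path, "a") as f:
        f.write(line.rstrip("\n") + "\n")
        f.flush()

append_log_line("/tmp/esplog.log", "spi messages")
```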

Example: 
root@centos6:~ #hadoop dfs -cat /user/flume/esplog.log/FlumeData.1406475505975
Warning: $HADOOP_HOME is deprecated.

spi messages
spi messages
conn messages
conn messages
SA messages
ISKAMP messages
conn5007 messages
conn5008 messages
root@databliz-centos6:~ #

ANALYZE FLUME DATA WITH HIVE:
=============================
hive> create table flumedata (col1 string, col2 string)
    > row format delimited fields terminated by ' ';
OK
Time taken: 0.935 seconds

hive> load data inpath '/user/flume/esplog.log/FlumeData.1406475505975' into table flumedata;

hive> select * from flumedata;
OK
flumedata.col1  flumedata.col2
spi     messages
spi     messages
conn    messages
conn    messages
SA      messages
ISKAMP  messages
conn5007        messages
conn5008        messages
Time taken: 0.521 seconds, Fetched: 8 row(s)
hive> 
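Beyond select *, the loaded data can be aggregated, e.g. select col1, count(*) from flumedata group by col1; would count messages per source. A Python sketch of what that group-by computes over the eight sample rows above:

```python
from collections import Counter

# The rows loaded into the flumedata table above.
rows = [
    ("spi", "messages"), ("spi", "messages"),
    ("conn", "messages"), ("conn", "messages"),
    ("SA", "messages"), ("ISKAMP", "messages"),
    ("conn5007", "messages"), ("conn5008", "messages"),
]

# Equivalent of: select col1, count(*) from flumedata group by col1
counts = Counter(col1 for col1, _ in rows)
print(counts["spi"])   # 2
```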

Monday, July 7, 2014

MapReduce Implementation in Detail

Introduction:
This document describes how the map and reduce operations are carried out in Hadoop. If you are not familiar with Google's MapReduce programming model, please refer first to the MapReduce paper: http://labs.google.com/papers/mapreduce.html
Map
Since the map step operates in parallel over the set of input files, the first step (FileSplit) is to break those files into chunks: if a single file is large enough to hurt read efficiency, it is split into several pieces. Note that this splitting step knows nothing about the internal logical structure of the input file; a text file, for example, is split at arbitrary byte boundaries. The split policy can be specified by the user, although Hadoop ships with a few simple ones already defined. Each file chunk then corresponds to a new map task.
When an individual map task starts, it opens a new output writer for each configured reduce task. It then uses the RecordReader supplied by the configured InputFormat to read its file chunk. The InputFormat parses the input and generates key-value pairs. At the same time, the InputFormat must deal with records that cross split boundaries. For example, TextInputFormat reads past the end of its chunk to finish the last line, and, if the chunk being read is not the first one, it ignores the content up to the first line break (which belongs to the previous chunk).
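That boundary rule can be sketched in plain Python (a simplification of what TextInputFormat does, with byte offsets standing in for a FileSplit): a reader for a non-first split skips everything up to its first newline, and every split reads past its end offset to finish its last line, so each line is read exactly once across splits:

```python
def read_lines_for_split(data: bytes, start: int, end: int):
    """Yield the lines 'owned' by the byte range [start, end),
    following the TextInputFormat rule: a non-first split skips
    the partial first line; every split reads past `end` to
    finish its last line."""
    pos = start
    if start != 0:
        # Skip the tail of a line owned by the previous split.
        nl = data.find(b"\n", start)
        if nl == -1:
            return
        pos = nl + 1
    while pos < end:
        nl = data.find(b"\n", pos)
        if nl == -1:
            yield data[pos:]      # last line has no trailing newline
            return
        yield data[pos:nl]
        pos = nl + 1
```

Splitting b"aa\nbb\ncc\n" at byte 4 cuts the line "bb" in half, yet the two splits together yield each line exactly once.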
The InputFormat class does not need to produce meaningful keys. For example, the default TextInputFormat emits the lines of the input text as values and their byte offsets as keys; most applications use only the lines and rarely the offsets.
The keys and values read by the RecordReader are passed to the user-supplied Mapper class, which may perform arbitrary computation on each pair and then calls OutputCollector.collect to gather the (possibly re-keyed) pairs it produces. All of the output must use a single key class and a single value class, because the map output is written to disk as a SequenceFile, which carries per-file type information and requires every record to have the same types (subclass if you want to output different data structures). The map input and output key-value types need not be related.
When the Mapper output is collected, it is partitioned among the output files by a Partitioner class. The default is HashPartitioner, which uses the hash function of the key class to distribute the pairs (so the key class should have a good hash function, to balance the load evenly across the reduce tasks). Details can be found in the MapTask class. N inputs can generate M map tasks to run, and each map task generates one output file per configured reduce task. Each output file is targeted at a specific reduce task, and the pairs for a given key from all the map tasks are routed to the same reduce, so all of the pairs for a given key are processed by a single reduce task.
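The default routing can be sketched in Python. The real HashPartitioner uses the Java key's hashCode(); crc32 below is an assumption chosen only because it is a stable hash. What matters is that the same key always maps to the same reduce task:

```python
import zlib

def partition(key: str, num_reducers: int) -> int:
    # Stable hash of the key, modulo the reducer count:
    # every pair with this key lands on the same reduce task.
    return zlib.crc32(key.encode()) % num_reducers

# All occurrences of a key go to one reducer:
assert partition("spi", 4) == partition("spi", 4)
```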
Combine
When the map operation outputs its pairs, they are first held in memory. For performance and efficiency it is sometimes useful to provide a combiner, which performs a reduce-like function on the map side. If a combiner is used, the map key-value pairs are not written to the output immediately; instead they are collected in lists, one list per key. Once a certain number of pairs has been written, this part of the buffer is flushed through the combiner: the list of values for each key is passed to the combiner's reduce method, which then outputs pairs of the same form as the original map output.
For example, in Hadoop's word-count program the map operation outputs (word, 1) pairs, and a combiner can be used to speed up the counting. The combine operation collects and processes the lists in memory, one list per word. When a certain number of pairs has been buffered, it invokes the reduce method of the combiner with each unique word as the key and an iterator over its values, and outputs (word, count-in-this-part-of-the-input) pairs. From the reduce operation's point of view the combiner output carries the same information as the raw map output, but far less data is written to, and read back from, disk.
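The word-count flow with a combiner can be sketched in Python (the buffering and disk I/O are omitted; the function names are ours): the map emits (word, 1), the combiner pre-sums within one map's output, and the reducer sums the partial counts:

```python
from collections import defaultdict

def map_phase(text):
    # Emit (word, 1) for every word, as the word-count mapper does.
    return [(word, 1) for word in text.split()]

def combine(pairs):
    # Pre-aggregate one map task's output:
    # (word, count-in-this-part-of-the-input).
    partial = defaultdict(int)
    for word, n in pairs:
        partial[word] += n
    return list(partial.items())

def reduce_phase(all_pairs):
    # Same summing logic, applied to the (already smaller) combiner output.
    totals = defaultdict(int)
    for word, n in all_pairs:
        totals[word] += n
    return dict(totals)

part1 = combine(map_phase("spi conn spi"))
part2 = combine(map_phase("conn SA"))
print(reduce_phase(part1 + part2))  # {'spi': 2, 'conn': 2, 'SA': 1}
```

The reducer sees fewer pairs (4 instead of 5 here; the gap grows with repeated words), which is exactly the disk and network saving the combiner provides.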
Reduce
When a reduce task starts, its input is scattered across the cluster nodes in the map output files. In distributed mode, these first need to be copied to the local filesystem in a copy step. Details can be found in the ReduceTaskRunner class.
Once all of the data is available locally, it is appended into a single file in an append step. That file is then merge-sorted so that the pairs for a given key sit together (the sort step). This makes the actual reduce operation simple: the file is read sequentially, and the values from the input file are passed to the reduce method via an iterator until the next key is reached. Details can be found in the ReduceTask class.
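The sort-then-iterate step maps naturally onto itertools.groupby in Python (a sketch of the logic, not of Hadoop's on-disk merge): once the pairs are sorted by key, the reduce function sees each key once with an iterator over its values:

```python
from itertools import groupby
from operator import itemgetter

# Merged map output after the copy and append steps (still unsorted).
pairs = [("conn", 1), ("spi", 1), ("conn", 1), ("SA", 1), ("spi", 1)]

# Sort step: bring all pairs for a key together.
pairs.sort(key=itemgetter(0))

# Reduce step: each key is seen once, with an iterator over its values.
results = {key: sum(v for _, v in group)
           for key, group in groupby(pairs, key=itemgetter(0))}
print(results)  # {'SA': 1, 'conn': 2, 'spi': 2}
```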
Finally, the output consists of one output file per reduce task. Their format can be specified with JobConf.setOutputFormat; if that is used, the output key and value classes must also be specified.

Wednesday, July 2, 2014

How do Hadoop developers submit their MapReduce jobs to a cluster?


Using the Jumbune tool, a Hadoop developer can perform the following tasks:

  1. MR Job Profiling 
  2. HDFS Data Validator 
  3. MR Job flow Debugger


Using the Jumbune tool, a Hadoop admin can perform the following tasks:

  1. Hadoop cluster Monitoring
  2. MR Job profiling

These tasks are easy to learn if you are already familiar with MapReduce programming and job submission to a Hadoop cluster.

I have found the YouTube videos linked below very interesting to learn from and apply.

Developer Videos:

MapReduce Job Profiling - Developer
https://www.youtube.com/watch?v=vkiiEnJDMi0

Jumbune HDFS Validation - Developer
https://www.youtube.com/watch?v=1CXjfNBSL_s

DataValidator  - Developer
https://www.youtube.com/watch?v=kBIP4BVIjRQ

Jumbune MapReduce Flow Debugger - Developer
https://www.youtube.com/watch?v=3T4xWtIiS_Q

Admin Videos:

Jumbune Cluster Monitoring - Administrator
https://www.youtube.com/watch?v=Q8ToDsGRBIE

Jumbune MapReduce Profiler - Developer, Administrator
https://www.youtube.com/watch?v=BRt2sHu5804