Q1: Which service is responsible for writing the data into datanodes, is it the Client that contacts each Datanode for writing the data or the Namenode?
Ans: The client writes the data, not the NameNode. The client asks the NameNode for a list of DataNodes, and that list forms the write pipeline. The client's DataStreamer consumes an internal "data queue" and streams packets to the nearest (first) DataNode, which forwards them to the next DataNode in the pipeline, and so on.
Q2: If a Namenode crashes while data processing is in progress, after recovery will it be able to provide entire data or no data?
Ans: It depends on how much metadata had already been checkpointed into the fsimage held by the Secondary NameNode; edits made after the last checkpoint may be lost.
Q3: How do you achieve NameNode high availability?
Ans: HDFS (not YARN) provides a mechanism to configure multiple NameNodes.
The HDFS HA feature addresses the NameNode single-point-of-failure problem by providing the option of running two NameNodes in the same cluster, in an Active/Passive configuration. These are referred to as the Active NameNode and the Standby NameNode. Unlike the Secondary NameNode, the Standby NameNode is a hot standby, allowing a fast failover to a new NameNode in the case that a machine crashes, or a graceful administrator-initiated failover for planned maintenance. You cannot have more than two NameNodes (in Hadoop 2; Hadoop 3 relaxes this and allows multiple standbys).
Q4: Where does sorting and shuffling occur in MapReduce? Mapper or Reducer?
Ans: Sorting happens on both sides: the map output is sorted by key before it is written, and the reduce side merge-sorts the fetched map outputs. Shuffling is the transfer of map output partitions to the reducers, so it begins with the map output and ends in the reducer's copy phase.
Q5: What is secondary sort in MapReduce?
Ans: After the map tasks finish, the intermediate data is written to the local file system as key-value pairs sorted by key, but the values for each key are not in sorted order. When the values associated with each key are also required to reach the reducer in sorted order, arranging that is called a 'secondary sort'.
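The usual implementation builds a composite key of (natural key, value) so the framework's sort phase orders the values too, together with a grouping comparator and partitioner that look only at the natural key. A plain-Java sketch of the composite-key ordering idea (class and method names are illustrative, not the Hadoop API):

```java
import java.util.*;

public class SecondarySortDemo {
    // Composite key: the natural key plus the value we also want sorted.
    record CompositeKey(String naturalKey, int value) {}

    // Stand-in for the framework's sort phase: sorting on the whole
    // composite key delivers values pre-sorted within each natural key.
    static List<CompositeKey> sortedOutput(List<CompositeKey> mapOutput) {
        List<CompositeKey> sorted = new ArrayList<>(mapOutput);
        sorted.sort(Comparator.comparing(CompositeKey::naturalKey)
                              .thenComparingInt(CompositeKey::value));
        return sorted;
    }

    public static void main(String[] args) {
        // Values 9 and 1 under key "a" come out as 1 then 9.
        System.out.println(sortedOutput(List.of(
            new CompositeKey("b", 3), new CompositeKey("a", 9),
            new CompositeKey("a", 1), new CompositeKey("b", 2))));
    }
}
```

In a real job the same effect is obtained by emitting the composite key from the mapper and setting a grouping comparator on the natural key, so one reduce() call still sees all values of a natural key.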
Q6: Will the size of input split depend on block size?
Ans: An InputSplit can be larger or smaller than the block size.
size of input split = max( min split size, min( max split size, DFS block size ) )
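This formula can be checked with a plain-Java sketch (the method name mirrors, but is not, Hadoop's `FileInputFormat.computeSplitSize`):

```java
// splitSize = max(minSize, min(maxSize, blockSize))
public class SplitSizeDemo {
    static long computeSplitSize(long minSize, long maxSize, long blockSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long block = 64L * 1024 * 1024;  // 64 MB default block size
        // Defaults (minSize = 1, maxSize = Long.MAX_VALUE): split == block size
        System.out.println(computeSplitSize(1L, Long.MAX_VALUE, block));
        // Raising minSize above the block size makes the split larger than a block
        System.out.println(computeSplitSize(128L * 1024 * 1024, Long.MAX_VALUE, block));
    }
}
```

Lowering the max split size below the block size likewise yields splits smaller than a block.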
Q7: The input file contains one single line of 100 MB, stored in HDFS with the default block size set to 64 MB. How will you process it using MapReduce?
Ans: The file is stored as two blocks (64 MB and 36 MB), giving two input splits and two map tasks. Because the whole file is a single line (a single record), the record reader of the first split does not stop at the block boundary: it continues reading, remotely if necessary, into the second block until the end of the line, so the first mapper processes the entire 100 MB record while the second mapper finds no record starting in its split. The map output is written in key-value format to the local file system and copied to the reducer location; the reduce method is invoked only after all map outputs have been copied, and it then writes its output back to HDFS.
Q8: Can you redirect the output key values of mapper to a specific reducer?
Ans: Yes, by writing a custom Partitioner.
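Hadoop's default HashPartitioner routes a key with `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`; a custom partitioner simply overrides that mapping. A plain-Java sketch of the routing logic (class, method names, and the "ERROR" rule are illustrative):

```java
// Sketch of partitioner routing. In a real job you would extend
// org.apache.hadoop.mapreduce.Partitioner and register it with
// job.setPartitionerClass(...).
public class PartitionerDemo {
    // Default HashPartitioner behaviour: spread keys evenly by hash.
    static int hashPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Custom rule: force all "ERROR" keys to reducer 0, hash the rest
    // across the remaining reducers.
    static int customPartition(String key, int numReduceTasks) {
        if (key.startsWith("ERROR")) return 0;
        return 1 + hashPartition(key, numReduceTasks - 1);
    }

    public static void main(String[] args) {
        System.out.println(customPartition("ERROR-io", 4));   // always reducer 0
        System.out.println(customPartition("INFO-start", 4)); // one of 1..3
    }
}
```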
Q9: What is structured data and unstructured data? A csv file , data copied from a Database are structured or unstructured data?
Ans: A CSV file is structured data; data copied from a database is also structured, since it was given a schema when it was originally written into the database.
Q10: How do you process a real time data generated at a constant rate and volume using Hadoop?
Ans: For batch-style ingestion, we can schedule Sqoop imports via cron jobs (for example, once a day) to pull data from the source systems into the cluster; keeping the number of imports per day low is recommended. Analysis over each day's newly appended data then gives updated reports/results. For a truly constant-rate, real-time stream, Flume is the better fit.
Q11: What will happen if you try to Load data into non-existing partition of a partitioned hive table?
Ans: Exception
Q12: What will happen if you try to Load data into non-existing bucket of a clustered hive table?
Ans: Exception
Q13: What are different types of joins supported by hive?
Ans: Inner join, left outer join, right outer join, full outer join, left semi join, map-side join (map join), etc.
Q14: The input log file contains groups of lines, each group forming the stack trace of a given exception. How do you implement the MapReduce job such that a single mapper processes an exception together with the group of lines that form its stack trace?
Ans: Implement a custom FileInputFormat and override isSplitable() to return false, which feeds the whole file as input to a single map task, so a stack trace is never split across mappers. The number of mappers is then one per file.
Q15: The hive table contains:
UID SessionID MsgID Date
For a given data there can be multiple SessionIDs for e.g.:
123 Session123 abc123 26-Aug-2014
123 Session123 abc123 26-Aug-2014
123 Session124 abc123 26-Aug-2014
123 Session125 abc123 26-Aug-2014
How do you clean this data to eradicate the duplicates, such that each tuple is unique?
Ans: Partitioning the table on the SessionID column helps organize the data, but it does not by itself remove duplicates (rows 1 and 2 above are identical). To eradicate them, rewrite the table with something like INSERT OVERWRITE TABLE t SELECT DISTINCT * FROM t; (or an equivalent GROUP BY over all columns), so that each tuple is unique.
Q16: A large log file contains different message types, each message should be parsed into different files, which of the following best suits this requirement?
1.) MapReduce job with number of Reduce tasks set to number of message types
2.) MapReduce job with enums defined for each message type.
3.) MapReduce job set with only Mapper tasks.
4.) MapReduce job set with Partitioner.
Ans. 4.
Q17: A 1 GB input file is copied into HDFS; how many mappers will be invoked with default configurations on this file?
1.) 24 Mappers 2.) 10 Mappers 3.) 16 Mappers 4.) 1 Mapper
Ans. 16 mappers: with the default block size of 64 MB, the 1 GB (1024 MB) file is split into 1024 / 64 = 16 blocks, and one mapper is invoked per split.
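The count is just the ceiling of file size over split size, which a one-method sketch makes explicit (names are illustrative):

```java
public class MapperCountDemo {
    // Number of map tasks = number of input splits = ceil(fileSize / splitSize)
    static long mapperCount(long fileSizeBytes, long splitSizeBytes) {
        return (fileSizeBytes + splitSizeBytes - 1) / splitSizeBytes;
    }

    public static void main(String[] args) {
        long gb = 1024L * 1024 * 1024;      // 1 GB file
        long block = 64L * 1024 * 1024;     // 64 MB default split/block size
        System.out.println(mapperCount(gb, block)); // 16
    }
}
```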
Q18: Which one is recommended for Hadoop :
1.) Large files spread on many nodes.
2.) Many small files spread on multiple nodes.
3.) Archived small files (HAR) spread on multiple nodes.
4.) 1 & 3
5.) None.
Ans. 4
Q19: Default Schedulers in Hadoop
1.) Fair Scheduler 2.) FIFO Scheduler 3.) Capacity Scheduler 4.) None.
Ans. 2
Q20: Can a Single Node cluster have more than 1 replication factor set
1.) Yes 2.) No
Ans. Yes, the replication factor can be set higher than 1 on a single-node cluster, but only one physical copy of each block is actually stored, and the blocks are reported as under-replicated.
Q21: Which Hadoop eco-system will you recommend for reading data from different sources into HDFS?
1.) Sqoop 2.) Flume 3.) MapReduce 4.) None.
Ans. Flume: we can define various sources of the relevant type (exec, spooling directory, etc.) and direct them through channels to a sink such as HDFS.
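A minimal Flume agent sketch of such a pipeline (the agent name, log path, and HDFS path are illustrative), tailing a log file through a memory channel into an HDFS sink:

```properties
# Name the components of agent a1
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Exec source: tail a local log file
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app.log
a1.sources.r1.channels = c1

# In-memory channel buffering events between source and sink
a1.channels.c1.type = memory

# HDFS sink: write events into HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events
a1.sinks.k1.channel = c1
```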
Q22: Which class will load data between Hive and HBase?
Ans: HBaseStorageHandler
Q23:
1. Is it possible to have more than one Mapper class defined for the MapReduce job?
2. Can we supply more than one input file for the MapReduce job?
Ans: Yes to both questions. Consider a scenario where the MapReduce job processes multiple large log files, each with its own format: we define a different Mapper class for each input, each containing the logic to produce key-value pairs for its log format.
The driver class then wires them up with the MultipleInputs class:
MultipleInputs.addInputPath(job, filepath, InputFormat.class, Mapper.class);
Q24. Why does my reducer show x % of the job started while the mappers are still executing?
Ans: The reduce method can only run after all mappers finish; the percentage shown is the progress of the copy (shuffle) phase, in which the reducer fetches completed map outputs, not of the reduce operation itself.
Q25.What is SAFE MODE and what are the possible scenarios NameNode can get into SAFE MODE?
Ans. SAFE MODE is the state in which the NameNode has not yet received the prescribed fraction of block reports (99.9 % by default) from the DataNodes. Possible scenarios: a DataNode is corrupted or down, or a newly added DataNode has a configuration mismatch.
Q26: Is it mandatory to have SSH from the NameNode to the DataNodes, and vice versa?
Ans. In my experience, SSH is only needed by Hadoop's start/stop scripts to run remote commands from the master node on the DataNodes (slaves); a reverse SSH from the slaves to the master is not mandatory.
Q27: Is the Secondary NameNode acts as a back up of NameNode?
Ans: No. The Secondary NameNode (SNN) is only responsible for checkpointing the NameNode's fsimage and edits files; it does not act as a backup for the NameNode (NN). In fact, the NN is a single point of failure in Gen 1 Hadoop; Hadoop 2 (via the HDFS HA feature) provides a way to configure an Active/Standby pair of NameNodes.