Monday, July 7, 2014

MapReduce Implementation in Detail

Introduction:
This document describes how the map and reduce operations are actually carried out in Hadoop. If you are not familiar with Google's MapReduce programming model, please refer to the MapReduce paper: http://labs.google.com/papers/mapreduce.html
Map
Since map operates in parallel over the set of input files, its first step is to split the files into FileSplits. If a single file is so large that it would hurt seek efficiency, it is divided into several pieces. Note that this splitting step knows nothing about the input file's internal logical structure; for example, a line-oriented text file is split at arbitrary byte boundaries. You can supply this splitting logic yourself, or use one of the simple splitters Hadoop already defines. Each file split then corresponds to a new map task.
When an individual map task starts, it opens a new output writer for each configured reduce task. It then uses the RecordReader obtained from the job's configured InputFormat to read its file split. The InputFormat class parses the input and generates key-value pairs. The InputFormat must also handle records that cross the split boundary. For example, TextInputFormat reads past the split's last boundary to finish the final line of the split, and if the split being read is not the first one in the file, TextInputFormat ignores the content before the first line break.
An InputFormat is not required to produce meaningful keys. For example, the default TextInputFormat output uses a line of input text as the value and the line's byte offset in the file as the key; most applications use only the line and rarely use the offset.
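As a rough illustration of the record-reading step above, the following self-contained sketch (not the real Hadoop TextInputFormat or RecordReader) mimics how a line reader turns raw text into (byte offset, line) pairs:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: stands in for TextInputFormat's behavior of producing a
// LongWritable key (the line's byte offset) and a Text value (the line).
class LineRecords {
    // Returns "offset\tline" entries for each line of the input text.
    static List<String> toRecords(String data) {
        List<String> records = new ArrayList<>();
        int offset = 0;
        for (String line : data.split("\n")) {
            records.add(offset + "\t" + line);
            offset += line.length() + 1; // +1 for the newline byte
        }
        return records;
    }
}
```

The key here carries no application meaning, which is exactly why most mappers ignore it.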
The keys and values read by the RecordReader are passed to the user-configured Mapper. The user-supplied Mapper class may perform any operation on the key and value, and then call the OutputCollector.collect method to emit key-value pairs of its own definition. The output it generates must use one key class and one value class, because the map output is written to disk as a SequenceFile, which stores per-file type information and requires all records to have the same shape (if you want to output different data structures, you can subclass them from a common type). The map's input and output key types need not be related.
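To make the Mapper/OutputCollector contract concrete, here is a toy word-count map step; the list stands in for OutputCollector.collect, and this is a sketch rather than the real Hadoop API:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a word-count map step: for each word in the input line,
// the mapper "collects" the pair (word, 1). Appending "word\t1" to a
// list stands in for calling OutputCollector.collect(word, 1).
class MapSketch {
    static List<String> map(String line) {
        List<String> out = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) out.add(word + "\t1"); // collect(word, 1)
        }
        return out;
    }
}
```

Note that the input key (the byte offset) is simply ignored, and the output key type (a word) has nothing to do with the input key type.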
When the Mapper output is collected, it is partitioned and written to the output files in the manner specified by the Partitioner class. The default is HashPartitioner, which uses the hash code of the key to choose a partition (the key class should therefore have a good hash function, so that the load is evenly balanced across the reduce tasks). See the MapTask class for details. N input files generate M map tasks to run, and each map task produces as many output files as there are configured reduce tasks. Each output file is destined for one specific reduce task, so the pairs for a given key produced by all the map tasks end up at the same reduce task, where all the pairs with that key are processed together.
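The default partitioning rule described above is small enough to show in full. The sketch below reproduces its logic outside of Hadoop (the real class lives in the org.apache.hadoop.mapred.lib package): mask the key's hash code to a non-negative value, then take it modulo the number of reduce tasks.

```java
// Sketch of HashPartitioner's assignment rule: the same key always maps
// to the same reduce task, and keys spread across partitions according
// to their hashCode.
class HashPartitionSketch {
    static int getPartition(Object key, int numReduceTasks) {
        // Mask off the sign bit so the result of % is non-negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

Because equal keys have equal hash codes, every map task routes a given key to the same reduce task regardless of which node it runs on, which is what makes the per-key grouping at reduce time possible.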
Combine
When the map operation outputs its pairs, they initially exist in memory. For performance and efficiency reasons, it is sometimes useful to provide a combiner, which acts like a local reduce function. If a combiner is configured, the map output keys are not written immediately; instead they are collected in lists, one list per key. When a certain number of key-value pairs has been buffered, this part of the buffer is passed to the combiner: the combiner's reduce method is called once per key with all of that key's buffered values, and it emits output with the same key and value types as the original map output.
For example, in Hadoop's word count program, the map operation outputs (word, 1) key-value pairs, and a combiner can be used to speed up the counting. The combine operation collects and processes the lists in memory, one list per word. When a certain number of key-value pairs has accumulated, the combining reduce is invoked for each unique word as key, with the values as a list iterator, and the combiner outputs (word, count-in-this-part-of-the-input) pairs. From the reduce operation's point of view, the combined output contains the same information as the raw map output, but there is far less of it to write to and read from disk.
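The word-count combine step above can be sketched as follows; this is a simplified stand-in for a Hadoop combiner, not the real API, with a plain map replacing the per-key value lists:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy illustration of what a word-count combiner buys: the map side has
// emitted one (word, 1) pair per word, and the combiner collapses them
// into one (word, local count) pair per word before anything is written
// out for the reducers.
class CombineSketch {
    static Map<String, Integer> combine(List<String> words) {
        Map<String, Integer> partial = new TreeMap<>();
        for (String w : words) {
            partial.merge(w, 1, Integer::sum); // sum this word's local 1s
        }
        return partial;
    }
}
```

The reducers later sum these partial counts exactly as they would have summed the raw 1s, so the final result is unchanged; only the volume of intermediate data shrinks.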
Reduce
When a reduce task starts, its input is scattered across the cluster's nodes in the map output files. In distributed mode, these first need to be copied to the local filesystem in a copy step. See the ReduceTaskRunner class for details.
Once all of the data is available locally, it is appended into a single file in an append step. That file is then merge-sorted so that pairs with the same key are adjacent (the sort step). This makes the actual reduce operation simple: the file is read sequentially, and the values are passed to the reduce method with an iterator that reads the input file until the next key is reached. See the ReduceTask class for details.
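A minimal sketch of the sort-and-group behavior described above (not Hadoop's actual merge-sort implementation) looks like this:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of the reduce-side sort/group step: copied map outputs are
// sorted by key so that all values for one key are adjacent, then each
// key's values are handed to the reduce function as one group.
class SortAndGroup {
    // pairs are "key\tvalue" strings; the result maps each key to the
    // list of values that a reduce call would receive via its iterator.
    static Map<String, List<String>> group(List<String> pairs) {
        List<String> sorted = new ArrayList<>(pairs);
        Collections.sort(sorted); // the "sort" phase
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String p : sorted) {
            String[] kv = p.split("\t", 2);
            groups.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        }
        return groups;
    }
}
```

Because sorting puts equal keys next to each other, the real implementation never needs the whole group in memory at once; the iterator simply stops when it sees the next key.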
Finally, the output consists of one output file per reduce task. Their format can be specified with JobConf.setOutputFormat; if that is used, the output key and value classes must also be specified.
