Interface Summary | |
---|---|
InputFormat<K,V> | Deprecated. Use org.apache.hadoop.mapreduce.InputFormat instead. |
InputSplit | Deprecated. Use org.apache.hadoop.mapreduce.InputSplit instead. |
JobConfigurable | Deprecated. |
JobContext | Deprecated. Use org.apache.hadoop.mapreduce.JobContext instead. |
Mapper<K1,V1,K2,V2> | Deprecated. Use org.apache.hadoop.mapreduce.Mapper instead. |
MapRunnable<K1,V1,K2,V2> | Deprecated. Use org.apache.hadoop.mapreduce.Mapper instead. |
OutputCollector<K,V> | Collects the <key, value> pairs output by Mappers and Reducers. |
OutputFormat<K,V> | Deprecated. Use org.apache.hadoop.mapreduce.OutputFormat instead. |
Partitioner<K2,V2> | Deprecated. Use org.apache.hadoop.mapreduce.Partitioner instead. |
RecordReader<K,V> | RecordReader reads <key, value> pairs from an InputSplit. |
RecordWriter<K,V> | RecordWriter writes the output <key, value> pairs to an output file. |
Reducer<K2,V2,K3,V3> | Deprecated. Use org.apache.hadoop.mapreduce.Reducer instead. |
Reporter | A facility for Map-Reduce applications to report progress and update counters, status information, etc. |
RunningJob | Deprecated. Use Job instead. |
SequenceFileInputFilter.Filter | Filter interface. |
TaskAttemptContext | Deprecated. Use org.apache.hadoop.mapreduce.TaskAttemptContext instead. |
Class Summary | |
---|---|
ClusterStatus | Deprecated. Use ClusterMetrics or TaskTrackerInfo instead. |
ClusterStatus.BlackListInfo | Class which encapsulates information about a blacklisted tasktracker. |
Counters | Deprecated. Use org.apache.hadoop.mapreduce.Counters instead. |
Counters.Counter | A counter record, comprising its name and value. |
Counters.Group | Group of counters, comprising counters from a particular counter Enum class. |
FileInputFormat<K,V> | Deprecated. Use org.apache.hadoop.mapreduce.lib.input.FileInputFormat instead. |
FileOutputCommitter | An OutputCommitter that commits files specified in the job output directory. |
FileOutputFormat<K,V> | A base class for OutputFormat. |
FileSplit | Deprecated. Use org.apache.hadoop.mapreduce.lib.input.FileSplit instead. |
ID | Deprecated. |
IsolationRunner | IsolationRunner is intended to facilitate debugging by re-running a specific task, given left-over task files for a (typically failed) past job. |
JobClient | Deprecated. Use Job and Cluster instead. |
JobConf | Deprecated. Use Configuration instead. |
JobID | Deprecated. |
JobQueueInfo | Deprecated. Use QueueInfo instead. |
JobStatus | Deprecated. Use org.apache.hadoop.mapreduce.JobStatus instead. |
KeyValueLineRecordReader | Deprecated. Use org.apache.hadoop.mapreduce.lib.input.KeyValueLineRecordReader instead. |
KeyValueTextInputFormat | Deprecated. Use org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat instead. |
LineRecordReader.LineReader | Deprecated. Use LineReader instead. |
MapFileOutputFormat | Deprecated. Use org.apache.hadoop.mapreduce.lib.output.MapFileOutputFormat instead. |
MapReduceBase | Deprecated. |
MapRunner<K1,V1,K2,V2> | Default MapRunnable implementation. |
MultiFileInputFormat<K,V> | Deprecated. Use CombineFileInputFormat instead. |
MultiFileSplit | Deprecated. Use CombineFileSplit instead. |
OutputCommitter | Deprecated. Use org.apache.hadoop.mapreduce.OutputCommitter instead. |
OutputLogFilter | Deprecated. Use Utils.OutputFileUtils.OutputLogFilter instead. |
SequenceFileAsBinaryInputFormat | Deprecated. Use org.apache.hadoop.mapreduce.lib.input.SequenceFileAsBinaryInputFormat instead. |
SequenceFileAsBinaryInputFormat.SequenceFileAsBinaryRecordReader | Read records from a SequenceFile as binary (raw) bytes. |
SequenceFileAsBinaryOutputFormat | Deprecated. Use org.apache.hadoop.mapreduce.lib.output.SequenceFileAsBinaryOutputFormat instead. |
SequenceFileAsBinaryOutputFormat.WritableValueBytes | Inner class used for appendRaw. |
SequenceFileAsTextInputFormat | Deprecated. Use org.apache.hadoop.mapreduce.lib.input.SequenceFileAsTextInputFormat instead. |
SequenceFileAsTextRecordReader | Deprecated. Use org.apache.hadoop.mapreduce.lib.input.SequenceFileAsTextRecordReader instead. |
SequenceFileInputFilter<K,V> | Deprecated. Use org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFilter instead. |
SequenceFileInputFilter.FilterBase | Base class for Filters. |
SequenceFileInputFilter.MD5Filter | This class returns a set of records by examining the MD5 digest of each key against a filtering frequency f. |
SequenceFileInputFilter.PercentFilter | This class returns a percentage of records; the percentage is determined by a filtering frequency f using the criterion record# % f == 0. |
SequenceFileInputFilter.RegexFilter | Filters records by matching the key against a regex. |
SequenceFileInputFormat<K,V> | Deprecated. Use org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat instead. |
SequenceFileOutputFormat<K,V> | Deprecated. Use org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat instead. |
SequenceFileRecordReader<K,V> | A RecordReader for SequenceFiles. |
SkipBadRecords | Utility class for the skip-bad-records functionality. |
TaskAttemptID | Deprecated. |
TaskCompletionEvent | Deprecated. Use org.apache.hadoop.mapreduce.TaskCompletionEvent instead. |
TaskID | Deprecated. |
TaskLog.Reader | |
TaskLogAppender | A simple log4j appender for the task child's map-reduce system logs. |
TaskReport | Deprecated. Use org.apache.hadoop.mapreduce.TaskReport instead. |
TextInputFormat | Deprecated. Use org.apache.hadoop.mapreduce.lib.input.TextInputFormat instead. |
TextOutputFormat<K,V> | Deprecated. Use org.apache.hadoop.mapreduce.lib.output.TextOutputFormat instead. |
TextOutputFormat.LineRecordWriter<K,V> | |
Utils | A utility class. |
Utils.OutputFileUtils | |
Utils.OutputFileUtils.OutputFilesFilter | This class filters output (part) files from the given directory; it does not accept files named _logs or _SUCCESS. |
Utils.OutputFileUtils.OutputLogFilter | This class filters log files from the given directory; it does not accept paths containing _logs. |
Enum Summary | |
---|---|
JobClient.TaskStatusFilter | |
JobPriority | Deprecated. Use org.apache.hadoop.mapreduce.JobPriority instead. |
TaskCompletionEvent.Status | |
Exception Summary | |
---|---|
FileAlreadyExistsException | Deprecated. |
InvalidFileTypeException | Used when the file type differs from the desired file type. |
InvalidInputException | This class wraps a list of problems with the input, so that the user can get the full list of problems together instead of finding and fixing them one by one. |
InvalidJobConfException | This exception is thrown when the jobconf misses some mandatory attributes or the value of some attributes is invalid. |
A software framework for easily writing applications that process vast amounts of data (multi-terabyte data-sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
A Map-Reduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner, followed by the reduce tasks, which aggregate the maps' output. Typically both the input and the output of the job are stored in a FileSystem. The framework takes care of monitoring tasks and re-executing failed ones. Since the compute nodes and the storage nodes are usually the same, i.e. Hadoop's Map-Reduce framework and Distributed FileSystem run on the same set of nodes, tasks are effectively scheduled on the nodes where the data is already present, resulting in very high aggregate bandwidth across the cluster.
The Map-Reduce framework operates exclusively on <key, value> pairs, i.e. the input to the job is viewed as a set of <key, value> pairs and the output as another, possibly different, set of <key, value> pairs. The keys and values have to be serializable as Writables and, additionally, the keys have to be WritableComparables in order to facilitate grouping by the framework.
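As an illustration of that requirement, here is a minimal sketch of a custom key type implementing WritableComparable (the class and field names are hypothetical, not part of this package):

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// A hypothetical key type holding a single long timestamp.
public class TimestampKey implements WritableComparable<TimestampKey> {
  private long timestamp;

  public TimestampKey() {}  // no-arg constructor required for deserialization

  public TimestampKey(long timestamp) { this.timestamp = timestamp; }

  public void write(DataOutput out) throws IOException {
    out.writeLong(timestamp);  // serialize the fields
  }

  public void readFields(DataInput in) throws IOException {
    timestamp = in.readLong();  // deserialize in the same order
  }

  public int compareTo(TimestampKey other) {
    // defines the sort order the framework uses to group keys
    if (timestamp < other.timestamp) return -1;
    if (timestamp > other.timestamp) return 1;
    return 0;
  }

  public int hashCode() {  // used by the default hash partitioner
    return (int) (timestamp ^ (timestamp >>> 32));
  }

  public boolean equals(Object o) {
    return o instanceof TimestampKey && ((TimestampKey) o).timestamp == timestamp;
  }
}
```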
Data flow:

```
(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
```
Applications typically implement the Mapper.map(Object, Object, OutputCollector, Reporter) and Reducer.reduce(Object, Iterator, OutputCollector, Reporter) methods. The application writer also specifies various facets of the job, such as the input and output locations and the Partitioner, InputFormat and OutputFormat implementations to be used, via a JobConf. The client program, JobClient, then submits the job to the framework and optionally monitors it.
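With this API, submission and optional monitoring can be done blockingly via JobClient.runJob, or explicitly as in the minimal sketch below (the class and method names are illustrative; it assumes a fully configured JobConf):

```java
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class SubmitAndMonitor {
  // Submits a fully configured job, then polls it to completion.
  public static void submitAndMonitor(JobConf conf) throws Exception {
    JobClient client = new JobClient(conf);
    RunningJob job = client.submitJob(conf);  // non-blocking submission
    while (!job.isComplete()) {
      System.out.printf("map %.0f%% reduce %.0f%%%n",
          job.mapProgress() * 100, job.reduceProgress() * 100);
      Thread.sleep(5000);
    }
    if (!job.isSuccessful()) {
      throw new RuntimeException("Job failed: " + job.getID());
    }
    // Alternatively, JobClient.runJob(conf) does the submit-and-poll in one call.
  }
}
```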
The framework spawns one map task per InputSplit generated by the InputFormat of the job and calls Mapper.map(Object, Object, OutputCollector, Reporter) with each <key, value> pair read by the RecordReader from the InputSplit for the task. The intermediate outputs of the maps are then grouped by key and optionally aggregated by the combiner. The key space of the intermediate outputs is partitioned by the Partitioner, where the number of partitions is exactly the number of reduce tasks for the job. The reduce tasks fetch the sorted intermediate outputs of the maps over HTTP, merge the <key, value> pairs, and call Reducer.reduce(Object, Iterator, OutputCollector, Reporter) for each <key, list of values> pair. The output of the reduce tasks is stored on the FileSystem by the RecordWriter provided by the OutputFormat of the job.
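A custom Partitioner only has to map each key to a partition index in [0, numPartitions). A minimal sketch of this interface (the class name and routing rule are illustrative):

```java
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// A hypothetical partitioner that routes keys by their first character,
// so all keys with the same leading character go to the same reducer.
public class FirstCharPartitioner implements Partitioner<Text, LongWritable> {

  public void configure(JobConf job) {
    // no configuration needed for this example
  }

  public int getPartition(Text key, LongWritable value, int numPartitions) {
    String s = key.toString();
    int c = s.isEmpty() ? 0 : s.charAt(0);
    return (c & Integer.MAX_VALUE) % numPartitions;  // non-negative index
  }
}
```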
Example Map-Reduce application to perform a distributed grep:

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Grep extends Configured implements Tool {

  // map: search each input line for the pattern specified by
  // 'grep.mapper.regex' and emit the match group specified by
  // 'grep.mapper.regex.group'
  public static class GrepMapper<K> extends MapReduceBase
      implements Mapper<K, Text, Text, LongWritable> {

    private Pattern pattern;
    private int group;

    public void configure(JobConf job) {
      pattern = Pattern.compile(job.get("grep.mapper.regex"));
      group = job.getInt("grep.mapper.regex.group", 0);
    }

    public void map(K key, Text value,
                    OutputCollector<Text, LongWritable> output,
                    Reporter reporter) throws IOException {
      String text = value.toString();
      Matcher matcher = pattern.matcher(text);
      while (matcher.find()) {
        output.collect(new Text(matcher.group(group)), new LongWritable(1));
      }
    }
  }

  // reduce: count the number of occurrences of the pattern
  public static class GrepReducer<K> extends MapReduceBase
      implements Reducer<K, LongWritable, K, LongWritable> {

    public void reduce(K key, Iterator<LongWritable> values,
                       OutputCollector<K, LongWritable> output,
                       Reporter reporter) throws IOException {
      // sum all values for this key
      long sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new LongWritable(sum));
    }
  }

  public int run(String[] args) throws Exception {
    if (args.length < 3) {
      System.out.println("Grep <inDir> <outDir> <regex> [<group>]");
      ToolRunner.printGenericCommandUsage(System.out);
      return -1;
    }

    JobConf grepJob = new JobConf(getConf(), Grep.class);
    grepJob.setJobName("grep");

    FileInputFormat.setInputPaths(grepJob, new Path(args[0]));
    FileOutputFormat.setOutputPath(grepJob, new Path(args[1]));

    grepJob.setMapperClass(GrepMapper.class);
    grepJob.setCombinerClass(GrepReducer.class);
    grepJob.setReducerClass(GrepReducer.class);

    // pass the regex (and optional group) to the mapper, using the same
    // keys that GrepMapper.configure() reads
    grepJob.set("grep.mapper.regex", args[2]);
    if (args.length == 4) {
      grepJob.set("grep.mapper.regex.group", args[3]);
    }

    grepJob.setOutputFormat(SequenceFileOutputFormat.class);
    grepJob.setOutputKeyClass(Text.class);
    grepJob.setOutputValueClass(LongWritable.class);

    JobClient.runJob(grepJob);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new Configuration(), new Grep(), args);
    System.exit(res);
  }
}
```
Notice how the data flow of the above grep job is very similar to doing the same via the unix pipeline:

```
cat input/* | grep | sort    | uniq -c > out
    input   | map  | shuffle | reduce  > out
```
Hadoop Map-Reduce applications need not be written in Java™ only. Hadoop Streaming is a utility which allows users to create and run jobs with any executables (e.g. shell utilities) as the mapper and/or the reducer. Hadoop Pipes is a SWIG-compatible C++ API to implement Map-Reduce applications (non-JNI™ based).
See Google's original Map/Reduce paper for background information.
Java and JNI are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries.