Package org.apache.hadoop.mapred

A software framework for easily writing applications that process vast amounts of data (multi-terabyte data-sets) in parallel on large clusters (thousands of nodes) built of commodity hardware in a reliable, fault-tolerant manner.


Interface Summary
InputFormat<K,V> Deprecated. Use org.apache.hadoop.mapreduce.InputFormat instead.
InputSplit Deprecated. Use org.apache.hadoop.mapreduce.InputSplit instead.
JobConfigurable Deprecated.
JobContext Deprecated. Use org.apache.hadoop.mapreduce.JobContext instead.
Mapper<K1,V1,K2,V2> Deprecated. Use org.apache.hadoop.mapreduce.Mapper instead.
MapRunnable<K1,V1,K2,V2> Deprecated. Use org.apache.hadoop.mapreduce.Mapper instead.
OutputCollector<K,V> Collects the <key, value> pairs output by Mappers and Reducers.
OutputFormat<K,V> Deprecated. Use org.apache.hadoop.mapreduce.OutputFormat instead.
Partitioner<K2,V2> Deprecated. Use org.apache.hadoop.mapreduce.Partitioner instead.
RecordReader<K,V> RecordReader reads <key, value> pairs from an InputSplit.
RecordWriter<K,V> RecordWriter writes the output <key, value> pairs to an output file.
Reducer<K2,V2,K3,V3> Deprecated. Use org.apache.hadoop.mapreduce.Reducer instead.
Reporter A facility for Map-Reduce applications to report progress and update counters, status information, etc.
RunningJob Deprecated. Use org.apache.hadoop.mapreduce.Job instead.
SequenceFileInputFilter.Filter Filter interface.
TaskAttemptContext Deprecated. Use org.apache.hadoop.mapreduce.TaskAttemptContext instead.
 

Class Summary
ClusterStatus Deprecated. Use org.apache.hadoop.mapreduce.ClusterMetrics or org.apache.hadoop.mapreduce.TaskTrackerInfo instead.
ClusterStatus.BlackListInfo Class which encapsulates information about a blacklisted tasktracker.
Counters Deprecated. Use org.apache.hadoop.mapreduce.Counters instead.
Counters.Counter A counter record, comprising its name and value.
Counters.Group A group of counters, comprising counters from a particular counter Enum class.
FileInputFormat<K,V> Deprecated. Use org.apache.hadoop.mapreduce.lib.input.FileInputFormat instead.
FileOutputCommitter An OutputCommitter that commits files specified in the job output directory, i.e. ${mapred.output.dir}.
FileOutputFormat<K,V> A base class for file-based OutputFormat implementations.
FileSplit Deprecated. Use org.apache.hadoop.mapreduce.lib.input.FileSplit instead.
ID Deprecated.
IsolationRunner IsolationRunner is intended to facilitate debugging by re-running a specific task, given left-over task files for a (typically failed) past job.
JobClient Deprecated. Use org.apache.hadoop.mapreduce.Job and org.apache.hadoop.mapreduce.Cluster instead.
JobConf Deprecated. Use org.apache.hadoop.conf.Configuration instead.
JobID Deprecated.
JobQueueInfo Deprecated. Use org.apache.hadoop.mapreduce.QueueInfo instead.
JobStatus Deprecated. Use org.apache.hadoop.mapreduce.JobStatus instead.
KeyValueLineRecordReader Deprecated. Use org.apache.hadoop.mapreduce.lib.input.KeyValueLineRecordReader instead.
KeyValueTextInputFormat Deprecated. Use org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat instead.
LineRecordReader.LineReader Deprecated. Use org.apache.hadoop.util.LineReader instead.
MapFileOutputFormat Deprecated. Use org.apache.hadoop.mapreduce.lib.output.MapFileOutputFormat instead.
MapReduceBase Deprecated.
MapRunner<K1,V1,K2,V2> Default MapRunnable implementation.
MultiFileInputFormat<K,V> Deprecated. Use org.apache.hadoop.mapred.lib.CombineFileInputFormat instead.
MultiFileSplit Deprecated. Use org.apache.hadoop.mapred.lib.CombineFileSplit instead.
OutputCommitter Deprecated. Use org.apache.hadoop.mapreduce.OutputCommitter instead.
OutputLogFilter Deprecated. Use Utils.OutputFileUtils.OutputLogFilter instead.
SequenceFileAsBinaryInputFormat Deprecated. Use org.apache.hadoop.mapreduce.lib.input.SequenceFileAsBinaryInputFormat instead.
SequenceFileAsBinaryInputFormat.SequenceFileAsBinaryRecordReader Read records from a SequenceFile as binary (raw) bytes.
SequenceFileAsBinaryOutputFormat Deprecated. Use org.apache.hadoop.mapreduce.lib.output.SequenceFileAsBinaryOutputFormat instead.
SequenceFileAsBinaryOutputFormat.WritableValueBytes Inner class used for appendRaw.
SequenceFileAsTextInputFormat Deprecated. Use org.apache.hadoop.mapreduce.lib.input.SequenceFileAsTextInputFormat instead.
SequenceFileAsTextRecordReader Deprecated. Use org.apache.hadoop.mapreduce.lib.input.SequenceFileAsTextRecordReader instead.
SequenceFileInputFilter<K,V> Deprecated. Use org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFilter instead.
SequenceFileInputFilter.FilterBase Base class for filters.
SequenceFileInputFilter.MD5Filter This class returns a set of records by examining the MD5 digest of each record's key against a filtering frequency f.
SequenceFileInputFilter.PercentFilter This class returns a percentage of records. The percentage is determined by a filtering frequency f, using the criterion record# % f == 0.
SequenceFileInputFilter.RegexFilter Filters records by matching their keys against a regex.
SequenceFileInputFormat<K,V> Deprecated. Use org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat instead.
SequenceFileOutputFormat<K,V> Deprecated. Use org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat instead.
SequenceFileRecordReader<K,V> A RecordReader for SequenceFiles.
SkipBadRecords Utility class for the skip-bad-records functionality.
TaskAttemptID Deprecated.
TaskCompletionEvent Deprecated. Use org.apache.hadoop.mapreduce.TaskCompletionEvent instead.
TaskID Deprecated.
TaskLog.Reader  
TaskLogAppender A simple log4j-appender for the task child's map-reduce system logs.
TaskReport Deprecated. Use org.apache.hadoop.mapreduce.TaskReport instead.
TextInputFormat Deprecated. Use org.apache.hadoop.mapreduce.lib.input.TextInputFormat instead.
TextOutputFormat<K,V> Deprecated. Use org.apache.hadoop.mapreduce.lib.output.TextOutputFormat instead.
TextOutputFormat.LineRecordWriter<K,V>  
Utils A utility class.
Utils.OutputFileUtils  
Utils.OutputFileUtils.OutputFilesFilter This class filters output (part) files from the given directory. It does not accept files named _logs or _SUCCESS.
Utils.OutputFileUtils.OutputLogFilter This class filters log files from the given directory. It does not accept paths containing _logs.
 

Enum Summary
JobClient.TaskStatusFilter  
JobPriority Deprecated. Use org.apache.hadoop.mapreduce.JobPriority instead.
TaskCompletionEvent.Status  
 

Exception Summary
FileAlreadyExistsException Deprecated.
InvalidFileTypeException Used when file type differs from the desired file type.
InvalidInputException This class wraps a list of problems with the input, so that the user can get a list of problems together instead of finding and fixing them one by one.
InvalidJobConfException This exception is thrown when the job configuration is missing some mandatory attributes or the value of some attributes is invalid.
 

Package org.apache.hadoop.mapred Description

A software framework for easily writing applications that process vast amounts of data (multi-terabyte data-sets) in parallel on large clusters (thousands of nodes) built of commodity hardware in a reliable, fault-tolerant manner.

A Map-Reduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner, followed by the reduce tasks which aggregate their output. Typically both the input and the output of the job are stored in a FileSystem. The framework takes care of monitoring tasks and re-executing failed ones. Since the compute nodes and the storage nodes are usually the same, i.e. Hadoop's Map-Reduce framework and Distributed FileSystem run on the same set of nodes, tasks are effectively scheduled on the nodes where data is already present, resulting in very high aggregate bandwidth across the cluster.

The Map-Reduce framework operates exclusively on <key, value> pairs i.e. the input to the job is viewed as a set of <key, value> pairs and the output as another, possibly different, set of <key, value> pairs. The keys and values have to be serializable as Writables and additionally the keys have to be WritableComparables in order to facilitate grouping by the framework.
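
As an illustration, a minimal custom key type might look like the following sketch; WordOffsetKey and its fields are hypothetical, not part of this package (values need only implement Writable):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class WordOffsetKey implements WritableComparable<WordOffsetKey> {
  private String word = "";
  private long offset;

  // Writable: serialize the fields in a fixed order ...
  public void write(DataOutput out) throws IOException {
    out.writeUTF(word);
    out.writeLong(offset);
  }

  // ... and deserialize them in the same order.
  public void readFields(DataInput in) throws IOException {
    word = in.readUTF();
    offset = in.readLong();
  }

  // Comparable: defines the sort and grouping order of intermediate keys.
  public int compareTo(WordOffsetKey other) {
    int cmp = word.compareTo(other.word);
    if (cmp != 0) return cmp;
    return offset < other.offset ? -1 : (offset == other.offset ? 0 : 1);
  }

  // hashCode() is used by the default HashPartitioner to assign keys
  // to reduce tasks; keep it consistent with equals().
  public int hashCode() {
    return word.hashCode() * 31 + (int) (offset ^ (offset >>> 32));
  }

  public boolean equals(Object o) {
    if (!(o instanceof WordOffsetKey)) return false;
    WordOffsetKey k = (WordOffsetKey) o;
    return word.equals(k.word) && offset == k.offset;
  }
}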

Data flow:

    (input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)

Applications typically implement the Mapper.map(Object, Object, OutputCollector, Reporter) and Reducer.reduce(Object, Iterator, OutputCollector, Reporter) methods. The application writer also specifies various facets of the job, such as the input and output locations and the Partitioner, InputFormat and OutputFormat implementations to be used, via a JobConf. The client, JobClient, then submits the job to the framework and can optionally monitor it.

The framework spawns one map task per InputSplit generated by the InputFormat of the job and calls Mapper.map(Object, Object, OutputCollector, Reporter) with each <key, value> pair read by the RecordReader from the InputSplit for the task. The intermediate outputs of the maps are then grouped by key and optionally aggregated by the combiner. The key space of the intermediate outputs is partitioned by the Partitioner, where the number of partitions is exactly the number of reduce tasks for the job.
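
As an illustration, a custom Partitioner for this package's API might look like the following sketch; FirstCharPartitioner is hypothetical, not part of Hadoop:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Routes each record to a reduce task based on the first character of
// its key, so that all keys starting with the same character land in
// the same partition (and hence the same output file).
public class FirstCharPartitioner implements Partitioner<Text, LongWritable> {

  public void configure(JobConf job) {
    // No configuration needed for this example.
  }

  public int getPartition(Text key, LongWritable value, int numPartitions) {
    String s = key.toString();
    int c = s.length() == 0 ? 0 : s.charAt(0);
    // Mask the sign bit so the partition number is never negative.
    return (c & Integer.MAX_VALUE) % numPartitions;
  }
}

Such a class would be plugged in via JobConf.setPartitionerClass(FirstCharPartitioner.class); when no Partitioner is set, the framework defaults to HashPartitioner, which assigns (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks.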

The reduce tasks fetch the sorted intermediate outputs of the maps via HTTP, merge the <key, value> pairs, and call Reducer.reduce(Object, Iterator, OutputCollector, Reporter) for each <key, list of values> pair. The output of the reduce tasks is stored on the FileSystem by the RecordWriter provided by the OutputFormat of the job.
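
As an illustration, a custom OutputFormat that supplies its own RecordWriter might look like the following sketch; KeyEqualsValueOutputFormat is hypothetical (TextOutputFormat in this package is the real, full-featured analogue):

import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.Progressable;

// Writes one "key=value" line per record to the task's output file.
public class KeyEqualsValueOutputFormat<K, V> extends FileOutputFormat<K, V> {

  public RecordWriter<K, V> getRecordWriter(FileSystem ignored, JobConf job,
      String name, Progressable progress) throws IOException {
    // Resolve the task's output file inside the job output directory.
    Path file = FileOutputFormat.getTaskOutputPath(job, name);
    FileSystem fs = file.getFileSystem(job);
    final DataOutputStream out = fs.create(file, progress);

    return new RecordWriter<K, V>() {
      public void write(K key, V value) throws IOException {
        out.writeBytes(key + "=" + value + "\n");
      }
      public void close(Reporter reporter) throws IOException {
        out.close();
      }
    };
  }
}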

Map-Reduce application to perform a distributed grep:


import java.io.IOException;
import java.util.Iterator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Grep extends Configured implements Tool {

  // map: Search for the pattern specified by 'grep.mapper.regex' &
  //      'grep.mapper.regex.group'

  public static class GrepMapper<K>
      extends MapReduceBase implements Mapper<K, Text, Text, LongWritable> {

    private Pattern pattern;
    private int group;

    public void configure(JobConf job) {
      pattern = Pattern.compile(job.get("grep.mapper.regex"));
      group = job.getInt("grep.mapper.regex.group", 0);
    }

    public void map(K key, Text value,
                    OutputCollector<Text, LongWritable> output,
                    Reporter reporter)
    throws IOException {
      String text = value.toString();
      Matcher matcher = pattern.matcher(text);
      while (matcher.find()) {
        output.collect(new Text(matcher.group(group)), new LongWritable(1));
      }
    }
  }

  // reduce: Count the number of occurrences of the pattern

  public static class GrepReducer<K> extends MapReduceBase
      implements Reducer<K, LongWritable, K, LongWritable> {

    public void reduce(K key, Iterator<LongWritable> values,
                       OutputCollector<K, LongWritable> output,
                       Reporter reporter)
    throws IOException {

      // sum all values for this key
      long sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }

      // output sum
      output.collect(key, new LongWritable(sum));
    }
  }
  
  public int run(String[] args) throws Exception {
    if (args.length < 3) {
      System.out.println("Grep <inDir> <outDir> <regex> [<group>]");
      ToolRunner.printGenericCommandUsage(System.out);
      return -1;
    }

    JobConf grepJob = new JobConf(getConf(), Grep.class);
    
    grepJob.setJobName("grep");

    FileInputFormat.setInputPaths(grepJob, new Path(args[0]));
    FileOutputFormat.setOutputPath(grepJob, new Path(args[1]));

    grepJob.setMapperClass(GrepMapper.class);
    grepJob.setCombinerClass(GrepReducer.class);
    grepJob.setReducerClass(GrepReducer.class);

    grepJob.set("mapreduce.mapper.regex", args[2]);
    if (args.length == 4)
      grepJob.set("mapreduce.mapper.regexmapper..group", args[3]);

    grepJob.setOutputFormat(SequenceFileOutputFormat.class);
    grepJob.setOutputKeyClass(Text.class);
    grepJob.setOutputValueClass(LongWritable.class);

    JobClient.runJob(grepJob);

    return 0;
  }

  public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new Configuration(), new Grep(), args);
    System.exit(res);
  }

}

Notice how the data flow of the above grep job is very similar to doing the same via the Unix pipeline:

cat input/*   |   grep   |   sort    |   uniq -c   >   out
      input   |    map   |  shuffle  |   reduce    >   out

Hadoop Map-Reduce applications need not be written in Java™ only. Hadoop Streaming is a utility which allows users to create and run jobs with any executables (e.g. shell utilities) as the mapper and/or the reducer. Hadoop Pipes is a SWIG-compatible C++ API to implement Map-Reduce applications (non-JNI™ based).

See Google's original Map/Reduce paper for background information.

Java and JNI are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries.



Copyright © 2009 The Apache Software Foundation