Run Hadoop MapReduce jobs over
Avro data, with map and reduce functions written in Java.
Avro data files do not contain key/value pairs as expected by
Hadoop's MapReduce API, but rather just a sequence of values. This
package therefore provides a layer on top of Hadoop's MapReduce API
that eliminates the key/value distinction.
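For example, a map function written against this layer receives each
input datum as a single value; there is no key/value split in its
signature. Below is a minimal word-count sketch: the class name, the
whitespace tokenization, and the choice of a {@link
org.apache.avro.mapred.Pair} output type are illustrative assumptions
for this example, not requirements of the API.

<pre>
import java.io.IOException;

import org.apache.avro.mapred.AvroCollector;
import org.apache.avro.mapred.AvroMapper;
import org.apache.avro.mapred.Pair;
import org.apache.avro.util.Utf8;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical mapper: each input datum arrives as one value
// (here a line of text), not as a key/value pair.
public class WordCountMapper extends AvroMapper<Utf8, Pair<Utf8, Long>> {
  @Override
  public void map(Utf8 line, AvroCollector<Pair<Utf8, Long>> collector,
                  Reporter reporter) throws IOException {
    // Split on whitespace and emit a (word, 1) pair per word.
    for (String word : line.toString().split("\\s+"))
      if (!word.isEmpty())
        collector.collect(new Pair<Utf8, Long>(new Utf8(word), 1L));
  }
}
</pre>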
To use this layer for jobs whose input and output are Avro data files:
- Call {@link org.apache.avro.mapred.AvroJob#setInputSchema} and
{@link org.apache.avro.mapred.AvroJob#setOutputSchema} with your
job's input and output schemas.
- Subclass {@link org.apache.avro.mapred.AvroMapper} and specify
this as your job's mapper with {@link
org.apache.avro.mapred.AvroJob#setMapperClass} (as in the mapper
sketch above).
- Subclass {@link org.apache.avro.mapred.AvroReducer} and specify
this as your job's reducer, and optionally its combiner, with {@link
org.apache.avro.mapred.AvroJob#setReducerClass} and {@link
org.apache.avro.mapred.AvroJob#setCombinerClass} (see the reducer
sketch after this list).
- Specify input files with {@link org.apache.hadoop.mapred.FileInputFormat#setInputPaths}.
- Specify an output directory with {@link
org.apache.hadoop.mapred.FileOutputFormat#setOutputPath}.
- Run your job with {@link org.apache.hadoop.mapred.JobClient#runJob}
(an end-to-end configuration sketch follows the reducer sketch below).
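The reduce side mirrors the mapper's shape: the reducer receives a key
and an iterable of values, and collects output values. A sketch that
matches the hypothetical word-count mapper above, summing the per-word
counts (and therefore also usable as the combiner):

<pre>
import java.io.IOException;

import org.apache.avro.mapred.AvroCollector;
import org.apache.avro.mapred.AvroReducer;
import org.apache.avro.mapred.Pair;
import org.apache.avro.util.Utf8;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical reducer: sums the counts collected for each word
// and emits one (word, total) pair per word.
public class WordCountReducer extends AvroReducer<Utf8, Long, Pair<Utf8, Long>> {
  @Override
  public void reduce(Utf8 word, Iterable<Long> counts,
                     AvroCollector<Pair<Utf8, Long>> collector,
                     Reporter reporter) throws IOException {
    long sum = 0;
    for (long count : counts)
      sum += count;
    collector.collect(new Pair<Utf8, Long>(word, sum));
  }
}
</pre>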
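Tying the steps above together, a hedged end-to-end configuration
sketch; the job name, the command-line handling of paths, and the
word-count schemas are assumptions made for this example:

<pre>
import org.apache.avro.Schema;
import org.apache.avro.mapred.AvroJob;
import org.apache.avro.mapred.Pair;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCountJob {
  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf();
    job.setJobName("wordcount"); // hypothetical job name

    // Input is a sequence of strings; output is a sequence of
    // (word, count) pairs.
    AvroJob.setInputSchema(job, Schema.create(Schema.Type.STRING));
    AvroJob.setOutputSchema(job,
        Pair.getPairSchema(Schema.create(Schema.Type.STRING),
                           Schema.create(Schema.Type.LONG)));

    // The mapper and reducer sketched earlier; the reducer also
    // serves as the combiner here.
    AvroJob.setMapperClass(job, WordCountMapper.class);
    AvroJob.setCombinerClass(job, WordCountReducer.class);
    AvroJob.setReducerClass(job, WordCountReducer.class);

    // Input files and output directory from the command line.
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    JobClient.runJob(job);
  }
}
</pre>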