Tools for using Avro data files with Hadoop MapReduce jobs.
Avro data files do not contain key/value pairs as expected by
Hadoop's MapReduce API, but rather just a sequence of values. Thus
we provide here a layer on top of Hadoop's MapReduce API which
eliminates the key/value distinction.
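To illustrate the eliminated key/value distinction, a mapper in this layer receives a single input datum per call rather than a key/value pair. A minimal sketch, assuming the {@code AvroMapper}/{@code AvroCollector} method signatures of the Avro mapred API; the class {@code WordLengthMapper} is a hypothetical example, not part of this package:

```java
import java.io.IOException;

import org.apache.avro.mapred.AvroCollector;
import org.apache.avro.mapred.AvroMapper;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical mapper: receives one string datum (no key) and emits its
// length as a single output value (again, no key).
public class WordLengthMapper extends AvroMapper<String, Integer> {
  @Override
  public void map(String datum, AvroCollector<Integer> collector,
                  Reporter reporter) throws IOException {
    // Emit one value; there is no key/value pair at this level.
    collector.collect(datum.length());
  }
}
```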
To use this for jobs whose input and output are Avro data files:
- Subclass {@link org.apache.avro.mapred.AvroMapper} and specify
this as your job's mapper.
- Subclass {@link org.apache.avro.mapred.AvroReducer} and specify
this as your job's reducer and, optionally, its combiner.
- Depending on whether your mapper uses Avro's specific or
generic API for inputs, call one of {@link
org.apache.avro.mapred.AvroJob#setInputSpecific} or {@link
org.apache.avro.mapred.AvroJob#setInputGeneric} with your input schema.
- Depending on whether your job uses Avro's specific or generic
API for outputs, call one of {@link
org.apache.avro.mapred.AvroJob#setOutputSpecific} or {@link
org.apache.avro.mapred.AvroJob#setOutputGeneric} with your output
schema.
- Specify input files with {@link org.apache.hadoop.mapred.FileInputFormat#setInputPaths}.
- Specify an output directory with {@link
org.apache.hadoop.mapred.FileOutputFormat#setOutputPath}.
- Run your job with {@link org.apache.hadoop.mapred.JobClient#runJob}.
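The steps above can be sketched as a single driver. This is a configuration sketch, not a definitive implementation: the schema-setting calls are those named above, but the {@code AvroJob.setMapperClass}/{@code setReducerClass} helpers and the exact method signatures are assumptions, and the class and path names ({@code WordLengthJob}, {@code WordLengthMapper}, {@code SumReducer}, {@code "in"}, {@code "out"}) are hypothetical:

```java
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.mapred.AvroJob;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordLengthJob {
  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(WordLengthJob.class);
    conf.setJobName("word-length");

    // Generic-API schemas for input and output (signatures assumed:
    // AvroJob.setInputGeneric(JobConf, Schema), likewise for output).
    AvroJob.setInputGeneric(conf, Schema.create(Schema.Type.STRING));
    AvroJob.setOutputGeneric(conf, Schema.create(Schema.Type.INT));

    // Register the AvroMapper/AvroReducer subclasses; these helper
    // methods are assumed, and both classes are hypothetical.
    AvroJob.setMapperClass(conf, WordLengthMapper.class);
    AvroJob.setReducerClass(conf, SumReducer.class);

    // Input files and output directory, as in the list above.
    FileInputFormat.setInputPaths(conf, new Path("in"));
    FileOutputFormat.setOutputPath(conf, new Path("out"));

    JobClient.runJob(conf);
  }
}
```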