org.apache.mahout.classifier.bayes.mapreduce.common
Class BayesFeatureMapper

java.lang.Object
  extended by org.apache.hadoop.mapred.MapReduceBase
      extended by org.apache.mahout.classifier.bayes.mapreduce.common.BayesFeatureMapper
All Implemented Interfaces:
java.io.Closeable, org.apache.hadoop.mapred.JobConfigurable, org.apache.hadoop.mapred.Mapper<org.apache.hadoop.io.Text,org.apache.hadoop.io.Text,StringTuple,org.apache.hadoop.io.DoubleWritable>

public class BayesFeatureMapper
extends org.apache.hadoop.mapred.MapReduceBase
implements org.apache.hadoop.mapred.Mapper<org.apache.hadoop.io.Text,org.apache.hadoop.io.Text,StringTuple,org.apache.hadoop.io.DoubleWritable>

Reads the input training set (preprocessed using the BayesFileFormatter).


Nested Class Summary
static class BayesFeatureMapper.IteratorTokenStream
          Used to emit tokens from an input string array in the style of TokenStream
 
Constructor Summary
BayesFeatureMapper()
           
 
Method Summary
 void configure(org.apache.hadoop.mapred.JobConf job)
           
 void map(org.apache.hadoop.io.Text key, org.apache.hadoop.io.Text value, org.apache.hadoop.mapred.OutputCollector<StringTuple,org.apache.hadoop.io.DoubleWritable> output, org.apache.hadoop.mapred.Reporter reporter)
          We need to count the number of times we've seen a term with a given label and we need to output that.
 
Methods inherited from class org.apache.hadoop.mapred.MapReduceBase
close
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface java.io.Closeable
close
 

Constructor Detail

BayesFeatureMapper

public BayesFeatureMapper()
Method Detail

map

public void map(org.apache.hadoop.io.Text key,
                org.apache.hadoop.io.Text value,
                org.apache.hadoop.mapred.OutputCollector<StringTuple,org.apache.hadoop.io.DoubleWritable> output,
                org.apache.hadoop.mapred.Reporter reporter)
         throws java.io.IOException
We need to count the number of times we've seen a term with a given label and we need to output that. But this Mapper does more than just output the count. First, it performs weight normalisation. Second, for each unique word in a document it outputs the value 1, which is summed to give the term document frequency and is later used to calculate the IDF. Third, for each label it outputs the number of times a document was seen (also used in the IDF calculation).

Specified by:
map in interface org.apache.hadoop.mapred.Mapper<org.apache.hadoop.io.Text,org.apache.hadoop.io.Text,StringTuple,org.apache.hadoop.io.DoubleWritable>
Parameters:
key - The label
value - The features (all unique) associated with this label, in StringTuple format
output - The OutputCollector to write the results to
reporter - Not used
Throws:
java.io.IOException
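
The three kinds of output described for map can be illustrated with a minimal, self-contained sketch. It does not use the Hadoop or Mahout classes and is not the actual implementation. The class name BayesFeatureSketch, the comma-joined string keys (standing in for StringTuple keys), the key prefixes WEIGHT, DOCUMENT_FREQUENCY and LABEL_COUNT, and the normalisation formula (log-scaled term frequency divided by the document's L2 length) are all assumptions made for illustration; the description above only states that weight normalisation takes place.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the mapper logic; not part of Mahout.
final class BayesFeatureSketch {

    /**
     * Produces, for one document, the three kinds of records described for map():
     * a normalised weight per (label, term), a 1 per unique term for the
     * document frequency, and a 1 for the label's document count.
     */
    static Map<String, Double> map(String label, String[] tokens) {
        // Raw term counts for this single document.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }

        // ASSUMPTION: weight = log(1 + tf), normalised by the L2 length of the document.
        double sumOfSquares = 0.0;
        for (int count : counts.values()) {
            double w = Math.log1p(count);
            sumOfSquares += w * w;
        }
        double lengthNorm = Math.sqrt(sumOfSquares);

        Map<String, Double> out = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            String term = e.getKey();
            // 1) Normalised weight of the term under this label.
            out.put("WEIGHT," + label + "," + term, Math.log1p(e.getValue()) / lengthNorm);
            // 2) One per unique term; summing these downstream gives the document frequency.
            out.put("DOCUMENT_FREQUENCY," + label + "," + term, 1.0);
        }
        // 3) One per document; summing these downstream gives the per-label document count.
        out.put("LABEL_COUNT," + label, 1.0);
        return out;
    }

    public static void main(String[] args) {
        // Example document with label "sports" and three tokens (two unique).
        Map<String, Double> records = map("sports", new String[] {"ball", "goal", "ball"});
        records.forEach((key, value) -> System.out.println(key + "\t" + value));
    }
}
```

In a real job a reducer would sum the values per key, so the 1.0 entries become document-frequency and label counts across the whole training set.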

configure

public void configure(org.apache.hadoop.mapred.JobConf job)
Specified by:
configure in interface org.apache.hadoop.mapred.JobConfigurable
Overrides:
configure in class org.apache.hadoop.mapred.MapReduceBase


Copyright © 2008-2010 The Apache Software Foundation. All Rights Reserved.