org.apache.mahout.utils.nlp.collocations.llr
Class CollocReducer

java.lang.Object
  extended by org.apache.hadoop.mapred.MapReduceBase
      extended by org.apache.mahout.utils.nlp.collocations.llr.CollocReducer
All Implemented Interfaces:
java.io.Closeable, org.apache.hadoop.mapred.JobConfigurable, org.apache.hadoop.mapred.Reducer<GramKey,Gram,Gram,Gram>

public class CollocReducer
extends org.apache.hadoop.mapred.MapReduceBase
implements org.apache.hadoop.mapred.Reducer<GramKey,Gram,Gram,Gram>

Reducer for Pass 1 of the collocation identification job. Generates counts for ngrams and subgrams.


Nested Class Summary
static class CollocReducer.Skipped
           
 
Field Summary
static int DEFAULT_MIN_SUPPORT
           
static java.lang.String MIN_SUPPORT
           
 
Constructor Summary
CollocReducer()
           
 
Method Summary
 void configure(org.apache.hadoop.mapred.JobConf job)
           
protected  void processSubgram(GramKey key, java.util.Iterator<Gram> values, org.apache.hadoop.mapred.OutputCollector<Gram,Gram> output, org.apache.hadoop.mapred.Reporter reporter)
          Sums the frequencies for a subgram and its ngrams, and delivers (ngram, subgram) pairs to the collector.
protected  void processUnigram(GramKey key, java.util.Iterator<Gram> values, org.apache.hadoop.mapred.OutputCollector<Gram,Gram> output, org.apache.hadoop.mapred.Reporter reporter)
          Sums the frequencies for unigrams and delivers them to the collector.
 void reduce(GramKey key, java.util.Iterator<Gram> values, org.apache.hadoop.mapred.OutputCollector<Gram,Gram> output, org.apache.hadoop.mapred.Reporter reporter)
          Collocation finder, pass 1 reduce phase: sums the partial gram frequencies received from the mapper and outputs them for the LLR calculation.

 
Methods inherited from class org.apache.hadoop.mapred.MapReduceBase
close
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface java.io.Closeable
close
 

Field Detail

MIN_SUPPORT

public static final java.lang.String MIN_SUPPORT
See Also:
Constant Field Values

DEFAULT_MIN_SUPPORT

public static final int DEFAULT_MIN_SUPPORT
See Also:
Constant Field Values
Constructor Detail

CollocReducer

public CollocReducer()
Method Detail

configure

public void configure(org.apache.hadoop.mapred.JobConf job)
Specified by:
configure in interface org.apache.hadoop.mapred.JobConfigurable
Overrides:
configure in class org.apache.hadoop.mapred.MapReduceBase

reduce

public void reduce(GramKey key,
                   java.util.Iterator<Gram> values,
                   org.apache.hadoop.mapred.OutputCollector<Gram,Gram> output,
                   org.apache.hadoop.mapred.Reporter reporter)
            throws java.io.IOException
Collocation finder, pass 1 reduce phase.

Given the following input from the mapper:

 k:head_subgram,ngram,  v:ngram:partial freq
 k:head_subgram         v:head_subgram:partial freq
 k:tail_subgram,ngram,  v:ngram:partial freq
 k:tail_subgram         v:tail_subgram:partial freq
 k:unigram              v:unigram:partial freq
 
the reducer sums the gram frequencies and outputs them for the LLR calculation.

The output is:

 k:ngram:ngramfreq      v:head_subgram:head_subgramfreq
 k:ngram:ngramfreq      v:tail_subgram:tail_subgramfreq
 k:unigram:unigramfreq  v:unigram:unigramfreq
 
Each ngram's frequency is effectively counted twice, once under its head subgram and once under its tail subgram; the two counts should be identical. Open question: should this be changed to count the ngram only under the head and move the count into the value?

Specified by:
reduce in interface org.apache.hadoop.mapred.Reducer<GramKey,Gram,Gram,Gram>
Throws:
java.io.IOException
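
The note above about each ngram being counted once under its head and once under its tail can be illustrated with a small plain-Java simulation that does not use Hadoop or Mahout. This is only a sketch: the class name DoubleCountSketch, the string encoding of the "subgram,ngram" key, the restriction to bigrams, and the sample counts are all assumptions made for illustration and are not part of this API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DoubleCountSketch {

    /**
     * Sums partial ngram frequencies under a hypothetical "subgram,ngram" key.
     * Each record is {ngramText, partialFrequency}. Only bigrams are handled:
     * the head is taken to be the first word and the tail the last word.
     */
    public static Map<String, Integer> sumBySubgram(String[][] records, boolean byHead) {
        Map<String, Integer> totals = new LinkedHashMap<>();
        for (String[] record : records) {
            String ngram = record[0];
            int partialFreq = Integer.parseInt(record[1]);
            String[] words = ngram.split(" ");
            String subgram = byHead ? words[0] : words[words.length - 1];
            // Accumulate the partial frequency for this (subgram, ngram) pair.
            totals.merge(subgram + "," + ngram, partialFreq, Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Made-up partial frequencies, as they might arrive from several mappers.
        String[][] records = {
            {"new york", "2"}, {"new york", "3"}, {"new jersey", "1"}, {"york city", "4"}
        };
        Map<String, Integer> underHead = sumBySubgram(records, true);
        Map<String, Integer> underTail = sumBySubgram(records, false);
        // The same ngram total is computed twice, once per grouping.
        System.out.println("head grouping: " + underHead);
        System.out.println("tail grouping: " + underTail);
    }
}
```

In this sketch the total for "new york" is computed independently under the key "new,new york" and under the key "york,new york", and both sums come out the same, which is the redundancy the note describes.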

processUnigram

protected void processUnigram(GramKey key,
                              java.util.Iterator<Gram> values,
                              org.apache.hadoop.mapred.OutputCollector<Gram,Gram> output,
                              org.apache.hadoop.mapred.Reporter reporter)
                       throws java.io.IOException
Sums the frequencies for unigrams and delivers them to the collector.

Throws:
java.io.IOException

processSubgram

protected void processSubgram(GramKey key,
                              java.util.Iterator<Gram> values,
                              org.apache.hadoop.mapred.OutputCollector<Gram,Gram> output,
                              org.apache.hadoop.mapred.Reporter reporter)
                       throws java.io.IOException
Sums the frequencies for a subgram and its ngrams, and delivers (ngram, subgram) pairs to the collector.

The sort order guarantees that the subgram/subgram pairs are seen first, followed by the subgram/ngram1 pairs, the subgram/ngram2 pairs, and so on up to the subgram/ngramN pairs, so the frequencies for the ngrams can be calculated here as well.

As a result, the frequency of each ngram is calculated once per subgram (head and tail), which is some extra work.

Throws:
java.io.IOException
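
The single-pass logic that the sort order makes possible can be sketched in plain Java as follows. This is not the Mahout implementation: the Entry class, the string-based output lines, and the use of text equality to recognise the subgram's own entries are hypothetical simplifications chosen so that the example runs without Hadoop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class SubgramPassSketch {

    /** Hypothetical stand-in for a gram value: its text and a partial frequency. */
    public static final class Entry {
        final String text;
        final int freq;
        public Entry(String text, int freq) { this.text = text; this.freq = freq; }
    }

    /**
     * Assumes values are sorted so that the subgram's own entries come first,
     * followed by one contiguous run per ngram.
     * Returns lines of the form "ngram:ngramFreq<TAB>subgram:subgramFreq".
     */
    public static List<String> processSubgram(String subgram, Iterator<Entry> values) {
        List<String> out = new ArrayList<>();
        int subgramFreq = 0;
        String currentNgram = null;
        int ngramFreq = 0;
        while (values.hasNext()) {
            Entry e = values.next();
            if (e.text.equals(subgram)) {
                // Subgram entries arrive first, so this total is complete
                // before the first ngram pair is emitted.
                subgramFreq += e.freq;
            } else if (e.text.equals(currentNgram)) {
                ngramFreq += e.freq;
            } else {
                // A new ngram run starts: flush the previous one, if any.
                if (currentNgram != null) {
                    out.add(currentNgram + ":" + ngramFreq + "\t" + subgram + ":" + subgramFreq);
                }
                currentNgram = e.text;
                ngramFreq = e.freq;
            }
        }
        if (currentNgram != null) {
            out.add(currentNgram + ":" + ngramFreq + "\t" + subgram + ":" + subgramFreq);
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up, pre-sorted values for the head subgram "new".
        List<Entry> values = Arrays.asList(
            new Entry("new", 5), new Entry("new", 2),
            new Entry("new jersey", 1),
            new Entry("new york", 2), new Entry("new york", 3));
        for (String line : processSubgram("new", values.iterator())) {
            System.out.println(line);
        }
    }
}
```

Because the subgram total is finished before any ngram run begins, each pair can be emitted as soon as its run ends and no buffering of earlier ngrams is needed. The same pass runs again for the tail subgram of each ngram, which is the extra work mentioned above.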


Copyright © 2008-2010 The Apache Software Foundation. All Rights Reserved.