org.apache.mahout.utils.nlp.collocations.llr
Class CollocMapper

java.lang.Object
  extended by org.apache.hadoop.mapred.MapReduceBase
      extended by org.apache.mahout.utils.nlp.collocations.llr.CollocMapper
All Implemented Interfaces:
java.io.Closeable, org.apache.hadoop.mapred.JobConfigurable, org.apache.hadoop.mapred.Mapper<org.apache.hadoop.io.Text,StringTuple,GramKey,Gram>

public class CollocMapper
extends org.apache.hadoop.mapred.MapReduceBase
implements org.apache.hadoop.mapred.Mapper<org.apache.hadoop.io.Text,StringTuple,GramKey,Gram>

Pass 1 of the Collocation discovery job, which generates ngrams and emits each ngram and its component (n-1)grams. Input is a SequenceFile, where the key is a document id and the value is the tokenized document.


Nested Class Summary
static class CollocMapper.Count
           
static class CollocMapper.IteratorTokenStream
          Used to emit tokens from an input string array in the style of TokenStream
 
Field Summary
static int DEFAULT_MAX_SHINGLE_SIZE
           
static java.lang.String MAX_SHINGLE_SIZE
           
 
Constructor Summary
CollocMapper()
           
 
Method Summary
 void configure(org.apache.hadoop.mapred.JobConf job)
           
 void map(org.apache.hadoop.io.Text key, StringTuple value, org.apache.hadoop.mapred.OutputCollector<GramKey,Gram> collector, org.apache.hadoop.mapred.Reporter reporter)
          Collocation finder: pass 1 map phase.
 
Methods inherited from class org.apache.hadoop.mapred.MapReduceBase
close
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface java.io.Closeable
close
 

Field Detail

MAX_SHINGLE_SIZE

public static final java.lang.String MAX_SHINGLE_SIZE
See Also:
Constant Field Values

DEFAULT_MAX_SHINGLE_SIZE

public static final int DEFAULT_MAX_SHINGLE_SIZE
See Also:
Constant Field Values
Constructor Detail

CollocMapper

public CollocMapper()
Method Detail

configure

public void configure(org.apache.hadoop.mapred.JobConf job)
Specified by:
configure in interface org.apache.hadoop.mapred.JobConfigurable
Overrides:
configure in class org.apache.hadoop.mapred.MapReduceBase

map

public void map(org.apache.hadoop.io.Text key,
                StringTuple value,
                org.apache.hadoop.mapred.OutputCollector<GramKey,Gram> collector,
                org.apache.hadoop.mapred.Reporter reporter)
         throws java.io.IOException
Collocation finder: pass 1 map phase.

Receives a token stream, which is passed through a Lucene ShingleFilter. The ShingleFilter delivers ngrams of the appropriate size, which are then decomposed into head and tail subgrams and collected in the following manner:

 k:head_key,           v:head_subgram
 k:head_key,ngram_key, v:ngram
 k:tail_key,           v:tail_subgram
 k:tail_key,ngram_key, v:ngram
 
The 'head' or 'tail' prefix is used to specify whether the subgram in question is the head or tail of the ngram. In this implementation the head of the ngram is a (n-1)gram, and the tail is a (1)gram.

For example, given 'click and clack' and an ngram length of 3:

 k: head_'click and'                         v:head_'click and'
 k: head_'click and',ngram_'click and clack' v:ngram_'click and clack'
 k: tail_'clack'                             v:tail_'clack'
 k: tail_'clack',ngram_'click and clack'     v:ngram_'click and clack'
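
The decomposition in the example above can be sketched in plain Java. This is only an illustration under stated assumptions: the class SubgramSketch and its emit method are hypothetical names, records are rendered as strings rather than as GramKey/Gram objects, and the ngram is assumed to be whitespace-delimited. It is not the mapper's actual code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; not part of Mahout.
class SubgramSketch {

    /**
     * Splits an ngram into its head (the leading (n-1)gram) and its tail
     * (the final 1gram), then lists the four key/value records described
     * in the text, rendered as plain strings.
     */
    static List<String> emit(String ngram) {
        String[] tokens = ngram.trim().split("\\s+");
        if (tokens.length < 2) {
            throw new IllegalArgumentException("need at least two tokens: " + ngram);
        }
        String head = String.join(" ", Arrays.copyOfRange(tokens, 0, tokens.length - 1));
        String tail = tokens[tokens.length - 1];
        String full = String.join(" ", tokens);

        List<String> records = new ArrayList<>();
        // Subgram keyed by itself, followed by the ngram keyed by (subgram, ngram).
        records.add("k: head_'" + head + "' v: head_'" + head + "'");
        records.add("k: head_'" + head + "',ngram_'" + full + "' v: ngram_'" + full + "'");
        records.add("k: tail_'" + tail + "' v: tail_'" + tail + "'");
        records.add("k: tail_'" + tail + "',ngram_'" + full + "' v: ngram_'" + full + "'");
        return records;
    }

    public static void main(String[] args) {
        for (String record : emit("click and clack")) {
            System.out.println(record);
        }
    }
}
```

Keying each ngram record by its subgram as well as by the ngram itself means that a subgram and every ngram containing it share a key prefix, which is presumably what allows a later phase to bring them together.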
 
Also counts the total number of ngrams encountered and adds it to the counter CollocDriver.Count.NGRAM_TOTAL.
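
As a rough picture of how ngrams are produced and tallied, the following plain-Java sketch slides a fixed-size window over a token array and keeps a running total. It makes several assumptions: the class NgramCountSketch and its ngrams method are hypothetical, the window stands in for the Lucene ShingleFilter without reproducing its exact behaviour, and a local long stands in for the Hadoop counter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; not part of Mahout or Lucene.
class NgramCountSketch {

    /** Returns every run of n consecutive tokens, joined with single spaces. */
    static List<String> ngrams(String[] tokens, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1, got " + n);
        }
        List<String> out = new ArrayList<>();
        // The window starts at i and covers tokens[i .. i + n - 1].
        for (int i = 0; i + n <= tokens.length; i++) {
            out.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + n)));
        }
        return out;
    }

    public static void main(String[] args) {
        String[] document = {"click", "and", "clack", "and", "click"};
        long ngramTotal = 0; // stands in for the job-wide ngram counter

        List<String> found = ngrams(document, 3);
        ngramTotal += found.size();

        for (String ngram : found) {
            System.out.println(ngram);
        }
        System.out.println("ngram total: " + ngramTotal);
    }
}
```

In the real job the total is reported through the Reporter passed to map, so that per-mapper counts are aggregated across the whole input rather than held in a local variable.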

Specified by:
map in interface org.apache.hadoop.mapred.Mapper<org.apache.hadoop.io.Text,StringTuple,GramKey,Gram>
Parameters:
key - The document id
value - The tokenized document
collector - The collector to send output to
reporter - Used to deliver the final ngram-count.
Throws:
java.io.IOException - if there's a problem with the ShingleFilter reading data or the collector collecting output.


Copyright © 2008-2010 The Apache Software Foundation. All Rights Reserved.