org.apache.mahout.utils.nlp.collocations.llr
Class CollocMapper
java.lang.Object
org.apache.hadoop.mapred.MapReduceBase
org.apache.mahout.utils.nlp.collocations.llr.CollocMapper
- All Implemented Interfaces:
- java.io.Closeable, org.apache.hadoop.mapred.JobConfigurable, org.apache.hadoop.mapred.Mapper<org.apache.hadoop.io.Text,StringTuple,GramKey,Gram>
public class CollocMapper
- extends org.apache.hadoop.mapred.MapReduceBase
- implements org.apache.hadoop.mapred.Mapper<org.apache.hadoop.io.Text,StringTuple,GramKey,Gram>
Pass 1 of the Collocation discovery job: generates ngrams and emits each ngram along with its component (n-1)-grams.
Input is a SequenceFile where the key is a document id and the value is the tokenized document.
Method Summary
void configure(org.apache.hadoop.mapred.JobConf job)
void map(org.apache.hadoop.io.Text key,
         StringTuple value,
         org.apache.hadoop.mapred.OutputCollector<GramKey,Gram> collector,
         org.apache.hadoop.mapred.Reporter reporter)
     Collocation finder: pass 1 map phase.
Methods inherited from class org.apache.hadoop.mapred.MapReduceBase
close
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface java.io.Closeable
close
MAX_SHINGLE_SIZE
public static final java.lang.String MAX_SHINGLE_SIZE
- See Also:
- Constant Field Values
DEFAULT_MAX_SHINGLE_SIZE
public static final int DEFAULT_MAX_SHINGLE_SIZE
- See Also:
- Constant Field Values
CollocMapper
public CollocMapper()
configure
public void configure(org.apache.hadoop.mapred.JobConf job)
- Specified by:
configure
in interface org.apache.hadoop.mapred.JobConfigurable
- Overrides:
configure
in class org.apache.hadoop.mapred.MapReduceBase
map
public void map(org.apache.hadoop.io.Text key,
StringTuple value,
org.apache.hadoop.mapred.OutputCollector<GramKey,Gram> collector,
org.apache.hadoop.mapred.Reporter reporter)
throws java.io.IOException
- Collocation finder: pass 1 map phase.
Receives a token stream which is passed through a Lucene ShingleFilter. The ShingleFilter delivers ngrams of
the appropriate size, which are then decomposed into head and tail subgrams and collected in the
following manner:
k:head_key, v:head_subgram
k:head_key,ngram_key, v:ngram
k:tail_key, v:tail_subgram
k:tail_key,ngram_key, v:ngram
The 'head' or 'tail' prefix is used to specify whether the subgram in question is the head or tail of the
ngram. In this implementation the head of the ngram is a (n-1)gram, and the tail is a (1)gram.
For example, given 'click and clack' and an ngram length of 3:
k: head_'click and' v:head_'click and'
k: head_'click and',ngram_'click and clack' v:ngram_'click and clack'
k: tail_'clack', v:tail_'clack'
k: tail_'clack',ngram_'click and clack' v:ngram_'click and clack'
Also counts the total number of ngrams encountered and adds it to the counter
CollocDriver.Count.NGRAM_TOTAL
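The head/tail decomposition above can be sketched in plain Java. This is an illustrative stand-in, not the Mahout implementation: it uses simple strings in place of the GramKey/Gram writables and renders each emitted key/value pair as text, splitting the ngram at its last token so the head is the (n-1)-gram and the tail is the final unigram.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Illustrative sketch of CollocMapper's head/tail decomposition
 * (hypothetical helper, not part of Mahout). For a given ngram it
 * produces the four key/value pairs described in the map() javadoc.
 */
public class ShingleDecompose {

    static List<String> decompose(String ngram) {
        int lastSpace = ngram.lastIndexOf(' ');
        String head = ngram.substring(0, lastSpace);   // (n-1)-gram head
        String tail = ngram.substring(lastSpace + 1);  // unigram tail
        List<String> emitted = new ArrayList<>();
        // head subgram under its own key, then the full ngram under head_key,ngram_key
        emitted.add("k:head_'" + head + "', v:head_'" + head + "'");
        emitted.add("k:head_'" + head + "',ngram_'" + ngram + "', v:ngram_'" + ngram + "'");
        // tail subgram under its own key, then the full ngram under tail_key,ngram_key
        emitted.add("k:tail_'" + tail + "', v:tail_'" + tail + "'");
        emitted.add("k:tail_'" + tail + "',ngram_'" + ngram + "', v:ngram_'" + ngram + "'");
        return emitted;
    }

    public static void main(String[] args) {
        for (String line : decompose("click and clack")) {
            System.out.println(line);
        }
    }
}
```

Running this against 'click and clack' reproduces the four pairs shown in the example above.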
- Specified by:
map
in interface org.apache.hadoop.mapred.Mapper<org.apache.hadoop.io.Text,StringTuple,GramKey,Gram>
- Parameters:
collector
- The collector to send output to
reporter
- Used to deliver the final ngram count.
- Throws:
java.io.IOException
- if there's a problem with the ShingleFilter reading data or the collector collecting output.
Copyright © 2008-2010 The Apache Software Foundation. All Rights Reserved.