org.apache.mahout.vectorizer
Class DictionaryVectorizer

java.lang.Object
  extended by org.apache.mahout.vectorizer.DictionaryVectorizer

public final class DictionaryVectorizer
extends java.lang.Object

This class converts a set of input documents in SequenceFile format to vectors. The SequenceFile input should have a Text key containing the unique document identifier and a StringTuple value containing the tokenized document. You may use DocumentProcessor to tokenize the documents. This is a dictionary-based vectorizer.
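To make the dictionary-based approach concrete, the following is a minimal in-memory sketch, not Mahout's implementation. It assumes a dictionary is a feature => id map built from n-grams that meet a minimum corpus frequency, and that a term frequency vector maps feature ids to per-document counts. The class and method names (DictionaryVectorizerSketch, ngrams, buildDictionary, termFrequencies) are hypothetical, and pruning by log-likelihood ratio is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative in-memory sketch only; this is NOT Mahout's implementation.
final class DictionaryVectorizerSketch {

  // Expand a tokenized document into all n-grams of size 1..maxNGramSize,
  // joining the tokens of each n-gram with a single space.
  static List<String> ngrams(List<String> tokens, int maxNGramSize) {
    List<String> out = new ArrayList<>();
    for (int n = 1; n <= maxNGramSize; n++) {
      for (int i = 0; i + n <= tokens.size(); i++) {
        out.add(String.join(" ", tokens.subList(i, i + n)));
      }
    }
    return out;
  }

  // Build a feature => id dictionary from the features whose frequency
  // over the entire corpus is at least minSupport. Ids follow sorted order.
  static Map<String, Integer> buildDictionary(List<List<String>> docs,
                                              int minSupport,
                                              int maxNGramSize) {
    Map<String, Integer> counts = new TreeMap<>();
    for (List<String> doc : docs) {
      for (String feature : ngrams(doc, maxNGramSize)) {
        counts.merge(feature, 1, Integer::sum);
      }
    }
    Map<String, Integer> dictionary = new TreeMap<>();
    int nextId = 0;
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      if (e.getValue() >= minSupport) {
        dictionary.put(e.getKey(), nextId++);
      }
    }
    return dictionary;
  }

  // Sparse term frequency vector: feature id -> count within one document.
  // Features missing from the dictionary are dropped.
  static Map<Integer, Integer> termFrequencies(List<String> doc,
                                               Map<String, Integer> dictionary,
                                               int maxNGramSize) {
    Map<Integer, Integer> vector = new TreeMap<>();
    for (String feature : ngrams(doc, maxNGramSize)) {
      Integer id = dictionary.get(feature);
      if (id != null) {
        vector.merge(id, 1, Integer::sum);
      }
    }
    return vector;
  }
}
```

In this sketch the whole dictionary lives in one map. The class documented here instead works on SequenceFile input through Map/Reduce jobs and keeps only a chunk of the dictionary in memory at a time.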


Field Summary
static int DEFAULT_MIN_SUPPORT
           
static java.lang.String DOCUMENT_VECTOR_OUTPUT_FOLDER
           
static java.lang.String MAX_NGRAMS
           
static java.lang.String MIN_SUPPORT
           
 
Method Summary
static void createTermFrequencyVectors(org.apache.hadoop.fs.Path input, org.apache.hadoop.fs.Path output, org.apache.hadoop.conf.Configuration baseConf, int minSupport, int maxNGramSize, float minLLRValue, float normPower, boolean logNormalize, int numReducers, int chunkSizeInMegabytes, boolean sequentialAccess, boolean namedVectors)
          Create Term Frequency (Tf) Vectors from the input set of documents in SequenceFile format.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

DOCUMENT_VECTOR_OUTPUT_FOLDER

public static final java.lang.String DOCUMENT_VECTOR_OUTPUT_FOLDER
See Also:
Constant Field Values

MIN_SUPPORT

public static final java.lang.String MIN_SUPPORT
See Also:
Constant Field Values

MAX_NGRAMS

public static final java.lang.String MAX_NGRAMS
See Also:
Constant Field Values

DEFAULT_MIN_SUPPORT

public static final int DEFAULT_MIN_SUPPORT
See Also:
Constant Field Values
Method Detail

createTermFrequencyVectors

public static void createTermFrequencyVectors(org.apache.hadoop.fs.Path input,
                                              org.apache.hadoop.fs.Path output,
                                              org.apache.hadoop.conf.Configuration baseConf,
                                              int minSupport,
                                              int maxNGramSize,
                                              float minLLRValue,
                                              float normPower,
                                              boolean logNormalize,
                                              int numReducers,
                                              int chunkSizeInMegabytes,
                                              boolean sequentialAccess,
                                              boolean namedVectors)
                                       throws java.io.IOException,
                                              java.lang.InterruptedException,
                                              java.lang.ClassNotFoundException
Create Term Frequency (Tf) Vectors from the input set of documents in SequenceFile format. This caps the maximum memory used by the feature chunk on each node, and therefore splits the process across multiple map/reduce passes.

Parameters:
input - input directory of the documents in SequenceFile format
output - output directory where the RandomAccessSparseVectors of the documents are generated
normPower - L_p norm to be computed
logNormalize - whether to use log normalization
minSupport - the minimum frequency of the feature in the entire corpus to be considered for inclusion in the sparse vector
maxNGramSize - 1 = unigram, 2 = unigram and bigram, 3 = unigram, bigram and trigram
minLLRValue - minimum value of the log-likelihood ratio used to prune n-grams
chunkSizeInMegabytes - the size in MB of the feature => id chunk to be kept in memory at each node during the Map/Reduce stage. It is recommended that you calculate this based on the number of cores and the free memory available per node. For example, if you have 2 cores and around 1GB of spare memory, we recommend a split size of around 400-500MB, so that two simultaneous reducers can create partial vectors without thrashing the system through increased swapping.
Throws:
java.io.IOException
java.lang.InterruptedException
java.lang.ClassNotFoundException
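
The chunkSizeInMegabytes guidance above can be turned into a rough calculation. The sketch below is one possible heuristic, not a formula taken from Mahout: it assumes one simultaneous reducer per core, divides the spare memory evenly among them, and holds back a fraction as headroom. The class name ChunkSizeEstimator, its method, and the headroomFraction parameter are all hypothetical.

```java
// Hypothetical helper; the heuristic is an assumption, not Mahout's own rule.
final class ChunkSizeEstimator {

  // Split the spare memory evenly across reducers that may run at the same
  // time (assumed here to be one per core), then keep headroomFraction of
  // each share free to reduce the risk of swapping.
  static int chunkSizeInMegabytes(int cores, int spareMemoryMb, double headroomFraction) {
    if (cores < 1 || spareMemoryMb < 1) {
      throw new IllegalArgumentException("cores and spareMemoryMb must be positive");
    }
    if (headroomFraction < 0.0 || headroomFraction >= 1.0) {
      throw new IllegalArgumentException("headroomFraction must be in [0, 1)");
    }
    double perReducerMb = (double) spareMemoryMb / cores;
    // Never return less than 1 MB, even for very small inputs.
    return Math.max(1, (int) Math.floor(perReducerMb * (1.0 - headroomFraction)));
  }
}
```

With 2 cores, 1024 MB of spare memory and 10% headroom, this heuristic yields 460, which falls inside the 400-500MB range suggested above. The number of cores could be read from Runtime.getRuntime().availableProcessors(), although the value that matters is the number of reducers actually scheduled per node.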


Copyright © 2008-2010 The Apache Software Foundation. All Rights Reserved.