org.apache.mahout.vectorizer
Class DictionaryVectorizer
java.lang.Object
org.apache.mahout.vectorizer.DictionaryVectorizer
public final class DictionaryVectorizer
- extends java.lang.Object
This class converts a set of input documents in SequenceFile format to vectors. The SequenceFile input should have a Text key containing the unique document identifier and a StringTuple value containing the tokenized document. You may use DocumentProcessor to tokenize the document.
This is a dictionary-based Vectorizer.
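A minimal sketch of preparing the expected input: a SequenceFile whose key is the document id (Text) and whose value is the pre-tokenized document (StringTuple). The class name, output path, and tokens below are illustrative placeholders, not part of this API.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.mahout.common.StringTuple;

    public class WriteTokenizedDocs {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path docs = new Path("tokenized-documents/part-00000"); // placeholder path

        SequenceFile.Writer writer =
            SequenceFile.createWriter(fs, conf, docs, Text.class, StringTuple.class);
        try {
          StringTuple tokens = new StringTuple();
          tokens.add("the");
          tokens.add("quick");
          tokens.add("brown");
          tokens.add("fox");
          writer.append(new Text("doc-1"), tokens); // key: unique document identifier
        } finally {
          writer.close();
        }
      }
    }

In practice the tokenized input would usually be produced by DocumentProcessor rather than written by hand, as noted above.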
Method Summary
static void createTermFrequencyVectors(org.apache.hadoop.fs.Path input,
                                       org.apache.hadoop.fs.Path output,
                                       org.apache.hadoop.conf.Configuration baseConf,
                                       int minSupport,
                                       int maxNGramSize,
                                       float minLLRValue,
                                       float normPower,
                                       boolean logNormalize,
                                       int numReducers,
                                       int chunkSizeInMegabytes,
                                       boolean sequentialAccess,
                                       boolean namedVectors)
            Create Term Frequency (Tf) Vectors from the input set of documents in SequenceFile format.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
DOCUMENT_VECTOR_OUTPUT_FOLDER
public static final java.lang.String DOCUMENT_VECTOR_OUTPUT_FOLDER
- See Also:
- Constant Field Values
MIN_SUPPORT
public static final java.lang.String MIN_SUPPORT
- See Also:
- Constant Field Values
MAX_NGRAMS
public static final java.lang.String MAX_NGRAMS
- See Also:
- Constant Field Values
DEFAULT_MIN_SUPPORT
public static final int DEFAULT_MIN_SUPPORT
- See Also:
- Constant Field Values
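The string constants above are the configuration keys and output-folder name used by the vectorizer, and DEFAULT_MIN_SUPPORT is the fallback minimum support. A hedged sketch of locating the generated term-frequency vectors, assuming they are written under the folder named by DOCUMENT_VECTOR_OUTPUT_FOLDER inside the output directory passed to createTermFrequencyVectors (the path below is a placeholder):

    import org.apache.hadoop.fs.Path;
    import org.apache.mahout.vectorizer.DictionaryVectorizer;

    public class LocateTfVectors {
      public static void main(String[] args) {
        // The same 'output' Path that was passed to createTermFrequencyVectors
        Path output = new Path("vectorizer-output");
        Path tfVectors = new Path(output, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER);
        // The vectors here can typically be read back as (Text, VectorWritable) pairs.
        System.out.println("TF vectors expected under: " + tfVectors);
      }
    }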
createTermFrequencyVectors
public static void createTermFrequencyVectors(org.apache.hadoop.fs.Path input,
org.apache.hadoop.fs.Path output,
org.apache.hadoop.conf.Configuration baseConf,
int minSupport,
int maxNGramSize,
float minLLRValue,
float normPower,
boolean logNormalize,
int numReducers,
int chunkSizeInMegabytes,
boolean sequentialAccess,
boolean namedVectors)
throws java.io.IOException,
java.lang.InterruptedException,
java.lang.ClassNotFoundException
- Create Term Frequency (Tf) Vectors from the input set of documents in SequenceFile format. This bounds the maximum memory used by the feature chunk on each node, splitting the process across multiple map/reduce passes where necessary.
- Parameters:
  input - input directory of the documents in SequenceFile format
  output - output directory where RandomAccessSparseVectors of the documents are generated
  normPower - L_p norm to be computed
  logNormalize - whether to use log normalization
  minSupport - the minimum frequency of a feature in the entire corpus for it to be considered for inclusion in the sparse vector
  maxNGramSize - 1 = unigram, 2 = unigram and bigram, 3 = unigram, bigram and trigram
  minLLRValue - minimum log-likelihood ratio value used to prune ngrams
  chunkSizeInMegabytes - the size in MB of the feature => id chunk to be kept in memory on each node during the Map/Reduce stage. It is recommended that you calculate this from the number of cores and the free memory available per node. For example, with 2 cores and around 1 GB of memory to spare, a chunk size of around 400-500 MB lets two simultaneous reducers create partial vectors without thrashing the system through increased swapping.
- Throws:
java.io.IOException
java.lang.InterruptedException
java.lang.ClassNotFoundException
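A hedged end-to-end sketch of invoking createTermFrequencyVectors with the signature documented above; the paths and parameter values are illustrative choices, not prescribed by this API.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.mahout.vectorizer.DictionaryVectorizer;

    public class CreateTfVectorsExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path tokenizedDocs = new Path("tokenized-documents"); // SequenceFile<Text, StringTuple> input
        Path output = new Path("vectorizer-output");

        DictionaryVectorizer.createTermFrequencyVectors(
            tokenizedDocs,   // input
            output,          // output
            conf,            // baseConf
            2,               // minSupport: drop features seen fewer than 2 times in the corpus
            1,               // maxNGramSize: unigrams only
            0.0f,            // minLLRValue: no LLR pruning is relevant for unigrams
            2.0f,            // normPower: L_2 norm
            false,           // logNormalize
            1,               // numReducers
            512,             // chunkSizeInMegabytes: roughly free memory / simultaneous reducers, per the note above
            false,           // sequentialAccess
            true);           // namedVectors
      }
    }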
Copyright © 2008-2010 The Apache Software Foundation. All Rights Reserved.