org.apache.mahout.vectorizer
Class DictionaryVectorizer
java.lang.Object
org.apache.mahout.vectorizer.DictionaryVectorizer
public final class DictionaryVectorizer
- extends Object
This class converts a set of input documents in SequenceFile format to vectors. The SequenceFile input should have a Text key containing the unique document identifier and a StringTuple value containing the tokenized document. You may use DocumentProcessor to tokenize the document. This is a dictionary-based Vectorizer.
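For example, here is a minimal sketch of the upstream tokenization step that produces the Text key / StringTuple value pairs this class expects. The paths are hypothetical, and the DocumentProcessor.tokenizeDocuments signature and the no-argument DefaultAnalyzer are assumptions about the surrounding Mahout release:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.mahout.vectorizer.DefaultAnalyzer;
    import org.apache.mahout.vectorizer.DocumentProcessor;

    public class TokenizeExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Input: SequenceFile of document id (Text) => raw document text (hypothetical path).
        Path rawDocs = new Path("raw-docs");
        // Output: SequenceFile of document id (Text) => tokenized document (StringTuple),
        // i.e. exactly the input format DictionaryVectorizer requires.
        Path tokenizedDocs = new Path("tokenized-docs");
        DocumentProcessor.tokenizeDocuments(rawDocs, DefaultAnalyzer.class, tokenizedDocs, conf);
      }
    }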
Method Summary

static void createTermFrequencyVectors(org.apache.hadoop.fs.Path input,
                                       org.apache.hadoop.fs.Path output,
                                       org.apache.hadoop.conf.Configuration baseConf,
                                       int minSupport,
                                       int maxNGramSize,
                                       float minLLRValue,
                                       float normPower,
                                       boolean logNormalize,
                                       int numReducers,
                                       int chunkSizeInMegabytes,
                                       boolean sequentialAccess,
                                       boolean namedVectors)
            Create Term Frequency (Tf) Vectors from the input set of documents in SequenceFile format.
Methods inherited from class java.lang.Object

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
DOCUMENT_VECTOR_OUTPUT_FOLDER

public static final String DOCUMENT_VECTOR_OUTPUT_FOLDER

MIN_SUPPORT

public static final String MIN_SUPPORT

MAX_NGRAMS

public static final String MAX_NGRAMS

DEFAULT_MIN_SUPPORT

public static final int DEFAULT_MIN_SUPPORT
createTermFrequencyVectors
public static void createTermFrequencyVectors(org.apache.hadoop.fs.Path input,
org.apache.hadoop.fs.Path output,
org.apache.hadoop.conf.Configuration baseConf,
int minSupport,
int maxNGramSize,
float minLLRValue,
float normPower,
boolean logNormalize,
int numReducers,
int chunkSizeInMegabytes,
boolean sequentialAccess,
boolean namedVectors)
throws IOException,
InterruptedException,
ClassNotFoundException
- Create Term Frequency (Tf) Vectors from the input set of documents in SequenceFile format. This tries to cap the maximum memory used by the feature chunk per node, thereby splitting the process across multiple map/reduce passes.
- Parameters:
input - input directory of the documents in SequenceFile format
output - output directory where the RandomAccessSparseVectors of the documents are generated
baseConf - job configuration
minSupport - the minimum frequency a feature must have in the entire corpus to be included in the sparse vector
maxNGramSize - 1 = unigrams; 2 = unigrams and bigrams; 3 = unigrams, bigrams and trigrams
minLLRValue - the minimum log-likelihood-ratio value used to prune n-grams
normPower - the L_p norm to be computed
logNormalize - whether to use log normalization
chunkSizeInMegabytes - the size in MB of the feature => id chunk kept in memory on each node during the Map/Reduce stage. It is recommended that you calculate this from the number of cores and the free memory available per node. For example, with 2 cores and about 1 GB of spare memory, a chunk size of around 400-500 MB lets two simultaneous reducers create partial vectors without thrashing the system through increased swapping.
- Throws:
IOException
InterruptedException
ClassNotFoundException
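To make the sizing advice concrete, here is a hedged invocation sketch following the signature above. All paths and parameter values are illustrative assumptions for a node with 2 cores and roughly 1 GB of spare memory:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.mahout.vectorizer.DictionaryVectorizer;

    public class TfVectorsExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path tokenizedDocs = new Path("tokenized-docs"); // Text key / StringTuple value input
        Path tfVectors = new Path("tf-vectors");         // hypothetical output directory
        DictionaryVectorizer.createTermFrequencyVectors(
            tokenizedDocs,
            tfVectors,
            conf,
            2,      // minSupport: drop features occurring fewer than 2 times in the corpus
            2,      // maxNGramSize: unigrams and bigrams
            1.0f,   // minLLRValue: prune bigrams below this log-likelihood ratio (illustrative)
            2.0f,   // normPower: L_2 norm
            false,  // logNormalize
            2,      // numReducers: one per core in this example
            450,    // chunkSizeInMegabytes: ~1 GB spare / 2 cores, with headroom (400-500 MB band)
            false,  // sequentialAccess
            false); // namedVectors
      }
    }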
Copyright © 2008-2011 The Apache Software Foundation. All Rights Reserved.