org.apache.mahout.vectorizer.tfidf
Class TFIDFConverter
java.lang.Object
org.apache.mahout.vectorizer.tfidf.TFIDFConverter
public final class TFIDFConverter
- extends java.lang.Object
This class converts a set of input vectors with term frequencies to TfIdf vectors. The SequenceFile input
should have a WritableComparable key and a VectorWritable value containing the term frequency vector.
This conversion class uses multiple map/reduce passes to convert the vectors to TfIdf format.
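As a rough illustration of what converting term frequency vectors to TfIdf vectors means, the sketch below reweights a toy in-memory corpus. It does not use Mahout or Hadoop. The class name TfIdfSketch, the use of plain maps in place of VectorWritable, and the textbook weighting tf * ln(N / df) are all assumptions made for the example; the weighting actually applied by this class is not stated on this page.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, self-contained illustration; not part of the Mahout API.
public class TfIdfSketch {

    /** Counts, for each term, how many documents contain it (document frequency). */
    static Map<String, Integer> documentFrequencies(List<Map<String, Integer>> tfVectors) {
        Map<String, Integer> df = new HashMap<>();
        for (Map<String, Integer> doc : tfVectors) {
            for (String term : doc.keySet()) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    /** Reweights each term frequency as tf * ln(N / df). Assumed textbook formula. */
    static List<Map<String, Double>> toTfIdf(List<Map<String, Integer>> tfVectors) {
        Map<String, Integer> df = documentFrequencies(tfVectors);
        int n = tfVectors.size();
        List<Map<String, Double>> result = new ArrayList<>();
        for (Map<String, Integer> doc : tfVectors) {
            Map<String, Double> weighted = new HashMap<>();
            for (Map.Entry<String, Integer> e : doc.entrySet()) {
                double idf = Math.log((double) n / df.get(e.getKey()));
                weighted.put(e.getKey(), e.getValue() * idf);
            }
            result.add(weighted);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> corpus = new ArrayList<>();
        corpus.add(Map.of("hadoop", 2, "vector", 1)); // document 0
        corpus.add(Map.of("vector", 3));              // document 1
        // "vector" occurs in every document, so its weight drops to zero.
        System.out.println(toTfIdf(corpus));
    }
}
```

The first pass over the corpus gathers document frequencies and the second applies them, which mirrors why a conversion of this kind needs more than one pass over the data.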
Method Summary
static void
processTfIdf(org.apache.hadoop.fs.Path input,
org.apache.hadoop.fs.Path output,
int chunkSizeInMegabytes,
int minDf,
int maxDFPercent,
float normPower,
boolean logNormalize,
boolean sequentialAccessOutput,
boolean namedVector,
int numReducers)
Create Term Frequency-Inverse Document Frequency (Tf-Idf) Vectors from the input set of vectors in
SequenceFile format.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
VECTOR_COUNT
public static final java.lang.String VECTOR_COUNT
- See Also:
- Constant Field Values
FEATURE_COUNT
public static final java.lang.String FEATURE_COUNT
- See Also:
- Constant Field Values
MIN_DF
public static final java.lang.String MIN_DF
- See Also:
- Constant Field Values
MAX_DF_PERCENTAGE
public static final java.lang.String MAX_DF_PERCENTAGE
- See Also:
- Constant Field Values
processTfIdf
public static void processTfIdf(org.apache.hadoop.fs.Path input,
org.apache.hadoop.fs.Path output,
int chunkSizeInMegabytes,
int minDf,
int maxDFPercent,
float normPower,
boolean logNormalize,
boolean sequentialAccessOutput,
boolean namedVector,
int numReducers)
throws java.io.IOException,
java.lang.InterruptedException,
java.lang.ClassNotFoundException
- Create Term Frequency-Inverse Document Frequency (Tf-Idf) Vectors from the input set of vectors in
SequenceFile format. This job uses a fixed limit on the maximum memory used by the feature chunk
per node, thereby splitting the process across multiple map/reduce passes.
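To make the effect of a fixed per-node memory limit concrete, the following sketch greedily packs dictionary entries into chunks that each stay under a size limit; every resulting chunk would then correspond to one pass. How this class actually partitions the feature dictionary is not described on this page, so the greedy strategy, the class name ChunkPlanSketch, and the method splitIntoChunks are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of splitting work under a memory limit; not Mahout code.
public class ChunkPlanSketch {

    /**
     * Packs entries (given by their size in bytes) into consecutive chunks so that
     * each chunk stays within chunkSizeInMegabytes. A single entry larger than the
     * limit still gets a chunk of its own.
     */
    static List<List<Long>> splitIntoChunks(List<Long> entrySizesInBytes, int chunkSizeInMegabytes) {
        long limit = chunkSizeInMegabytes * 1024L * 1024L;
        List<List<Long>> chunks = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long used = 0;
        for (long size : entrySizesInBytes) {
            // Start a new chunk when adding this entry would exceed the limit.
            if (!current.isEmpty() && used + size > limit) {
                chunks.add(current);
                current = new ArrayList<>();
                used = 0;
            }
            current.add(size);
            used += size;
        }
        if (!current.isEmpty()) {
            chunks.add(current);
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Three entries of 400,000 bytes with a 1 MB limit: two fit, the third spills over.
        List<List<Long>> chunks = splitIntoChunks(List.of(400_000L, 400_000L, 400_000L), 1);
        System.out.println(chunks.size()); // prints 2
    }
}
```

A smaller limit yields more chunks and therefore more passes, which is the trade-off behind the chunkSizeInMegabytes parameter.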
- Parameters:
input
- input directory of the vectors in SequenceFile format
output
- output directory where the RandomAccessSparseVectors of the documents are generated
chunkSizeInMegabytes
- the size in MB of the feature => id chunk to be kept in memory at each node during the Map/Reduce
stage. It is recommended that you calculate this based on the number of cores and the free memory
available to you per node. For example, with 2 cores and around 1GB of extra memory to spare, we
recommend a split size of around 400-500MB so that two simultaneous reducers can create
partial vectors without thrashing the system due to increased swapping
minDf
- The minimum document frequency. Default 1
maxDFPercent
- The max percentage of vectors for the DF. Can be used to remove really high frequency features.
Expressed as an integer between 0 and 100. Default 99
numReducers
- The number of reducers to spawn. This also affects the possible parallelism since each reducer
will typically produce a single output file containing tf-idf vectors for a subset of the
documents in the corpus.
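The interplay of minDf and maxDFPercent can be sketched as a filter over document frequencies. The example below keeps a feature only if it appears in at least minDf documents and in at most maxDFPercent percent of all vectors. The exact comparison rules used by this class (for instance whether the bounds are inclusive) are not given on this page, so the inclusive bounds, the class name DfPruningSketch, and the method prune are assumptions.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of document-frequency pruning; not Mahout code.
public class DfPruningSketch {

    /**
     * Returns the features whose document frequency is at least minDf and at most
     * maxDFPercent percent of vectorCount. Both bounds are treated as inclusive here.
     */
    static Map<String, Integer> prune(Map<String, Integer> df, int vectorCount,
                                      int minDf, int maxDFPercent) {
        Map<String, Integer> kept = new TreeMap<>();
        for (Map.Entry<String, Integer> e : df.entrySet()) {
            int count = e.getValue();
            boolean frequentEnough = count >= minDf;
            // Compare count / vectorCount <= maxDFPercent / 100 without floating point.
            boolean notTooCommon = (long) count * 100 <= (long) maxDFPercent * vectorCount;
            if (frequentEnough && notTooCommon) {
                kept.put(e.getKey(), count);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Integer> df = new HashMap<>();
        df.put("the", 100);   // appears in every one of 100 documents
        df.put("mahout", 5);
        df.put("typo", 1);    // appears only once
        System.out.println(prune(df, 100, 2, 99)); // prints {mahout=5}
    }
}
```

With the documented defaults (minDf 1, maxDFPercent 99), only features present in more than 99 percent of the vectors would be dropped under these assumed rules.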
- Throws:
java.io.IOException
java.lang.InterruptedException
java.lang.ClassNotFoundException
Copyright © 2008-2010 The Apache Software Foundation. All Rights Reserved.