org.apache.mahout.utils.vectors.tfidf
Class TFIDFConverter
java.lang.Object
org.apache.mahout.utils.vectors.tfidf.TFIDFConverter
public final class TFIDFConverter
- extends java.lang.Object
This class converts a set of input vectors with term frequencies to TfIdf vectors. The SequenceFile input
should have a WritableComparable key and a VectorWritable value containing the term frequency vector.
This conversion class uses multiple map/reduce passes to convert the vectors to TfIdf format.
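To make the term-frequency to Tf-Idf conversion concrete, here is a minimal sketch of the textbook weighting for a single term. It is an illustration only: the formula `tf * ln(N / df)` and the names `TfIdfSketch` and `weight` are assumptions, not taken from this class, and TFIDFConverter may use a different variant of the weighting.

```java
// Textbook TF-IDF weighting, shown for illustration only.
// Assumption: weight = tf * ln(N / df). This is NOT confirmed to be the
// formula TFIDFConverter applies; the class and method names are hypothetical.
final class TfIdfSketch {

    /**
     * @param termFreq how often the term occurs in one document
     * @param docFreq  number of documents that contain the term
     * @param numDocs  total number of documents (vectors)
     */
    static double weight(int termFreq, int docFreq, int numDocs) {
        if (docFreq <= 0 || numDocs <= 0 || docFreq > numDocs) {
            throw new IllegalArgumentException("require 0 < docFreq <= numDocs");
        }
        // Rare terms (small docFreq) get a large log factor, common terms a small one.
        return termFreq * Math.log((double) numDocs / docFreq);
    }

    public static void main(String[] args) {
        // A term seen 3 times in a document and present in 10 of 1000 documents.
        System.out.println(weight(3, 10, 1000));
        // A term present in every document carries no weight under this formula.
        System.out.println(weight(3, 1000, 1000));
    }
}
```

Under this variant a term that appears in every document is weighted zero, which is the intuition behind pruning very frequent features.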
Method Summary
static org.apache.hadoop.fs.Path
getPath(java.lang.String basePath,
int index)
static void
processTfIdf(java.lang.String input,
java.lang.String output,
int chunkSizeInMegabytes,
int minDf,
int maxDFPercent,
float normPower,
boolean sequentialAccessOutput)
Create Term Frequency-Inverse Document Frequency (Tf-Idf) Vectors from the input set of vectors in
SequenceFile format.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
VECTOR_COUNT
public static final java.lang.String VECTOR_COUNT
- See Also:
- Constant Field Values
FEATURE_COUNT
public static final java.lang.String FEATURE_COUNT
- See Also:
- Constant Field Values
MIN_DF
public static final java.lang.String MIN_DF
- See Also:
- Constant Field Values
MAX_DF_PERCENTAGE
public static final java.lang.String MAX_DF_PERCENTAGE
- See Also:
- Constant Field Values
TFIDF_OUTPUT_FOLDER
public static final java.lang.String TFIDF_OUTPUT_FOLDER
- See Also:
- Constant Field Values
processTfIdf
public static void processTfIdf(java.lang.String input,
java.lang.String output,
int chunkSizeInMegabytes,
int minDf,
int maxDFPercent,
float normPower,
boolean sequentialAccessOutput)
throws java.io.IOException
- Create Term Frequency-Inverse Document Frequency (Tf-Idf) Vectors from the input set of vectors in
SequenceFile format. This job uses a fixed limit on the maximum memory used by the feature chunk
per node, thereby splitting the process across multiple map/reduce passes.
- Parameters:
input - input directory of the vectors in SequenceFile format
output - output directory where the RandomAccessSparseVectors of the documents are generated
minDf - The minimum document frequency. Default 1
maxDFPercent - The maximum percentage of vectors for the DF. Can be used to remove really high-frequency
features. Expressed as an integer between 0 and 100. Default 99
chunkSizeInMegabytes - the size in MB of the feature => id chunk to be kept in memory at each node during
the Map/Reduce stage. It is recommended that you calculate this based on the number of cores and the
free memory available to you per node. For example, if you have 2 cores and around 1GB of extra memory
to spare, we recommend a split size of around 400-500MB so that two simultaneous reducers can create
partial vectors without thrashing the system due to increased swapping
- Throws:
java.io.IOException
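The sizing advice for chunkSizeInMegabytes and the meaning of maxDFPercent can be sketched as two small helpers. This is a rough sketch under stated assumptions, not code from Mahout: the class `TfIdfTuning`, both method names, the 85% headroom factor, and the strict "greater than" comparison for the document-frequency cutoff are all hypothetical choices made for illustration.

```java
// Hypothetical helpers for choosing processTfIdf arguments; not part of Mahout.
final class TfIdfTuning {

    /**
     * Splits the spare memory on a node evenly across its cores and keeps
     * some headroom. Assumption: 85% of the per-core share, which puts the
     * documented "2 cores, ~1GB spare" case inside the 400-500MB range.
     */
    static int chunkSizeMb(int freeMemoryMb, int cores) {
        if (freeMemoryMb <= 0 || cores <= 0) {
            throw new IllegalArgumentException("memory and cores must be positive");
        }
        return (int) ((long) freeMemoryMb * 85 / ((long) cores * 100));
    }

    /**
     * Returns true when a feature occurs in more than maxDFPercent percent of
     * the vectors. Assumption: the boundary is exclusive (a feature sitting at
     * exactly the limit is kept); the page does not specify this.
     */
    static boolean exceedsMaxDf(long docFreq, long vectorCount, int maxDFPercent) {
        if (maxDFPercent < 0 || maxDFPercent > 100) {
            throw new IllegalArgumentException("maxDFPercent must be in [0, 100]");
        }
        // Integer arithmetic avoids floating-point rounding at the boundary.
        return docFreq * 100L > (long) maxDFPercent * vectorCount;
    }

    public static void main(String[] args) {
        int chunk = chunkSizeMb(1024, 2);
        System.out.println("chunkSizeInMegabytes = " + chunk);
        System.out.println("df 199 of 200 exceeds 99%: " + exceedsMaxDf(199, 200, 99));
        // The value would then be passed along the lines of (placeholders for
        // the arguments this page does not document):
        // TFIDFConverter.processTfIdf(input, output, chunk, 1, 99, normPower, sequentialAccessOutput);
    }
}
```

The headroom factor is the knob to adjust: a lower value leaves more room for the reducers' own working memory at the cost of more chunks, and therefore more map/reduce passes.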
getPath
public static org.apache.hadoop.fs.Path getPath(java.lang.String basePath,
int index)
Copyright © 2008-2010 The Apache Software Foundation. All Rights Reserved.