org.apache.mahout.utils.vectors.tfidf
Class TFIDFConverter

java.lang.Object
  extended by org.apache.mahout.utils.vectors.tfidf.TFIDFConverter

public final class TFIDFConverter
extends java.lang.Object

This class converts a set of input vectors with term frequencies to TfIdf vectors. The SequenceFile input should have a WritableComparable key and a VectorWritable value containing the term frequency vector. This conversion class uses multiple map/reduce passes to convert the vectors to TfIdf format.
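As an illustration of the weighting this converter produces, the sketch below computes a term weight with the common tf * ln(N/df) formulation. This is a simplification for clarity only; Mahout's actual weighting implementation is not shown here and may differ (e.g. a Lucene-style similarity).

```java
// Simplified Tf-Idf weighting sketch: weight = tf * ln(numDocs / df).
// Illustrative only; not Mahout's exact formula.
public class TfIdfSketch {

    /**
     * @param tf      term frequency of the term in one document
     * @param df      number of documents containing the term
     * @param numDocs total number of documents (cf. VECTOR_COUNT)
     */
    public static double tfIdf(int tf, int df, int numDocs) {
        if (tf == 0 || df == 0) {
            return 0.0;
        }
        return tf * Math.log((double) numDocs / df);
    }

    public static void main(String[] args) {
        // A term appearing in every document carries no discriminative weight.
        System.out.println(tfIdf(5, 100, 100)); // 0.0
        // A rarer term with the same tf is weighted higher.
        System.out.println(tfIdf(5, 2, 100));
    }
}
```

Note how features present in nearly every document get weights near zero, which is why the maxDFPercent parameter below can be used to drop them entirely.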


Field Summary
static java.lang.String FEATURE_COUNT
           
static java.lang.String MAX_DF_PERCENTAGE
           
static java.lang.String MIN_DF
           
static java.lang.String TFIDF_OUTPUT_FOLDER
           
static java.lang.String VECTOR_COUNT
           
 
Method Summary
static org.apache.hadoop.fs.Path getPath(java.lang.String basePath, int index)
           
static void processTfIdf(java.lang.String input, java.lang.String output, int chunkSizeInMegabytes, int minDf, int maxDFPercent, float normPower, boolean sequentialAccessOutput)
          Create Term Frequency-Inverse Document Frequency (Tf-Idf) Vectors from the input set of vectors in SequenceFile format.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

VECTOR_COUNT

public static final java.lang.String VECTOR_COUNT
See Also:
Constant Field Values

FEATURE_COUNT

public static final java.lang.String FEATURE_COUNT
See Also:
Constant Field Values

MIN_DF

public static final java.lang.String MIN_DF
See Also:
Constant Field Values

MAX_DF_PERCENTAGE

public static final java.lang.String MAX_DF_PERCENTAGE
See Also:
Constant Field Values

TFIDF_OUTPUT_FOLDER

public static final java.lang.String TFIDF_OUTPUT_FOLDER
See Also:
Constant Field Values

Method Detail

processTfIdf

public static void processTfIdf(java.lang.String input,
                                java.lang.String output,
                                int chunkSizeInMegabytes,
                                int minDf,
                                int maxDFPercent,
                                float normPower,
                                boolean sequentialAccessOutput)
                         throws java.io.IOException
Create Term Frequency-Inverse Document Frequency (Tf-Idf) Vectors from the input set of vectors in SequenceFile format. This job uses a fixed limit on the maximum memory used by the feature chunk per node thereby splitting the process across multiple map/reduces.

Parameters:
input - input directory of the vectors in SequenceFile format
output - output directory where RandomAccessSparseVectors of the documents are generated
chunkSizeInMegabytes - the size in MB of the feature-to-id chunk kept in memory at each node during the Map/Reduce stage. It is recommended that you calculate this based on the number of cores and the free memory available per node. For example, with 2 cores and about 1GB of memory to spare, a chunk size of around 400-500MB lets two simultaneous reducers create partial vectors without thrashing the system through increased swapping.
minDf - The minimum document frequency. Default 1
maxDFPercent - The max percentage of vectors for the DF. Can be used to remove really high frequency features. Expressed as an integer between 0 and 100. Default 99
Throws:
java.io.IOException
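A hypothetical invocation might look like the following. The paths and tuning values are assumptions for illustration, and running it requires Mahout and Hadoop on the classpath and a SequenceFile of term frequency vectors at the input path.

```java
import org.apache.mahout.utils.vectors.tfidf.TFIDFConverter;

public class TfIdfJob {
    public static void main(String[] args) throws java.io.IOException {
        TFIDFConverter.processTfIdf(
            "tf-vectors",    // input: SequenceFile of term frequency vectors (assumed path)
            "tfidf-vectors", // output: directory for the generated Tf-Idf vectors (assumed path)
            300,             // chunkSizeInMegabytes: feature-to-id chunk held in memory per node
            1,               // minDf: minimum document frequency (default 1)
            99,              // maxDFPercent: maximum DF percentage (default 99)
            2.0f,            // normPower: norm to apply to the vectors (assumed value)
            false);          // sequentialAccessOutput: keep RandomAccessSparseVector output
    }
}
```

Note that the argument order follows the signature above (chunkSizeInMegabytes before minDf), not the order in which the parameters happen to be documented.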

getPath

public static org.apache.hadoop.fs.Path getPath(java.lang.String basePath,
                                                int index)


Copyright © 2008-2010 The Apache Software Foundation. All Rights Reserved.