org.apache.mahout.vectorizer.tfidf
Class TFIDFConverter

java.lang.Object
  extended by org.apache.mahout.vectorizer.tfidf.TFIDFConverter

public final class TFIDFConverter
extends java.lang.Object

This class converts a set of input vectors with term frequencies to TfIdf vectors. The SequenceFile input should have a WritableComparable key and a VectorWritable value containing the term frequency vector. This conversion class uses multiple map/reduce jobs to convert the vectors to TfIdf format.
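As an illustrative sketch only, the textbook tf-idf weight for a term is its term frequency scaled by the log of the inverse document frequency. Mahout's actual TfIdf job may differ in smoothing and in the normalization applied afterwards (see the normPower and logNormalize options of processTfIdf below); the class and method names here are hypothetical:

```java
// Textbook tf-idf: weight = tf * log(numDocs / df).
// This is NOT Mahout's exact weighting, only the underlying idea.
public class TfIdfSketch {

    /** Weight of a term occurring tf times in a document and in df of numDocs documents. */
    static double tfIdf(double tf, long df, long numDocs) {
        return tf * Math.log((double) numDocs / df);
    }

    public static void main(String[] args) {
        // A term seen 3 times in a document but in only 10 of 1000 documents
        // gets a much higher weight than one appearing in every document.
        System.out.println(tfIdf(3, 10, 1000));   // high weight: rare, discriminative term
        System.out.println(tfIdf(3, 1000, 1000)); // 0.0: appears everywhere, carries no signal
    }
}
```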


Field Summary
static java.lang.String FEATURE_COUNT
static java.lang.String MAX_DF_PERCENTAGE
static java.lang.String MIN_DF
static java.lang.String VECTOR_COUNT
 
Method Summary
static void processTfIdf(org.apache.hadoop.fs.Path input, org.apache.hadoop.fs.Path output, int chunkSizeInMegabytes, int minDf, int maxDFPercent, float normPower, boolean logNormalize, boolean sequentialAccessOutput, boolean namedVector, int numReducers)
          Create Term Frequency-Inverse Document Frequency (Tf-Idf) Vectors from the input set of vectors in SequenceFile format.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

VECTOR_COUNT

public static final java.lang.String VECTOR_COUNT
See Also:
Constant Field Values

FEATURE_COUNT

public static final java.lang.String FEATURE_COUNT
See Also:
Constant Field Values

MIN_DF

public static final java.lang.String MIN_DF
See Also:
Constant Field Values

MAX_DF_PERCENTAGE

public static final java.lang.String MAX_DF_PERCENTAGE
See Also:
Constant Field Values

Method Detail

processTfIdf

public static void processTfIdf(org.apache.hadoop.fs.Path input,
                                org.apache.hadoop.fs.Path output,
                                int chunkSizeInMegabytes,
                                int minDf,
                                int maxDFPercent,
                                float normPower,
                                boolean logNormalize,
                                boolean sequentialAccessOutput,
                                boolean namedVector,
                                int numReducers)
                         throws java.io.IOException,
                                java.lang.InterruptedException,
                                java.lang.ClassNotFoundException
Create Term Frequency-Inverse Document Frequency (Tf-Idf) vectors from the input set of vectors in SequenceFile format. This job imposes a fixed limit on the maximum memory used by the feature chunk per node, splitting the process across multiple map/reduce passes.

Parameters:
input - input directory of the vectors in SequenceFile format
output - output directory where RandomAccessSparseVector's of the document are generated
chunkSizeInMegabytes - the size in MB of the feature-to-id chunk to be kept in memory at each node during the Map/Reduce stage. It is recommended that you calculate this from the number of cores and the free memory available per node. For example, with 2 cores and roughly 1 GB of memory to spare, a chunk size of around 400-500 MB lets two simultaneous reducers create partial vectors without thrashing the system through increased swapping
minDf - the minimum document frequency. Default 1
maxDFPercent - the maximum document-frequency percentage for a feature; can be used to remove very high-frequency features. Expressed as an integer between 0 and 100. Default 99
normPower - the p-norm used to normalize the output vectors; pass a negative value to skip normalization
logNormalize - whether to apply log normalization to the output vectors
sequentialAccessOutput - if true, output vectors are written as SequentialAccessSparseVectors instead of RandomAccessSparseVectors
namedVector - if true, output vectors are wrapped as NamedVectors
numReducers - the number of reducers to spawn. This also bounds the parallelism, since each reducer typically produces a single output file containing tf-idf vectors for a subset of the documents in the corpus.
Throws:
java.io.IOException
java.lang.InterruptedException
java.lang.ClassNotFoundException
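
A minimal invocation sketch of processTfIdf, using the signature above. The HDFS paths are hypothetical, the input directory is assumed to already contain term-frequency vectors in SequenceFile format, and running this requires Hadoop and Mahout on the classpath:

```java
import org.apache.hadoop.fs.Path;
import org.apache.mahout.vectorizer.tfidf.TFIDFConverter;

public class TfIdfJob {
    public static void main(String[] args) throws Exception {
        Path input = new Path("/tfidf/tf-vectors"); // hypothetical dir of tf VectorWritables
        Path output = new Path("/tfidf/output");    // hypothetical output dir

        TFIDFConverter.processTfIdf(
            input,
            output,
            100,    // chunkSizeInMegabytes: feature-chunk memory budget per node
            1,      // minDf: minimum document frequency
            99,     // maxDFPercent: drop features in more than 99% of documents
            2.0f,   // normPower: normalize output vectors with the L2 norm
            false,  // logNormalize: no log normalization
            true,   // sequentialAccessOutput: write SequentialAccessSparseVectors
            false,  // namedVector: plain (unnamed) vectors
            1);     // numReducers: one output file of tf-idf vectors
    }
}
```

The argument values shown are illustrative defaults, not recommendations; chunkSizeInMegabytes and numReducers in particular should be tuned to the cluster as described under Parameters above.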


Copyright © 2008-2010 The Apache Software Foundation. All Rights Reserved.