org.apache.mahout.vectorizer.tfidf
Class TFIDFConverter

java.lang.Object
  extended by org.apache.mahout.vectorizer.tfidf.TFIDFConverter

public final class TFIDFConverter
extends Object

This class converts a set of input vectors with term frequencies to TF-IDF vectors. The SequenceFile input should have a WritableComparable key and a VectorWritable value containing the term frequency vector. The conversion uses multiple map/reduce passes to convert the vectors to TF-IDF format.
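The weighting this converter produces can be illustrated with a small standalone sketch. Note that this mirrors the classic Lucene-style formula (dampened term frequency times smoothed inverse document frequency); the exact formula Mahout applies internally may differ, so treat this purely as an illustration of the TF-IDF idea, not as the class's implementation.

```java
// Illustrative sketch of per-term TF-IDF weighting (NOT Mahout's exact formula).
public class TfIdfSketch {

    /**
     * Weight for a term with raw frequency tf, appearing in df documents,
     * in a corpus of numDocs documents.
     */
    static double weight(int tf, int df, int numDocs) {
        double tfPart = Math.sqrt(tf);                                  // dampened term frequency
        double idfPart = 1.0 + Math.log((double) numDocs / (df + 1));   // smoothed inverse document frequency
        return tfPart * idfPart;
    }

    public static void main(String[] args) {
        // A term occurring 3 times, found in 2 of 10 documents.
        System.out.println(weight(3, 2, 10));
    }
}
```

Terms that occur often in a document but rarely across the corpus receive the highest weights, which is the property the conversion exploits.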


Field Summary
static String FEATURE_COUNT
           
static String MAX_DF_PERCENTAGE
           
static String MIN_DF
           
static String VECTOR_COUNT
           
 
Method Summary
static void processTfIdf(org.apache.hadoop.fs.Path input, org.apache.hadoop.fs.Path output, org.apache.hadoop.conf.Configuration baseConf, int chunkSizeInMegabytes, int minDf, int maxDFPercent, float normPower, boolean logNormalize, boolean sequentialAccessOutput, boolean namedVector, int numReducers)
          Create Term Frequency-Inverse Document Frequency (Tf-Idf) Vectors from the input set of vectors in SequenceFile format.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

VECTOR_COUNT

public static final String VECTOR_COUNT
See Also:
Constant Field Values

FEATURE_COUNT

public static final String FEATURE_COUNT
See Also:
Constant Field Values

MIN_DF

public static final String MIN_DF
See Also:
Constant Field Values

MAX_DF_PERCENTAGE

public static final String MAX_DF_PERCENTAGE
See Also:
Constant Field Values
Method Detail

processTfIdf

public static void processTfIdf(org.apache.hadoop.fs.Path input,
                                org.apache.hadoop.fs.Path output,
                                org.apache.hadoop.conf.Configuration baseConf,
                                int chunkSizeInMegabytes,
                                int minDf,
                                int maxDFPercent,
                                float normPower,
                                boolean logNormalize,
                                boolean sequentialAccessOutput,
                                boolean namedVector,
                                int numReducers)
                         throws IOException,
                                InterruptedException,
                                ClassNotFoundException
Create Term Frequency-Inverse Document Frequency (TF-IDF) vectors from the input set of vectors in SequenceFile format. This job enforces a fixed limit on the maximum memory used by the feature chunk per node, splitting the process across multiple map/reduce passes.

Parameters:
input - input directory of the vectors in SequenceFile format
output - output directory where the RandomAccessSparseVectors of the documents are generated
chunkSizeInMegabytes - the size in MB of the feature => id chunk to be kept in memory at each node during the Map/Reduce stage. It is recommended that you calculate this based on the number of cores and the free memory available per node. For example, with 2 cores and around 1 GB of spare memory, a split size of around 400-500 MB lets two simultaneous reducers create partial vectors without thrashing the system through increased swapping.
minDf - the minimum document frequency. Default: 1
maxDFPercent - the maximum document frequency, expressed as an integer percentage between 0 and 100 of the number of vectors. Can be used to remove very high-frequency features. Default: 99
numReducers - the number of reducers to spawn. This also bounds the possible parallelism, since each reducer typically produces a single output file containing TF-IDF vectors for a subset of the documents in the corpus.
Throws:
IOException
InterruptedException
ClassNotFoundException
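A typical invocation might look like the following. The paths, chunk size, and threshold values are hypothetical and chosen only to match the guidance in the parameter descriptions above; adjust them to your cluster and corpus.

```java
// Hypothetical driver for TFIDFConverter.processTfIdf.
// All paths and numeric settings below are illustrative, not prescriptive.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.vectorizer.tfidf.TFIDFConverter;

public class TfIdfDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        TFIDFConverter.processTfIdf(
            new Path("/data/tf-vectors"),  // input: term-frequency vectors in SequenceFile format
            new Path("/data/tfidf"),       // output directory for the TF-IDF vectors
            conf,                          // baseConf
            400,    // chunkSizeInMegabytes: per the guidance above for ~2 cores / ~1 GB spare
            1,      // minDf: keep all terms (the default)
            99,     // maxDFPercent: drop features in more than 99% of documents (the default)
            2.0f,   // normPower: power of the p-norm applied to each vector (2 = Euclidean)
            false,  // logNormalize
            true,   // sequentialAccessOutput
            false,  // namedVector
            4);     // numReducers
    }
}
```

Running this requires a working Hadoop configuration and the Mahout jars on the classpath; the call blocks until all map/reduce passes complete.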


Copyright © 2008-2011 The Apache Software Foundation. All Rights Reserved.