org.apache.mahout.vectorizer
Class DocumentProcessor

java.lang.Object
  extended by org.apache.mahout.vectorizer.DocumentProcessor

public final class DocumentProcessor
extends java.lang.Object

This class converts a set of input documents in SequenceFile format into StringTuples. The SequenceFile input should have a Text key containing the unique document identifier and a Text value containing the whole document. The document should be stored in UTF-8, the encoding used by Hadoop's Text type. It uses the given Analyzer to process each document into tokens.
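
The data flow described above can be sketched in plain Java, with no Hadoop, Lucene, or Mahout dependencies. This is an illustration under stated assumptions, not the class's implementation: TokenizeSketch and its methods are hypothetical names, an in-memory Map stands in for the SequenceFile records, a List of strings stands in for StringTuple, and a simple lower-case splitter stands in for the Lucene Analyzer.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class TokenizeSketch {

    // Hypothetical stand-in for a Lucene Analyzer: lower-cases the text and
    // splits it on runs of characters that are neither letters nor digits.
    public static List<String> analyze(String document) {
        List<String> tokens = new ArrayList<>();
        for (String piece : document.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) { // split() yields an empty first piece for a leading delimiter
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // Models the input as (document id -> whole document) pairs and the
    // output as (document id -> ordered token list) pairs. The real class
    // reads and writes these pairs as SequenceFile records instead.
    public static Map<String, List<String>> tokenizeDocuments(Map<String, String> documents) {
        Map<String, List<String>> tokenized = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : documents.entrySet()) {
            tokenized.put(entry.getKey(), analyze(entry.getValue()));
        }
        return tokenized;
    }

    public static void main(String[] args) {
        Map<String, String> documents = new LinkedHashMap<>();
        documents.put("doc-1", "Mahout converts documents into tokens.");
        documents.put("doc-2", "UTF-8 text, one document per record.");
        System.out.println(tokenizeDocuments(documents));
    }
}
```

Each document identifier is preserved as the key, and the token order within a document follows the order of the text, which is the shape the tokenized output is expected to have.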


Field Summary
static java.lang.String ANALYZER_CLASS
           
static java.lang.String TOKENIZED_DOCUMENT_OUTPUT_FOLDER
           
 
Method Summary
static void tokenizeDocuments(org.apache.hadoop.fs.Path input, java.lang.Class<? extends org.apache.lucene.analysis.Analyzer> analyzerClass, org.apache.hadoop.fs.Path output)
          Converts the input documents into token arrays, stored as StringTuples. The input documents must be in SequenceFile format.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

TOKENIZED_DOCUMENT_OUTPUT_FOLDER

public static final java.lang.String TOKENIZED_DOCUMENT_OUTPUT_FOLDER
See Also:
Constant Field Values

ANALYZER_CLASS

public static final java.lang.String ANALYZER_CLASS
See Also:
Constant Field Values

Method Detail

tokenizeDocuments

public static void tokenizeDocuments(org.apache.hadoop.fs.Path input,
                                     java.lang.Class<? extends org.apache.lucene.analysis.Analyzer> analyzerClass,
                                     org.apache.hadoop.fs.Path output)
                              throws java.io.IOException,
                                     java.lang.InterruptedException,
                                     java.lang.ClassNotFoundException
Converts the input documents into token arrays, stored as StringTuples. The input documents must be in SequenceFile format.

Parameters:
input - input directory of the documents in SequenceFile format
output - output directory where the StringTuple token array of each document is to be created
analyzerClass - The Lucene Analyzer for tokenizing the UTF-8 text
Throws:
java.io.IOException
java.lang.ClassNotFoundException
java.lang.InterruptedException


Copyright © 2008-2010 The Apache Software Foundation. All Rights Reserved.