org.apache.mahout.utils.vectors.text
Class DocumentProcessor

java.lang.Object
  extended by org.apache.mahout.utils.vectors.text.DocumentProcessor

public final class DocumentProcessor
extends java.lang.Object

This class converts a set of input documents into the SequenceFile format of StringTuples. The input SequenceFile should have a Text key containing the unique document identifier and a Text value containing the whole document, stored in UTF-8 encoding so that Hadoop can read it. The given Analyzer is used to process each document into Tokens.
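As a rough sketch of how such input might be prepared (the class name, path, and document content below are illustrative only, using the Hadoop SequenceFile writer API), a document could be written as a Text key/value pair like this:

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.io.SequenceFile;
 import org.apache.hadoop.io.Text;

 public class WriteInputDocs {
   public static void main(String[] args) throws Exception {
     Configuration conf = new Configuration();
     FileSystem fs = FileSystem.get(conf);
     Path input = new Path("docs-seqfile");   // hypothetical input path
     SequenceFile.Writer writer =
         SequenceFile.createWriter(fs, conf, input, Text.class, Text.class);
     try {
       // Text key = unique document identifier, Text value = whole document (UTF-8)
       writer.append(new Text("doc-1"), new Text("The quick brown fox ..."));
     } finally {
       writer.close();
     }
   }
 }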


Field Summary
static java.lang.String ANALYZER_CLASS
           
static java.nio.charset.Charset CHARSET
           
static java.lang.String TOKENIZED_DOCUMENT_OUTPUT_FOLDER
           
 
Method Summary
static void tokenizeDocuments(java.lang.String input, java.lang.Class<? extends org.apache.lucene.analysis.Analyzer> analyzerClass, java.lang.String output)
          Converts the input documents into token arrays using StringTuple. The input documents must be in the SequenceFile format.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

TOKENIZED_DOCUMENT_OUTPUT_FOLDER

public static final java.lang.String TOKENIZED_DOCUMENT_OUTPUT_FOLDER
See Also:
Constant Field Values

ANALYZER_CLASS

public static final java.lang.String ANALYZER_CLASS
See Also:
Constant Field Values

CHARSET

public static final java.nio.charset.Charset CHARSET

Method Detail

tokenizeDocuments

public static void tokenizeDocuments(java.lang.String input,
                                     java.lang.Class<? extends org.apache.lucene.analysis.Analyzer> analyzerClass,
                                     java.lang.String output)
                              throws java.io.IOException
Converts the input documents into token arrays using StringTuple. The input documents must be in the SequenceFile format.

Parameters:
input - input directory of the documents in SequenceFile format
output - output directory where the StringTuple token array of each document will be created
analyzerClass - The Lucene Analyzer for tokenizing the UTF-8 text
Throws:
java.io.IOException
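
A minimal usage sketch (the paths and the choice of StandardAnalyzer below are illustrative assumptions, not prescribed by this class) might look like:

 import org.apache.lucene.analysis.standard.StandardAnalyzer;
 import org.apache.mahout.utils.vectors.text.DocumentProcessor;

 public class TokenizeDocs {
   public static void main(String[] args) throws Exception {
     // Hypothetical paths: an input directory of SequenceFiles (Text key, Text value)
     // and an output directory for the StringTuple token arrays.
     DocumentProcessor.tokenizeDocuments("docs-seqfile",
                                         StandardAnalyzer.class,
                                         "tokenized-documents");
   }
 }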


Copyright © 2008-2010 The Apache Software Foundation. All Rights Reserved.