org.apache.mahout.utils.vectors.text
Class DocumentProcessor
java.lang.Object
org.apache.mahout.utils.vectors.text.DocumentProcessor
public final class DocumentProcessor
- extends java.lang.Object
This class converts a set of input documents into the sequence file format of StringTuples. The SequenceFile input should have a Text key containing the unique document identifier and a Text value containing the whole document. The document should be stored in UTF-8 encoding, which is recognizable by Hadoop. It uses the given Analyzer to process the document into Tokens.
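The key/value contract described above can be illustrated with a small self-contained sketch. The `analyze` helper below is a hypothetical stand-in for a Lucene Analyzer (it just lowercases and splits on whitespace); the maps stand in for the input SequenceFile of Text/Text pairs and the output SequenceFile of Text/StringTuple pairs.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TokenizeSketch {

    // Hypothetical stand-in for a Lucene Analyzer:
    // lowercases the document and splits it on whitespace.
    static List<String> analyze(String document) {
        return Arrays.asList(document.toLowerCase().split("\\s+"));
    }

    public static void main(String[] args) {
        // Mirrors the SequenceFile input contract:
        // Text key = unique document id, Text value = whole document (UTF-8).
        Map<String, String> input = new LinkedHashMap<>();
        input.put("doc1", "Mahout in Action");
        input.put("doc2", "Taming Text");

        // Mirrors the SequenceFile<Text, StringTuple> output:
        // document id -> token array produced by the analyzer.
        Map<String, List<String>> tokenized = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : input.entrySet()) {
            tokenized.put(e.getKey(), analyze(e.getValue()));
        }
        System.out.println(tokenized);
    }
}
```

This is only a single-process illustration of the transformation; the real DocumentProcessor performs it as a Hadoop job over SequenceFiles.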
Method Summary |
static void |
tokenizeDocuments(java.lang.String input,
java.lang.Class<? extends org.apache.lucene.analysis.Analyzer> analyzerClass,
java.lang.String output)
Converts the input documents into StringTuple token arrays. The input documents have to be
in the SequenceFile format |
Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
TOKENIZED_DOCUMENT_OUTPUT_FOLDER
public static final java.lang.String TOKENIZED_DOCUMENT_OUTPUT_FOLDER
- See Also:
- Constant Field Values
ANALYZER_CLASS
public static final java.lang.String ANALYZER_CLASS
- See Also:
- Constant Field Values
CHARSET
public static final java.nio.charset.Charset CHARSET
tokenizeDocuments
public static void tokenizeDocuments(java.lang.String input,
java.lang.Class<? extends org.apache.lucene.analysis.Analyzer> analyzerClass,
java.lang.String output)
throws java.io.IOException
- Converts the input documents into StringTuple token arrays. The input documents have to be in the SequenceFile format.
- Parameters:
input
- input directory of the documents in SequenceFile format
analyzerClass
- the Lucene Analyzer used for tokenizing the UTF-8 text
output
- output directory where the StringTuple token array of each document is created
- Throws:
java.io.IOException
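A minimal driver sketch for the method above, assuming Mahout and Lucene are on the classpath; the directory names are hypothetical examples, and StandardAnalyzer is just one possible Analyzer choice.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.mahout.utils.vectors.text.DocumentProcessor;

public class TokenizeDriver {
    public static void main(String[] args) throws java.io.IOException {
        // "docs-seqfile" is a hypothetical input directory holding
        // SequenceFile<Text, Text> pairs of (document id, document body).
        // "tokenized-docs" is where the StringTuple token arrays land.
        DocumentProcessor.tokenizeDocuments(
            "docs-seqfile",
            StandardAnalyzer.class,
            "tokenized-docs");
    }
}
```

The output directory can then be fed to downstream vectorization steps that consume SequenceFile<Text, StringTuple> input.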
Copyright © 2008-2010 The Apache Software Foundation. All Rights Reserved.