org.apache.mahout.text
Class SequenceFilesFromDirectory
java.lang.Object
org.apache.mahout.text.SequenceFilesFromDirectory
public final class SequenceFilesFromDirectory
- extends java.lang.Object
Converts a directory of text documents into SequenceFiles of a specified chunk size. This class takes a
parent directory containing subfolders of text documents, recursively reads the files, and creates
SequenceFiles of docid => content. The docid is the relative path of the document from the
parent directory, prepended with a specified prefix. You can also specify the input encoding of the text
files. The content of the output SequenceFiles is encoded as UTF-8 text.
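The docid rule and the re-encoding step described above can be sketched in plain Java. This is only an illustration, not the class's actual implementation: the class name DocIdSketch, the helper names docId and toUtf8, and the choice to join prefix and relative path with a single "/" are all assumptions made for the example.

```java
import java.io.File;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper class; it does not exist in Mahout.
final class DocIdSketch {

    // Builds a docid as prefix + "/" + path of the file relative to parentDir.
    // The "/" join is an assumption; the source only says the prefix is prepended.
    static String docId(Path parentDir, Path file, String prefix) {
        String relative = parentDir.relativize(file).toString()
                .replace(File.separatorChar, '/');
        return prefix + "/" + relative;
    }

    // Decodes raw bytes using the input charset and re-encodes them as UTF-8.
    static byte[] toUtf8(byte[] raw, Charset inputCharset) {
        return new String(raw, inputCharset).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Path parent = Paths.get("corpus");
        Path doc = Paths.get("corpus", "sports", "a.txt");
        System.out.println(docId(parent, doc, "/news"));

        // 0xE9 is a single byte in ISO-8859-1; its UTF-8 form takes two bytes.
        byte[] latin1 = {(byte) 0xE9};
        System.out.println(toUtf8(latin1, StandardCharsets.ISO_8859_1).length);
    }
}
```

The sketch separates key construction from content transcoding so that each rule from the description can be checked on its own.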
Method Summary
void createSequenceFiles(java.io.File parentDir, java.lang.String outputDir, java.lang.String prefix, int chunkSizeInMB, java.nio.charset.Charset charset)
static void main(java.lang.String[] args)
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
SequenceFilesFromDirectory
public SequenceFilesFromDirectory()
createSequenceFiles
public void createSequenceFiles(java.io.File parentDir,
java.lang.String outputDir,
java.lang.String prefix,
int chunkSizeInMB,
java.nio.charset.Charset charset)
throws java.io.IOException
- Throws:
java.io.IOException
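The chunkSizeInMB parameter bounds the size of each output SequenceFile, but this page does not say how chunk boundaries are chosen. The following is a minimal sketch of one plausible rollover policy, assuming a new chunk is started once the bytes written to the current chunk exceed the limit. The class ChunkPlannerSketch and its methods are hypothetical and are not part of Mahout.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical planner; illustrates size-based rollover only.
final class ChunkPlannerSketch {

    // Converts the megabyte limit to bytes and delegates to the byte-based planner.
    static List<Integer> assignChunks(List<String[]> records, int chunkSizeInMB) {
        long maxBytes = (long) chunkSizeInMB * 1024 * 1024;
        return assignChunksByBytes(records, maxBytes);
    }

    // Each record is a {docid, content} pair. Returns the chunk index per record.
    // Assumed policy: roll over before a write if the current chunk already
    // holds more than maxBytes of UTF-8 encoded key and value data.
    static List<Integer> assignChunksByBytes(List<String[]> records, long maxBytes) {
        List<Integer> chunkOf = new ArrayList<>();
        int chunk = 0;
        long written = 0;
        for (String[] record : records) {
            if (written > maxBytes) {
                chunk++;
                written = 0;
            }
            chunkOf.add(chunk);
            written += record[0].getBytes(StandardCharsets.UTF_8).length
                    + record[1].getBytes(StandardCharsets.UTF_8).length;
        }
        return chunkOf;
    }
}
```

Under this assumed policy a chunk can exceed the limit by up to one document, since the size check happens before each write and not after it.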
main
public static void main(java.lang.String[] args)
throws java.lang.Exception
- Throws:
java.lang.Exception
Copyright © 2008-2010 The Apache Software Foundation. All Rights Reserved.