org.apache.mahout.text
Class SequenceFilesFromDirectory
java.lang.Object
org.apache.mahout.text.SequenceFilesFromDirectory
public final class SequenceFilesFromDirectory
- extends Object
Converts a directory of text documents into SequenceFiles of a specified chunk size. This class takes a
parent directory containing subfolders of text documents, recursively reads the files, and creates
SequenceFiles of docid => content. The docid is the relative path of the document from the
parent directory, prepended with a specified prefix. You can also specify the input encoding of the text
files. The content of the output SequenceFiles is encoded as UTF-8 text.
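The docid => content mapping described above can be illustrated with a minimal, standalone sketch. It does not use or reproduce this class's implementation: the class name DocIdSketch, the helpers docId and toUtf8, and the use of '/' to join the prefix and path segments are all assumptions made for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;

public class DocIdSketch {

    // Hypothetical helper: builds a key from the prefix followed by the
    // document's path relative to the parent directory.
    // Assumption: segments are joined with '/'; the real key format may differ.
    static String docId(Path parentDir, Path document, String prefix) {
        Path relative = parentDir.relativize(document);
        StringBuilder id = new StringBuilder(prefix);
        for (Path part : relative) {
            id.append('/').append(part.toString());
        }
        return id.toString();
    }

    // Hypothetical helper: decodes raw bytes using the input charset and
    // re-encodes the text as UTF-8, mirroring the encoding note above.
    static byte[] toUtf8(byte[] raw, Charset inputCharset) {
        return new String(raw, inputCharset).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Path parent = Paths.get("corpus");
        Path doc = Paths.get("corpus", "sports", "doc1.txt");
        System.out.println(docId(parent, doc, "news")); // prints news/sports/doc1.txt
    }
}
```

Under these assumptions, a document at corpus/sports/doc1.txt with prefix "news" gets the key news/sports/doc1.txt, and a single ISO-8859-1 byte such as 0xE9 becomes two bytes once re-encoded as UTF-8.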
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
SequenceFilesFromDirectory
public SequenceFilesFromDirectory()
createSequenceFiles
public void createSequenceFiles(File parentDir,
                                String outputDir,
                                String prefix,
                                int chunkSizeInMB,
                                Charset charset)
                         throws IOException
- Throws:
IOException
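The chunkSizeInMB parameter suggests that output is split into multiple files by size. This page does not document the rollover policy, so the following is only a sketch of one plausible policy: start a new chunk when adding the next document would exceed the limit. The class name ChunkSketch and the helper assignChunks are hypothetical and not part of Mahout.

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkSketch {

    // Hypothetical helper: given document sizes in bytes, returns the index
    // of the chunk each document would land in.
    // Assumption: a new chunk starts when the current one is non-empty and
    // adding the next document would exceed chunkSizeInMB megabytes.
    static List<Integer> assignChunks(long[] docSizes, int chunkSizeInMB) {
        long limit = chunkSizeInMB * 1024L * 1024L;
        List<Integer> chunkOf = new ArrayList<>();
        int chunk = 0;
        long used = 0;
        for (long size : docSizes) {
            if (used > 0 && used + size > limit) {
                chunk++;   // roll over to a fresh chunk
                used = 0;
            }
            used += size;
            chunkOf.add(chunk);
        }
        return chunkOf;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long[] sizes = {mb / 2, mb / 2, mb / 4, 2 * mb, mb / 4};
        System.out.println(assignChunks(sizes, 1)); // prints [0, 0, 1, 2, 3]
    }
}
```

In this sketch a document larger than the limit still gets a chunk of its own rather than being split; whether the real class behaves that way is not stated here.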
main
public static void main(String[] args)
                 throws Exception
- Throws:
Exception
Copyright © 2008-2010 The Apache Software Foundation. All Rights Reserved.