org.apache.mahout.text
Class SequenceFilesFromDirectory

java.lang.Object
  extended by org.apache.mahout.text.SequenceFilesFromDirectory

public final class SequenceFilesFromDirectory
extends Object

Converts a directory of text documents into SequenceFiles of a specified chunk size. This class takes a parent directory containing subfolders of text documents, recursively reads the files, and creates SequenceFiles of docid => content. The docid is the document's path relative to the parent directory, prepended with a specified prefix. You can also specify the input encoding of the text files. The content of the output SequenceFiles is encoded as UTF-8 text.
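
The docid and encoding rules described above can be illustrated with a small JDK-only sketch. It is not Mahout code: the class DocIdSketch and its methods docId and toUtf8 are hypothetical names, and the "/" separator between the prefix and the relative path is an assumption, since the description only says the prefix is prepended.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Mahout: shows how a docid and UTF-8
// content could be derived in the way the class description states.
final class DocIdSketch {

    // docid = prefix + "/" + path of the document relative to the parent directory.
    // The "/" separator is an assumption made for this sketch.
    static String docId(Path parentDir, Path document, String prefix) {
        Path relative = parentDir.relativize(document);
        StringBuilder id = new StringBuilder(prefix);
        for (Path part : relative) {
            id.append('/').append(part.toString());
        }
        return id.toString();
    }

    // Decode the raw bytes with the input charset, then re-encode them as UTF-8.
    static byte[] toUtf8(byte[] raw, Charset inputCharset) {
        return new String(raw, inputCharset).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Path parent = Paths.get("corpus");
        Path doc = Paths.get("corpus", "sports", "doc1.txt");
        // prints "news/sports/doc1.txt"
        System.out.println(docId(parent, doc, "news"));
    }
}
```

Under these assumptions, a file at sports/doc1.txt below the parent directory and a prefix of "news" give the key news/sports/doc1.txt, and the stored value is the file's text re-encoded as UTF-8 regardless of the input charset.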


Nested Class Summary
static class SequenceFilesFromDirectory.ChunkedWriter
           
 class SequenceFilesFromDirectory.PrefixAdditionFilter
           
 
Constructor Summary
SequenceFilesFromDirectory()
           
 
Method Summary
 void createSequenceFiles(File parentDir, String outputDir, String prefix, int chunkSizeInMB, Charset charset)
           
static void main(String[] args)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

SequenceFilesFromDirectory

public SequenceFilesFromDirectory()
Method Detail

createSequenceFiles

public void createSequenceFiles(File parentDir,
                                String outputDir,
                                String prefix,
                                int chunkSizeInMB,
                                Charset charset)
                         throws IOException
Throws:
IOException
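
The chunkSizeInMB parameter suggests that output is split into several files once a size budget is reached. The following is a minimal sketch of one way such a rollover could work, using in-memory lists instead of SequenceFiles. It is not the implementation of SequenceFilesFromDirectory.ChunkedWriter: the class ChunkingSketch, its methods, and the exact rollover rule (start a new chunk when the next record would exceed the budget) are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not Mahout's ChunkedWriter.
final class ChunkingSketch {
    private final long maxChunkBytes;
    private final List<List<String>> chunks = new ArrayList<>();
    private long currentBytes;

    ChunkingSketch(long maxChunkBytes) {
        this.maxChunkBytes = maxChunkBytes;
        chunks.add(new ArrayList<>());
    }

    // Record a docid => content pair. If the current chunk is non-empty and
    // the record would push it past the byte budget, start a new chunk first.
    void write(String docId, String content) {
        long recordBytes = docId.getBytes(StandardCharsets.UTF_8).length
                + content.getBytes(StandardCharsets.UTF_8).length;
        List<String> current = chunks.get(chunks.size() - 1);
        if (!current.isEmpty() && currentBytes + recordBytes > maxChunkBytes) {
            current = new ArrayList<>();
            chunks.add(current);
            currentBytes = 0;
        }
        current.add(docId);
        currentBytes += recordBytes;
    }

    int chunkCount() {
        return chunks.size();
    }

    // Docids stored in chunk i, in insertion order.
    List<String> chunk(int i) {
        return chunks.get(i);
    }

    public static void main(String[] args) {
        int chunkSizeInMB = 64; // same unit as the createSequenceFiles parameter
        ChunkingSketch writer = new ChunkingSketch(chunkSizeInMB * 1024L * 1024L);
        writer.write("news/sports/doc1.txt", "first document");
        writer.write("news/sports/doc2.txt", "second document");
        System.out.println(writer.chunkCount());
    }
}
```

Measuring each record by its UTF-8 byte length matches the stated output encoding; a real writer would also account for SequenceFile overhead, which this sketch ignores.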

main

public static void main(String[] args)
                 throws Exception
Throws:
Exception


Copyright © 2008-2010 The Apache Software Foundation. All Rights Reserved.