org.apache.mahout.text
Class SequenceFilesFromDirectory

java.lang.Object
  extended by org.apache.hadoop.conf.Configured
      extended by org.apache.mahout.common.AbstractJob
          extended by org.apache.mahout.text.SequenceFilesFromDirectory
All Implemented Interfaces:
org.apache.hadoop.conf.Configurable, org.apache.hadoop.util.Tool

public class SequenceFilesFromDirectory
extends AbstractJob

Converts a directory of text documents into SequenceFiles of a specified chunkSize. This class takes a parent directory containing subfolders of text documents, reads the files recursively, and writes SequenceFiles that map docid => content. The docid is the document's path relative to the parent directory, prepended with a specified prefix. You can also specify the input encoding of the text files. The content of the output SequenceFiles is encoded as UTF-8 text.
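
The docid and encoding rules described above can be illustrated with a small standalone sketch. It uses only the JDK, not the Mahout or Hadoop classes, and it is not the class's actual implementation. The class name DocIdSketch, the helper names, and the "/" separator between prefix and relative path are assumptions made for illustration.

```java
import java.io.File;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical illustration only; not part of org.apache.mahout.text.
class DocIdSketch {

    // Builds a key from the prefix plus the document's path relative to the
    // parent directory. Joining with "/" is an assumption of this sketch.
    static String docId(String prefix, Path parentDir, Path document) {
        Path relative = parentDir.relativize(document);
        // Normalize the platform separator so keys look the same everywhere.
        String relativeName = relative.toString().replace(File.separatorChar, '/');
        return prefix + "/" + relativeName;
    }

    // Decodes the raw bytes with the declared input charset and re-encodes
    // the resulting text as UTF-8.
    static byte[] toUtf8(byte[] raw, Charset inputCharset) {
        String text = new String(raw, inputCharset);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Path parent = Paths.get("corpus");
        Path doc = Paths.get("corpus", "sports", "doc1.txt");
        System.out.println(docId("news", parent, doc));

        // "caf\u00e9" takes 4 bytes in ISO-8859-1 and 5 bytes in UTF-8.
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = toUtf8(latin1, StandardCharsets.ISO_8859_1);
        System.out.println(latin1.length + " -> " + utf8.length);
    }
}
```

In this sketch, a file at corpus/sports/doc1.txt under the parent directory corpus, with prefix "news", yields the key news/sports/doc1.txt. The real job may join the prefix and the path differently.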


Constructor Summary
SequenceFilesFromDirectory()
           
 
Method Summary
protected  void addOptions()
          Override this method to add options to the command line of the SequenceFilesFromDirectory job.
static void main(String[] args)
           
protected  Map<String,String> parseOptions()
          Override this method to parse your additional options from the command line.
 int run(String[] args)
           
 
Methods inherited from class org.apache.mahout.common.AbstractJob
addFlag, addInputOption, addOption, addOption, addOption, addOption, addOutputOption, buildOption, getAnalyzerClassFromOption, getCLIOption, getCombinedTempPath, getGroup, getInputPath, getOption, getOption, getOutputPath, getOutputPath, getTempPath, getTempPath, hasOption, keyFor, maybePut, parseArguments, parseDirectories, prepareJob, prepareJob, prepareJob, setS3SafeCombinedInputPath, shouldRunNextPhase
 
Methods inherited from class org.apache.hadoop.conf.Configured
getConf, setConf
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf, setConf
 

Constructor Detail

SequenceFilesFromDirectory

public SequenceFilesFromDirectory()
Method Detail

main

public static void main(String[] args)
                 throws Exception
Throws:
Exception

run

public int run(String[] args)
        throws Exception
Throws:
Exception

addOptions

protected void addOptions()
Override this method to add options to the command line of the SequenceFilesFromDirectory job. Remember to call super.addOptions(); otherwise none of the standard options (input/output directories, etc.) will be available.


parseOptions

protected Map<String,String> parseOptions()
Override this method to parse your additional options from the command line. Remember to call super.parseOptions(); otherwise the standard options (input/output directories, etc.) will not be available.
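
The reason for calling the superclass version in both hooks can be shown with a toy model. The sketch below does not use the Mahout classes: BaseJob, ExtendedJob, the declared and commandLine fields, and the "language" option are all hypothetical stand-ins chosen for illustration. Only the method names addOptions and parseOptions and the Map<String,String> return type mirror this page.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical model of the two hooks; not the Mahout API.
class OverrideHooksSketch {

    // Stand-in for the job class: registers and parses the standard options.
    static class BaseJob {
        protected final Set<String> declared = new LinkedHashSet<>();
        protected final Map<String, String> commandLine = new HashMap<>();

        protected void addOptions() {
            declared.add("input");
            declared.add("output");
        }

        protected Map<String, String> parseOptions() {
            Map<String, String> options = new HashMap<>();
            options.put("input", commandLine.get("input"));
            options.put("output", commandLine.get("output"));
            return options;
        }
    }

    // Subclass adding one extra option, "language" (a made-up name).
    static class ExtendedJob extends BaseJob {
        @Override
        protected void addOptions() {
            super.addOptions(); // keeps input/output registered
            declared.add("language");
        }

        @Override
        protected Map<String, String> parseOptions() {
            // Start from the parent's map so input/output values survive.
            Map<String, String> options = super.parseOptions();
            options.put("language", commandLine.get("language"));
            return options;
        }
    }

    public static void main(String[] args) {
        ExtendedJob job = new ExtendedJob();
        job.commandLine.put("input", "docs");
        job.commandLine.put("output", "seq");
        job.commandLine.put("language", "en");
        job.addOptions();
        System.out.println(job.declared);
        System.out.println(job.parseOptions().keySet().containsAll(job.declared));
    }
}
```

If either super call is dropped in this model, the "input" and "output" entries disappear from the corresponding result, which is the failure the notes above warn about.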



Copyright © 2008-2012 The Apache Software Foundation. All Rights Reserved.