org.apache.mahout.text
Class SequenceFilesFromDirectory

java.lang.Object
  extended by org.apache.hadoop.conf.Configured
      extended by org.apache.mahout.common.AbstractJob
          extended by org.apache.mahout.text.SequenceFilesFromDirectory
All Implemented Interfaces:
org.apache.hadoop.conf.Configurable, org.apache.hadoop.util.Tool
Direct Known Subclasses:
SequenceFilesFromDirectoryFilter

public class SequenceFilesFromDirectory
extends AbstractJob

Converts a directory of text documents into SequenceFiles of a specified chunk size. This class takes a parent directory containing subfolders of text documents, recursively reads the files, and creates SequenceFiles of docid => content. The docid is the document's path relative to the parent directory, prepended with a specified prefix. You can also specify the input encoding of the text files. The content of the output SequenceFiles is encoded as UTF-8 text.
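The key and encoding rules described above can be illustrated with a small standalone sketch. It uses only the JDK and does not call Mahout. The class name DocIdSketch, the helper names docId and toUtf8, and the "/" separator between prefix and relative path are assumptions made for illustration; the job's exact key format may differ.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;

public class DocIdSketch {

    // Builds a key from the prefix plus the document's path relative to the
    // parent directory. Assumption: prefix and path are joined with '/';
    // this is an illustration, not the Mahout implementation.
    static String docId(String keyPrefix, Path parentDir, Path document) {
        Path relative = parentDir.relativize(document);
        // Normalise separators so the key looks the same on every platform.
        String rel = relative.toString().replace('\\', '/');
        return keyPrefix + "/" + rel;
    }

    // Decodes raw bytes using the declared input charset and re-encodes the
    // text as UTF-8, mirroring the "input encoding in, UTF-8 out" rule.
    static byte[] toUtf8(byte[] raw, Charset inputCharset) {
        String text = new String(raw, inputCharset);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Path parent = Paths.get("corpus");
        Path doc = Paths.get("corpus", "sports", "doc1.txt");
        System.out.println(docId("news", parent, doc)); // prints news/sports/doc1.txt

        // 0xE9 is 'é' in ISO-8859-1; in UTF-8 it becomes two bytes.
        byte[] utf8 = toUtf8(new byte[] {(byte) 0xE9}, StandardCharsets.ISO_8859_1);
        System.out.println(utf8.length); // prints 2
    }
}
```

In this sketch the key depends only on the prefix and the relative path, so moving the whole parent directory does not change the generated docids.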


Field Summary
static String[] CHARSET_OPTION
           
static String[] CHUNK_SIZE_OPTION
           
static String[] FILE_FILTER_CLASS_OPTION
           
static String[] KEY_PREFIX_OPTION
           
 
Constructor Summary
SequenceFilesFromDirectory()
           
 
Method Summary
protected  void addOptions()
          Override this method in order to add additional options to the command line of the SequenceFilesFromDirectory job.
static void main(String[] args)
           
protected  Map<String,String> parseOptions()
          Override this method in order to parse your additional options from the command line.
 void run(org.apache.hadoop.conf.Configuration conf, String keyPrefix, Map<String,String> options, org.apache.hadoop.fs.Path input, org.apache.hadoop.fs.Path output)
           
 int run(String[] args)
           
 
Methods inherited from class org.apache.mahout.common.AbstractJob
addFlag, addInputOption, addOption, addOption, addOption, addOption, addOutputOption, buildOption, getInputPath, getOption, getOutputPath, hasOption, keyFor, maybePut, parseArguments, parseDirectories, prepareJob, shouldRunNextPhase
 
Methods inherited from class org.apache.hadoop.conf.Configured
getConf, setConf
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf, setConf
 

Field Detail

CHUNK_SIZE_OPTION

public static final String[] CHUNK_SIZE_OPTION

FILE_FILTER_CLASS_OPTION

public static final String[] FILE_FILTER_CLASS_OPTION

KEY_PREFIX_OPTION

public static final String[] KEY_PREFIX_OPTION

CHARSET_OPTION

public static final String[] CHARSET_OPTION
Constructor Detail

SequenceFilesFromDirectory

public SequenceFilesFromDirectory()
Method Detail

run

public void run(org.apache.hadoop.conf.Configuration conf,
                String keyPrefix,
                Map<String,String> options,
                org.apache.hadoop.fs.Path input,
                org.apache.hadoop.fs.Path output)
         throws InstantiationException,
                IllegalAccessException,
                InvocationTargetException,
                IOException,
                NoSuchMethodException,
                ClassNotFoundException
Throws:
InstantiationException
IllegalAccessException
InvocationTargetException
IOException
NoSuchMethodException
ClassNotFoundException

main

public static void main(String[] args)
                 throws Exception
Throws:
Exception

run

public int run(String[] args)
        throws IOException,
               ClassNotFoundException,
               InstantiationException,
               IllegalAccessException,
               NoSuchMethodException,
               InvocationTargetException
Throws:
IOException
ClassNotFoundException
InstantiationException
IllegalAccessException
NoSuchMethodException
InvocationTargetException

addOptions

protected void addOptions()
Override this method in order to add additional options to the command line of the SequenceFilesFromDirectory job. Do not forget to call super.addOptions(); otherwise, none of the standard options (input/output dirs etc.) will be available.


parseOptions

protected Map<String,String> parseOptions()
                                   throws IOException
Override this method in order to parse your additional options from the command line. Do not forget to call super.parseOptions(); otherwise, the standard options (input/output dirs etc.) will not be available.

Throws:
IOException
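
The override pattern for addOptions and parseOptions can be sketched without Mahout on the classpath. BaseJob below is a hypothetical stand-in for SequenceFilesFromDirectory, and the registered set, the commandLine map, and the minLength option are invented for illustration; only the shape of the calls to the superclass methods reflects the advice above.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

public class OverrideSketch {

    // Hypothetical stand-in for the real job class; not the Mahout API.
    static class BaseJob {
        protected final Set<String> registered = new LinkedHashSet<>();
        protected final Map<String, String> commandLine = new LinkedHashMap<>();

        // Registers the standard options.
        protected void addOptions() {
            registered.add("input");
            registered.add("output");
        }

        // Parses only the standard options.
        protected Map<String, String> parseOptions() {
            Map<String, String> options = new LinkedHashMap<>();
            options.put("input", commandLine.get("input"));
            options.put("output", commandLine.get("output"));
            return options;
        }
    }

    // A subclass that adds one extra (hypothetical) option.
    static class FilteringJob extends BaseJob {
        @Override
        protected void addOptions() {
            super.addOptions();          // keeps input/output registered
            registered.add("minLength"); // extra option, invented for this sketch
        }

        @Override
        protected Map<String, String> parseOptions() {
            Map<String, String> options = super.parseOptions(); // standard options first
            options.put("minLength", commandLine.get("minLength"));
            return options;
        }
    }

    public static void main(String[] args) {
        FilteringJob job = new FilteringJob();
        job.commandLine.put("input", "/docs");
        job.commandLine.put("output", "/seq");
        job.commandLine.put("minLength", "10");
        job.addOptions();
        System.out.println(job.registered);     // prints [input, output, minLength]
        System.out.println(job.parseOptions()); // prints {input=/docs, output=/seq, minLength=10}
    }
}
```

If either call to the superclass method were removed in this sketch, the input and output entries would be missing from the corresponding result, which is the failure the method descriptions warn about.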


Copyright © 2008-2011 The Apache Software Foundation. All Rights Reserved.