org.apache.mahout.text
Class SequenceFilesFromDirectory
java.lang.Object
org.apache.hadoop.conf.Configured
org.apache.mahout.common.AbstractJob
org.apache.mahout.text.SequenceFilesFromDirectory
- All Implemented Interfaces:
- org.apache.hadoop.conf.Configurable, org.apache.hadoop.util.Tool
- Direct Known Subclasses:
- SequenceFilesFromDirectoryFilter
public class SequenceFilesFromDirectory
- extends AbstractJob
Converts a directory of text documents into SequenceFiles of a specified chunk size. This class takes a
parent directory containing subfolders of text documents, recursively reads the files, and creates
SequenceFiles of docid => content. The docid is the relative path of the document from the
parent directory, prepended with a specified prefix. You can also specify the input encoding of the text
files. The content of the output SequenceFiles is encoded as UTF-8 text.
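The key and encoding rules described above can be illustrated with a minimal, JDK-only sketch. It is not the Mahout implementation: the class `DocIdSketch`, its methods, and the choice of `/` to join the prefix and the relative path are assumptions made for illustration, and it collects results in a map rather than writing SequenceFiles.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical illustration class; not part of Mahout.
final class DocIdSketch {

    // Builds a docid: the prefix followed by the file's path relative to the parent directory.
    // Joining with "/" is an assumption of this sketch.
    static String docId(String keyPrefix, Path parentDir, Path file) {
        String relative = parentDir.relativize(file).toString().replace('\\', '/');
        return keyPrefix + "/" + relative;
    }

    // Decodes bytes using the input charset and re-encodes them as UTF-8.
    static byte[] toUtf8(byte[] raw, Charset inputCharset) {
        return new String(raw, inputCharset).getBytes(StandardCharsets.UTF_8);
    }

    // Recursively reads every regular file under parentDir into a docid => content map.
    static Map<String, String> readAll(Path parentDir, String keyPrefix, Charset inputCharset)
            throws IOException {
        List<Path> files;
        try (Stream<Path> paths = Files.walk(parentDir)) {
            files = paths.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        Map<String, String> docs = new TreeMap<>();
        for (Path file : files) {
            byte[] utf8 = toUtf8(Files.readAllBytes(file), inputCharset);
            docs.put(docId(keyPrefix, parentDir, file), new String(utf8, StandardCharsets.UTF_8));
        }
        return docs;
    }

    public static void main(String[] args) throws IOException {
        // Create a tiny corpus: <tmp>/sports/a.txt, stored as ISO-8859-1.
        Path parent = Files.createTempDirectory("docs");
        Path sub = Files.createDirectories(parent.resolve("sports"));
        Files.write(sub.resolve("a.txt"), "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1));

        readAll(parent, "corpus", StandardCharsets.ISO_8859_1)
            .forEach((key, value) -> System.out.println(key + " => " + value));
    }
}
```

In this sketch the single document is keyed as `corpus/sports/a.txt`, and its one-byte ISO-8859-1 `é` becomes two bytes once re-encoded as UTF-8.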
Method Summary
protected void addOptions()
    Override this method in order to add additional options to the command line of the SequenceFilesFromDirectory job.
static void main(String[] args)
protected Map<String,String> parseOptions()
    Override this method in order to parse your additional options from the command line.
void run(org.apache.hadoop.conf.Configuration conf, String keyPrefix, Map<String,String> options, org.apache.hadoop.fs.Path input, org.apache.hadoop.fs.Path output)
int run(String[] args)
Methods inherited from class org.apache.mahout.common.AbstractJob
addFlag, addInputOption, addOption, addOption, addOption, addOption, addOutputOption, buildOption, getInputPath, getOption, getOutputPath, hasOption, keyFor, maybePut, parseArguments, parseDirectories, prepareJob, shouldRunNextPhase
Methods inherited from class org.apache.hadoop.conf.Configured
getConf, setConf
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf, setConf
CHUNK_SIZE_OPTION
public static final String[] CHUNK_SIZE_OPTION
FILE_FILTER_CLASS_OPTION
public static final String[] FILE_FILTER_CLASS_OPTION
KEY_PREFIX_OPTION
public static final String[] KEY_PREFIX_OPTION
CHARSET_OPTION
public static final String[] CHARSET_OPTION
SequenceFilesFromDirectory
public SequenceFilesFromDirectory()
run
public void run(org.apache.hadoop.conf.Configuration conf,
String keyPrefix,
Map<String,String> options,
org.apache.hadoop.fs.Path input,
org.apache.hadoop.fs.Path output)
throws InstantiationException,
IllegalAccessException,
InvocationTargetException,
IOException,
NoSuchMethodException,
ClassNotFoundException
- Throws:
InstantiationException
IllegalAccessException
InvocationTargetException
IOException
NoSuchMethodException
ClassNotFoundException
main
public static void main(String[] args)
throws Exception
- Throws:
Exception
run
public int run(String[] args)
throws IOException,
ClassNotFoundException,
InstantiationException,
IllegalAccessException,
NoSuchMethodException,
InvocationTargetException
- Throws:
IOException
ClassNotFoundException
InstantiationException
IllegalAccessException
NoSuchMethodException
InvocationTargetException
addOptions
protected void addOptions()
- Override this method to add options to the command line of the SequenceFilesFromDirectory job.
Do not forget to call super.addOptions(); otherwise the standard options (input/output directories, etc.) will not be available.
parseOptions
protected Map<String,String> parseOptions()
throws IOException
- Override this method to parse your additional options from the command line. Do not forget to call
super.parseOptions(); otherwise the standard options (input/output directories, etc.) will not be available.
- Throws:
IOException
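The override contract of addOptions() and parseOptions() can be sketched as follows. To keep the example runnable without Hadoop or Mahout on the classpath, `StandInJob` is a hypothetical stand-in that only models the two hooks; its fields, its constructor, and the `language` option are assumptions for illustration, and unlike the real parseOptions() it declares no IOException. A real subclass would extend SequenceFilesFromDirectory and follow the same pattern.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the job class; models only the two overridable hooks.
class StandInJob {
    protected final Set<String> declared = new LinkedHashSet<>();  // option names the job accepts
    protected final Map<String, String> commandLine;               // pretend already-split arguments

    StandInJob(Map<String, String> commandLine) {
        this.commandLine = commandLine;
    }

    // Registers the standard options.
    protected void addOptions() {
        declared.add("input");
        declared.add("output");
    }

    // Reads the standard options from the command line.
    protected Map<String, String> parseOptions() {
        Map<String, String> options = new LinkedHashMap<>();
        options.put("input", commandLine.get("input"));
        options.put("output", commandLine.get("output"));
        return options;
    }
}

// A subclass that adds one extra option while keeping the standard ones.
class LanguageAwareJob extends StandInJob {
    LanguageAwareJob(Map<String, String> commandLine) {
        super(commandLine);
    }

    @Override
    protected void addOptions() {
        super.addOptions();        // keep input/output registered
        declared.add("language");  // hypothetical extra option
    }

    @Override
    protected Map<String, String> parseOptions() {
        Map<String, String> options = super.parseOptions();  // standard options first
        options.put("language", commandLine.getOrDefault("language", "en"));
        return options;
    }

    public static void main(String[] args) {
        Map<String, String> argv = new HashMap<>();
        argv.put("input", "/data/in");
        argv.put("output", "/data/out");

        LanguageAwareJob job = new LanguageAwareJob(argv);
        job.addOptions();
        System.out.println(job.declared);
        System.out.println(job.parseOptions());
    }
}
```

If either override skipped its super call, `input` and `output` would be missing from the declared options or from the parsed map, which is the failure the notes above warn about.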
Copyright © 2008-2011 The Apache Software Foundation. All Rights Reserved.