org.apache.mahout.classifier.bayes
Class SplitBayesInput

java.lang.Object
  extended by org.apache.mahout.classifier.bayes.SplitBayesInput

public class SplitBayesInput
extends Object

A utility for splitting files in the input format used by the Bayes classifiers into training and test sets in order to perform cross-validation. This class is not strictly confined to working with the Bayes classifier input. It can be used for any input files where each line is a complete sample.

This class can be used to split directories of files or individual files into training and test sets using a number of different methods.

When executed via splitDirectory(Path) or splitFile(Path), the lines read from one or more input files are written to files of the same name in the directories specified by the setTestOutputDirectory(Path) and setTrainingOutputDirectory(Path) methods.

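The composition of the test set is determined using one of the following approaches:

- A contiguous block of lines, sized either as a fixed number of lines with setTestSplitSize(int) or as a percentage of the input with setTestSplitPct(int). The position of the block is controlled by setSplitLocation(int).
- A random sample of lines, sized either as a fixed number of lines with setTestRandomSelectionSize(int) or as a percentage of the input with setTestRandomSelectionPct(int).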

Any one of the methods above can be used to control the size of the test set, but only one of them may be used at a time. If more than one is called, a runtime exception is thrown at execution time.

The setSplitLocation(int) method is passed an integer from 0 to 100 (inclusive) which is translated into the position of the start of the test data within the input file.

Given:

The start of the split will always be adjusted forwards in order to ensure that the desired test set size is allocated. Split location has no effect if random sampling is employed.
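
The translation from split location to a starting line can be illustrated with a minimal sketch. It is not the Mahout implementation: the class name SplitStartSketch, the method testStart, the truncating division, and the clamping rule are all assumptions made for illustration, as are the 1500-line input and 150-line test set used in main.

```java
/** Illustrative only: not the Mahout implementation. */
final class SplitStartSketch {

    /**
     * Returns the zero-based index of the first test line.
     * Assumption: the split location is a percentage of the line count, and
     * the start is clamped so that testSize lines always fit inside the file.
     */
    static int testStart(int lineCount, int testSize, int splitLocationPct) {
        if (splitLocationPct < 0 || splitLocationPct > 100) {
            throw new IllegalArgumentException("splitLocationPct must be in [0, 100]");
        }
        if (testSize < 0 || testSize > lineCount) {
            throw new IllegalArgumentException("testSize must be in [0, lineCount]");
        }
        // Long arithmetic so that lineCount * pct cannot overflow an int.
        int start = (int) ((long) lineCount * splitLocationPct / 100);
        // Clamp so that the range [start, start + testSize) stays within the file.
        return Math.min(start, lineCount - testSize);
    }

    public static void main(String[] args) {
        int lineCount = 1500; // hypothetical input size
        int testSize = 150;   // hypothetical test set size (10 percent)
        for (int pct : new int[] {0, 25, 100}) {
            int start = testStart(lineCount, testSize, pct);
            System.out.println(pct + " -> lines [" + start + ", " + (start + testSize) + ")");
        }
    }
}
```

Under these assumptions, a split location of 100 cannot start at the very last line, so the start is moved to the point where the full test set still fits.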


Nested Class Summary
static interface SplitBayesInput.SplitCallback
          Used to pass information back to a caller once a file has been split without the need for a data object
 
Constructor Summary
SplitBayesInput()
           
 
Method Summary
static int countLines(org.apache.hadoop.fs.FileSystem fs, org.apache.hadoop.fs.Path inputFile, Charset charset)
          Counts the lines in the specified file, where a line is whatever BufferedReader.readLine() returns.
 SplitBayesInput.SplitCallback getCallback()
           
 Charset getCharset()
           
 org.apache.hadoop.fs.Path getInputDirectory()
           
 int getSplitLocation()
           
 org.apache.hadoop.fs.Path getTestOutputDirectory()
           
 int getTestRandomSelectionPct()
           
 int getTestRandomSelectionSize()
           
 int getTestSplitPct()
           
 int getTestSplitSize()
           
 org.apache.hadoop.fs.Path getTrainingOutputDirectory()
           
static void main(String[] args)
           
 boolean parseArgs(String[] args)
          Configure this instance based on the command-line arguments contained within the provided array.
 void setCallback(SplitBayesInput.SplitCallback callback)
          Sets the callback used to inform the caller that an input file has been successfully split
 void setCharset(Charset charset)
          Set the charset used to read and write files
 void setInputDirectory(org.apache.hadoop.fs.Path inputDir)
          Set the directory from which input data will be read when the splitDirectory() method is invoked.
 void setSplitLocation(int splitLocation)
          Set the location of the start of the test/training data split.
 void setTestOutputDirectory(org.apache.hadoop.fs.Path testOutputDir)
          Set the directory to which test data will be written.
 void setTestRandomSelectionPct(int randomSelectionPct)
          Sets the number of random input samples that will be saved to the test set as a percentage of the size of the input set.
 void setTestRandomSelectionSize(int testRandomSelectionSize)
          Sets the number of random input samples that will be saved to the test set.
 void setTestSplitPct(int testSplitPct)
          Sets the percentage of the input data to allocate to the test split
 void setTestSplitSize(int testSplitSize)
           
 void setTrainingOutputDirectory(org.apache.hadoop.fs.Path trainingOutputDir)
          Set the directory to which training data will be written.
 void splitDirectory()
          Perform a split on the directory specified by setInputDirectory(Path) by calling splitFile(Path) on each file found within that directory.
 void splitDirectory(org.apache.hadoop.fs.Path inputDir)
          Perform a split on the specified directory by calling splitFile(Path) on each file found within that directory.
 void splitFile(org.apache.hadoop.fs.Path inputFile)
          Perform a split on the specified input file.
 void validate()
          Validates that the current instance is in a consistent state
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

SplitBayesInput

public SplitBayesInput()
                throws IOException
Throws:
IOException
Method Detail

main

public static void main(String[] args)
                 throws Exception
Throws:
Exception

parseArgs

public boolean parseArgs(String[] args)
                  throws Exception
Configure this instance based on the command-line arguments contained within the provided array. Calls validate() to ensure consistency of configuration.

Returns:
true if the arguments were parsed successfully and execution should proceed.
Throws:
Exception - if there is a problem parsing the command-line arguments or the particular combination would violate class invariants.

splitDirectory

public void splitDirectory()
                    throws IOException
Perform a split on the directory specified by setInputDirectory(Path) by calling splitFile(Path) on each file found within that directory.

Throws:
IOException

splitDirectory

public void splitDirectory(org.apache.hadoop.fs.Path inputDir)
                    throws IOException
Perform a split on the specified directory by calling splitFile(Path) on each file found within that directory.

Throws:
IOException

splitFile

public void splitFile(org.apache.hadoop.fs.Path inputFile)
               throws IOException
Perform a split on the specified input file. Results will be written to files of the same name in the specified training and test output directories. The validate() method is called prior to executing the split.

Throws:
IOException
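
The behaviour described for splitFile (same file name, two output directories) can be sketched with the standard library alone. This is not the Mahout code: it uses java.nio.file instead of the Hadoop FileSystem API, reads the whole file into memory for brevity, and the names ContiguousSplitSketch, split, testStart and testSize are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: a contiguous train/test split using java.nio.file. */
final class ContiguousSplitSketch {

    /**
     * Writes lines [testStart, testStart + testSize) of inputFile to testDir
     * and all remaining lines to trainingDir, keeping the input file's name.
     */
    static void split(Path inputFile, Path trainingDir, Path testDir,
                      int testStart, int testSize, Charset charset) throws IOException {
        List<String> lines = Files.readAllLines(inputFile, charset);
        if (testStart < 0 || testSize < 0 || testStart + testSize > lines.size()) {
            throw new IllegalArgumentException("test range does not fit in the input");
        }
        int testEnd = testStart + testSize;
        List<String> test = lines.subList(testStart, testEnd);
        // Training data is everything before and after the test range.
        List<String> training = new ArrayList<>(lines.subList(0, testStart));
        training.addAll(lines.subList(testEnd, lines.size()));

        Path name = inputFile.getFileName();
        Files.write(trainingDir.resolve(name), training, charset);
        Files.write(testDir.resolve(name), test, charset);
    }

    public static void main(String[] args) throws IOException {
        Path work = Files.createTempDirectory("split-sketch");
        Path trainingDir = Files.createDirectory(work.resolve("training"));
        Path testDir = Files.createDirectory(work.resolve("test"));
        Path input = work.resolve("samples.txt");
        Files.write(input, Arrays.asList("s1", "s2", "s3", "s4", "s5"), StandardCharsets.UTF_8);

        split(input, trainingDir, testDir, 1, 2, StandardCharsets.UTF_8);
        System.out.println("test:     " + Files.readAllLines(testDir.resolve("samples.txt")));
        System.out.println("training: " + Files.readAllLines(trainingDir.resolve("samples.txt")));
    }
}
```

Because each line is a complete sample, the split never needs to parse the line contents.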

getTestSplitSize

public int getTestSplitSize()

setTestSplitSize

public void setTestSplitSize(int testSplitSize)

getTestSplitPct

public int getTestSplitPct()

setTestSplitPct

public void setTestSplitPct(int testSplitPct)
Sets the percentage of the input data to allocate to the test split

Parameters:
testSplitPct - a value between 0 and 100 inclusive.

getSplitLocation

public int getSplitLocation()

setSplitLocation

public void setSplitLocation(int splitLocation)
Set the location of the start of the test/training data split, expressed as a percentage of lines. For example, 0 indicates that the test data should be taken from the start of the file, 100 indicates that the test data should be taken from the end of the input file, and 25 indicates that the test data should start one quarter of the way into the file.

This option is only relevant in cases where random selection is not employed.

Parameters:
splitLocation - a value between 0 and 100 inclusive.

getCharset

public Charset getCharset()

setCharset

public void setCharset(Charset charset)
Set the charset used to read and write files


getInputDirectory

public org.apache.hadoop.fs.Path getInputDirectory()

setInputDirectory

public void setInputDirectory(org.apache.hadoop.fs.Path inputDir)
Set the directory from which input data will be read when the splitDirectory() method is invoked.


getTrainingOutputDirectory

public org.apache.hadoop.fs.Path getTrainingOutputDirectory()

setTrainingOutputDirectory

public void setTrainingOutputDirectory(org.apache.hadoop.fs.Path trainingOutputDir)
Set the directory to which training data will be written.


getTestOutputDirectory

public org.apache.hadoop.fs.Path getTestOutputDirectory()

setTestOutputDirectory

public void setTestOutputDirectory(org.apache.hadoop.fs.Path testOutputDir)
Set the directory to which test data will be written.


getCallback

public SplitBayesInput.SplitCallback getCallback()

setCallback

public void setCallback(SplitBayesInput.SplitCallback callback)
Sets the callback used to inform the caller that an input file has been successfully split


getTestRandomSelectionSize

public int getTestRandomSelectionSize()

setTestRandomSelectionSize

public void setTestRandomSelectionSize(int testRandomSelectionSize)
Sets the number of random input samples that will be saved to the test set.


getTestRandomSelectionPct

public int getTestRandomSelectionPct()

setTestRandomSelectionPct

public void setTestRandomSelectionPct(int randomSelectionPct)
Sets the number of random input samples that will be saved to the test set as a percentage of the size of the input set.

Parameters:
randomSelectionPct - a value between 0 and 100 inclusive.
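
One plausible way to turn a random selection percentage into a set of test lines is sketched below. It is not taken from the Mahout source: the names RandomSelectionSketch, sizeFromPct and chooseTestLines are hypothetical, rounding the percentage up is an assumption, and shuffling an in-memory index list is a stand-in for whatever sampling strategy the class actually uses.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Illustrative only: picks which line indices go to the test set. */
final class RandomSelectionSketch {

    /** Converts a percentage of the input into a line count, rounding up (assumption). */
    static int sizeFromPct(int lineCount, int pct) {
        if (pct < 0 || pct > 100) {
            throw new IllegalArgumentException("pct must be in [0, 100]");
        }
        return (int) (((long) lineCount * pct + 99) / 100);
    }

    /** Returns testSize distinct zero-based line indices in ascending order. */
    static List<Integer> chooseTestLines(int lineCount, int testSize, Random random) {
        if (testSize < 0 || testSize > lineCount) {
            throw new IllegalArgumentException("testSize must be in [0, lineCount]");
        }
        List<Integer> indices = new ArrayList<>(lineCount);
        for (int i = 0; i < lineCount; i++) {
            indices.add(i);
        }
        // Shuffle, then keep the first testSize entries as the random sample.
        Collections.shuffle(indices, random);
        List<Integer> chosen = new ArrayList<>(indices.subList(0, testSize));
        Collections.sort(chosen);
        return chosen;
    }

    public static void main(String[] args) {
        int lineCount = 20;                        // hypothetical input size
        int testSize = sizeFromPct(lineCount, 25); // 5 lines
        System.out.println(chooseTestLines(lineCount, testSize, new Random()));
    }
}
```

Lines whose index is not in the returned list would go to the training set.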

validate

public void validate()
              throws IOException
Validates that the current instance is in a consistent state

Throws:
IllegalArgumentException - if settings violate class invariants.
IOException - if output directories do not exist or are not directories.
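
The kind of invariant that validate() enforces on the test set size options can be sketched as follows. This is an assumption-laden illustration, not the class's code: the sentinel UNSET = -1, the class SplitOptionsCheckSketch, and the requirement that exactly one option be set (rather than at most one) are all hypothetical.

```java
/** Illustrative only: checks that exactly one test-size option is configured. */
final class SplitOptionsCheckSketch {

    /** Hypothetical sentinel meaning "this option was never set". */
    static final int UNSET = -1;

    static void requireExactlyOne(int testSplitSize, int testSplitPct,
                                  int testRandomSelectionSize, int testRandomSelectionPct) {
        int configured = 0;
        for (int value : new int[] {testSplitSize, testSplitPct,
                                    testRandomSelectionSize, testRandomSelectionPct}) {
            if (value != UNSET) {
                configured++;
            }
        }
        if (configured != 1) {
            throw new IllegalArgumentException(
                "Exactly one test set size option must be set, but " + configured + " were");
        }
        // Percentages are documented as values between 0 and 100 inclusive.
        requirePct("testSplitPct", testSplitPct);
        requirePct("testRandomSelectionPct", testRandomSelectionPct);
    }

    private static void requirePct(String name, int value) {
        if (value != UNSET && (value < 0 || value > 100)) {
            throw new IllegalArgumentException(name + " must be in [0, 100]: " + value);
        }
    }

    public static void main(String[] args) {
        requireExactlyOne(150, UNSET, UNSET, UNSET); // accepted
        try {
            requireExactlyOne(150, 10, UNSET, UNSET); // two options set
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```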

countLines

public static int countLines(org.apache.hadoop.fs.FileSystem fs,
                             org.apache.hadoop.fs.Path inputFile,
                             Charset charset)
                      throws IOException
Counts the lines in the specified file, where a line is whatever BufferedReader.readLine() returns.

Parameters:
fs - the file system on which the input file resides
inputFile - the file whose lines will be counted
charset - the charset of the file to read
Returns:
the number of lines in the input file.
Throws:
IOException - if there is a problem opening or reading the file.
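
Counting lines "as returned by BufferedReader.readLine()" can be shown with a minimal standard-library sketch. It reads from a local java.nio.file.Path rather than a Hadoop FileSystem, and the class name CountLinesSketch is hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Illustrative only: counts lines the way BufferedReader.readLine() sees them. */
final class CountLinesSketch {

    static int countLines(Path inputFile, Charset charset) throws IOException {
        int count = 0;
        // try-with-resources closes the reader even if reading fails.
        try (BufferedReader reader = Files.newBufferedReader(inputFile, charset)) {
            while (reader.readLine() != null) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("count-lines", ".txt");
        Files.write(file, "one\ntwo\nthree\n".getBytes(StandardCharsets.UTF_8));
        System.out.println(countLines(file, StandardCharsets.UTF_8));
    }
}
```

With readLine(), a trailing line terminator does not produce an extra empty line, so a file ending in a newline and one without it report the same count for the same content.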


Copyright © 2008-2010 The Apache Software Foundation. All Rights Reserved.