java.lang.Object
  org.apache.mahout.classifier.bayes.SplitBayesInput
public class SplitBayesInput
A utility for splitting files in the input format used by the Bayes classifiers into training and test sets in order to perform cross-validation. This class is not strictly confined to working with the Bayes classifier input. It can be used for any input files where each line is a complete sample.
This class can be used to split directories of files or individual files into training and test sets using a number of different methods.
When executed via splitDirectory(Path) or splitFile(Path), the lines read from one or more input files are written to files of the same name in the directories specified by the setTestOutputDirectory(Path) and setTrainingOutputDirectory(Path) methods.
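The output layout described above can be illustrated with a minimal, standard-library-only sketch. It assumes local files accessed through java.nio rather than the Hadoop FileSystem used by this class, and the class name SameNameSplitSketch, its split method, and the testStart/testSize parameters are hypothetical names introduced for illustration only.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

/** Hypothetical stand-in that mirrors the output layout described above; not the Mahout class. */
final class SameNameSplitSketch {

  /**
   * Writes lines [testStart, testStart + testSize) of inputFile to a file of the same
   * name in testDir, and every other line to a file of the same name in trainingDir.
   */
  static void split(Path inputFile, Path trainingDir, Path testDir,
                    int testStart, int testSize, Charset charset) throws IOException {
    Path name = inputFile.getFileName();
    List<String> lines = Files.readAllLines(inputFile, charset);
    try (BufferedWriter training = Files.newBufferedWriter(trainingDir.resolve(name), charset);
         BufferedWriter test = Files.newBufferedWriter(testDir.resolve(name), charset)) {
      for (int i = 0; i < lines.size(); i++) {
        boolean inTest = i >= testStart && i < testStart + testSize;
        BufferedWriter out = inTest ? test : training;
        out.write(lines.get(i));
        out.newLine();
      }
    }
  }

  public static void main(String[] args) throws IOException {
    Path work = Files.createTempDirectory("split-sketch");
    Path training = Files.createDirectory(work.resolve("training"));
    Path test = Files.createDirectory(work.resolve("test"));
    Path input = Files.write(work.resolve("samples.txt"),
        List.of("a", "b", "c", "d"), StandardCharsets.UTF_8);

    split(input, training, test, 1, 2, StandardCharsets.UTF_8);

    System.out.println(Files.readAllLines(test.resolve("samples.txt")));     // [b, c]
    System.out.println(Files.readAllLines(training.resolve("samples.txt"))); // [a, d]
  }
}
```

Both output files are named samples.txt, so the test and training portions of an input file can be matched up by name.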
The composition of the test set is determined using one of the following approaches:
- A contiguous block of the input is selected using the setTestSplitSize(int) or setTestSplitPct(int) methods. setTestSplitSize(int) allocates a fixed number of items, while setTestSplitPct(int) allocates a percentage of the original input, rounded up to the nearest integer. setSplitLocation(int) controls the position in the input from which the test data is extracted and is described further below.
- A random sample of the input is selected using the setTestRandomSelectionSize(int) or setTestRandomSelectionPct(int) methods, each choosing a fixed test set size or a percentage of the input set size as described above. The RandomSampler class from mahout-math is used to create a sample of the appropriate size.

Any one of the methods above can be used to control the size of the test set. If multiple methods are called, a runtime exception will be thrown at execution time.
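The sizing rules above can be sketched as follows. This is an illustration under stated assumptions, not the class's actual code: it treats a value of 0 or less as "not set", reads "rounded up" as a ceiling, and uses IllegalArgumentException as the runtime exception. The class TestSetSizeSketch and its method are hypothetical.

```java
/** Hypothetical helper illustrating the sizing rules above; not part of Mahout. */
final class TestSetSizeSketch {

  /**
   * Returns the number of lines to place in the test set.
   * Assumption of this sketch: a setting of 0 or less means "not set",
   * and exactly one of the four settings must be set.
   */
  static int testSetSize(int lineCount, int splitSize, int splitPct,
                         int randomSize, int randomPct) {
    int configured = 0;
    for (int value : new int[] {splitSize, splitPct, randomSize, randomPct}) {
      if (value > 0) {
        configured++;
      }
    }
    if (configured != 1) {
      throw new IllegalArgumentException(
          "Exactly one test set sizing method must be set, found " + configured);
    }

    int pct = Math.max(splitPct, randomPct);
    if (pct > 0) {
      if (pct > 100) {
        throw new IllegalArgumentException("Percentage must be between 0 and 100: " + pct);
      }
      // Integer ceiling of lineCount * pct / 100 (the "rounded up" reading of the text).
      return (int) (((long) lineCount * pct + 99) / 100);
    }

    // Fixed-size variants: the configured size is used as-is.
    return Math.max(splitSize, randomSize);
  }

  public static void main(String[] args) {
    System.out.println(testSetSize(1500, 0, 10, 0, 0)); // 150
    System.out.println(testSetSize(7, 0, 0, 0, 50));    // 4 (3.5 rounded up)
  }
}
```

Rejecting more than one configured method up front keeps the later split logic free of precedence rules between the four setters.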
The setSplitLocation(int) method is passed an integer from 0 to 100 (inclusive) which is translated into the position of the start of the test data within the input file.
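One plausible way to translate a split location into a starting line is sketched below. The formula and the clamping step are assumptions made for illustration (the text above does not spell out the arithmetic), and SplitLocationSketch and testStart are hypothetical names.

```java
/** Hypothetical illustration of mapping a 0-100 split location to a line index. */
final class SplitLocationSketch {

  /**
   * Returns the zero-based index of the first test line.
   * Assumption of this sketch: the location is a percentage of the line count,
   * and the start is moved back if the test block would run past the end of the input.
   */
  static int testStart(int lineCount, int testSize, int splitLocation) {
    if (splitLocation < 0 || splitLocation > 100) {
      throw new IllegalArgumentException("splitLocation must be between 0 and 100 inclusive");
    }
    if (testSize < 0 || testSize > lineCount) {
      throw new IllegalArgumentException("testSize must be between 0 and lineCount");
    }
    int start = (int) ((long) lineCount * splitLocation / 100);
    // Clamp so that the whole test block fits inside the input.
    return Math.min(start, lineCount - testSize);
  }

  public static void main(String[] args) {
    // 1500 input lines with a test set of 150 lines.
    System.out.println(testStart(1500, 150, 0));   // 0
    System.out.println(testStart(1500, 150, 25));  // 375
    System.out.println(testStart(1500, 150, 100)); // 1350
  }
}
```

Under these assumptions a location of 100 selects the last testSize lines of the input rather than an empty range.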
Given:
Nested Class Summary | |
---|---|
static interface | SplitBayesInput.SplitCallback - Used to pass information back to a caller once a file has been split without the need for a data object |
Constructor Summary | |
---|---|
SplitBayesInput() | |
Method Summary | |
---|---|
static int | countLines(org.apache.hadoop.fs.FileSystem fs, org.apache.hadoop.fs.Path inputFile, Charset charset) - Count the lines in the specified file, as returned by BufferedReader.readLine(). |
SplitBayesInput.SplitCallback | getCallback() |
Charset | getCharset() |
org.apache.hadoop.fs.Path | getInputDirectory() |
int | getSplitLocation() |
org.apache.hadoop.fs.Path | getTestOutputDirectory() |
int | getTestRandomSelectionPct() |
int | getTestRandomSelectionSize() |
int | getTestSplitPct() |
int | getTestSplitSize() |
org.apache.hadoop.fs.Path | getTrainingOutputDirectory() |
static void | main(String[] args) |
boolean | parseArgs(String[] args) - Configure this instance based on the command-line arguments contained within the provided array. |
void | setCallback(SplitBayesInput.SplitCallback callback) - Sets the callback used to inform the caller that an input file has been successfully split. |
void | setCharset(Charset charset) - Set the charset used to read and write files. |
void | setInputDirectory(org.apache.hadoop.fs.Path inputDir) - Set the directory from which input data will be read when the splitDirectory() method is invoked. |
void | setSplitLocation(int splitLocation) - Set the location of the start of the test/training data split. |
void | setTestOutputDirectory(org.apache.hadoop.fs.Path testOutputDir) - Set the directory to which test data will be written. |
void | setTestRandomSelectionPct(int randomSelectionPct) - Sets the number of random input samples that will be saved to the test set, as a percentage of the size of the input set. |
void | setTestRandomSelectionSize(int testRandomSelectionSize) - Sets the number of random input samples that will be saved to the test set. |
void | setTestSplitPct(int testSplitPct) - Sets the percentage of the input data to allocate to the test split. |
void | setTestSplitSize(int testSplitSize) |
void | setTrainingOutputDirectory(org.apache.hadoop.fs.Path trainingOutputDir) - Set the directory to which training data will be written. |
void | splitDirectory() - Perform a split on the directory specified by setInputDirectory(Path) by calling splitFile(Path) on each file found within that directory. |
void | splitDirectory(org.apache.hadoop.fs.Path inputDir) - Perform a split on the specified directory by calling splitFile(Path) on each file found within that directory. |
void | splitFile(org.apache.hadoop.fs.Path inputFile) - Perform a split on the specified input file. |
void | validate() - Validates that the current instance is in a consistent state. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Constructor Detail |
---|
public SplitBayesInput() throws IOException
Throws:
IOException
Method Detail |
---|
public static void main(String[] args) throws Exception
Throws:
Exception

public boolean parseArgs(String[] args) throws Exception
Configure this instance based on the command-line arguments contained within the provided array. Calls validate() to ensure consistency of configuration.
Throws:
Exception - if there is a problem parsing the command-line arguments or the particular combination would violate class invariants.

public void splitDirectory() throws IOException
Perform a split on the directory specified by setInputDirectory(Path) by calling splitFile(Path) on each file found within that directory.
Throws:
IOException

public void splitDirectory(org.apache.hadoop.fs.Path inputDir) throws IOException
Perform a split on the specified directory by calling splitFile(Path) on each file found within that directory.
Throws:
IOException

public void splitFile(org.apache.hadoop.fs.Path inputFile) throws IOException
Perform a split on the specified input file. The validate() method is called prior to executing the split.
Throws:
IOException

public int getTestSplitSize()
public void setTestSplitSize(int testSplitSize)
public int getTestSplitPct()
public void setTestSplitPct(int testSplitPct)
Sets the percentage of the input data to allocate to the test split.
Parameters:
testSplitPct - a value between 0 and 100 inclusive.

public int getSplitLocation()

public void setSplitLocation(int splitLocation)
Set the location of the start of the test/training data split. This option is only relevant in cases where random selection is not employed.
Parameters:
splitLocation - a value between 0 and 100 inclusive.

public Charset getCharset()
public void setCharset(Charset charset)
public org.apache.hadoop.fs.Path getInputDirectory()
public void setInputDirectory(org.apache.hadoop.fs.Path inputDir)
Set the directory from which input data will be read when the splitDirectory() method is invoked.
public org.apache.hadoop.fs.Path getTrainingOutputDirectory()
public void setTrainingOutputDirectory(org.apache.hadoop.fs.Path trainingOutputDir)
public org.apache.hadoop.fs.Path getTestOutputDirectory()
public void setTestOutputDirectory(org.apache.hadoop.fs.Path testOutputDir)
public SplitBayesInput.SplitCallback getCallback()
public void setCallback(SplitBayesInput.SplitCallback callback)
public int getTestRandomSelectionSize()
public void setTestRandomSelectionSize(int testRandomSelectionSize)
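The random selection setters ask for a test set made of a given number of randomly chosen input lines. A minimal sketch of drawing such a sample is shown below. It deliberately swaps in java.util.Random with a partial Fisher-Yates shuffle in place of the RandomSampler class from mahout-math, and the names RandomSelectionSketch and chooseTestLines are hypothetical.

```java
import java.util.BitSet;
import java.util.Random;

/** Hypothetical illustration of random test-line selection; not the Mahout implementation. */
final class RandomSelectionSketch {

  /** Marks exactly sampleSize distinct line indices in [0, lineCount) as test lines. */
  static BitSet chooseTestLines(int lineCount, int sampleSize, Random random) {
    if (sampleSize < 0 || sampleSize > lineCount) {
      throw new IllegalArgumentException("sampleSize must be between 0 and lineCount");
    }
    int[] indices = new int[lineCount];
    for (int i = 0; i < lineCount; i++) {
      indices[i] = i;
    }
    BitSet selected = new BitSet(lineCount);
    // Partial Fisher-Yates shuffle: each step moves one uniformly chosen,
    // not yet selected index into position i and marks it as a test line.
    for (int i = 0; i < sampleSize; i++) {
      int j = i + random.nextInt(lineCount - i);
      int tmp = indices[i];
      indices[i] = indices[j];
      indices[j] = tmp;
      selected.set(indices[i]);
    }
    return selected;
  }

  public static void main(String[] args) {
    BitSet testLines = chooseTestLines(1500, 150, new Random(42L));
    System.out.println(testLines.cardinality()); // 150
  }
}
```

While streaming the input, a line whose index is set in the returned BitSet would go to the test file and every other line to the training file.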
public int getTestRandomSelectionPct()
public void setTestRandomSelectionPct(int randomSelectionPct)
Sets the number of random input samples that will be saved to the test set, as a percentage of the size of the input set.
Parameters:
randomSelectionPct - a value between 0 and 100 inclusive.

public void validate() throws IOException
Validates that the current instance is in a consistent state.
Throws:
IllegalArgumentException - if settings violate class invariants.
IOException - if output directories do not exist or are not directories.

public static int countLines(org.apache.hadoop.fs.FileSystem fs, org.apache.hadoop.fs.Path inputFile, Charset charset) throws IOException
Count the lines in the specified file, as returned by BufferedReader.readLine().
Parameters:
inputFile - the file whose lines will be counted
charset - the charset of the file to read
Throws:
IOException - if there is a problem opening or reading the file.
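Counting lines "as returned by BufferedReader.readLine()" can be sketched with the standard library alone. The example below assumes a local file opened through java.nio instead of the Hadoop FileSystem and Path taken by the real method; the class name CountLinesSketch is hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical local-file analogue of countLines; not the Mahout method. */
final class CountLinesSketch {

  /** Counts the lines of inputFile by calling readLine() until it returns null. */
  static int countLines(Path inputFile, Charset charset) throws IOException {
    int count = 0;
    try (BufferedReader reader = Files.newBufferedReader(inputFile, charset)) {
      while (reader.readLine() != null) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) throws IOException {
    Path file = Files.createTempFile("count-lines", ".txt");
    // The last line has no trailing newline but is still returned by readLine().
    Files.write(file, "a\nb\nc".getBytes(StandardCharsets.UTF_8));
    System.out.println(countLines(file, StandardCharsets.UTF_8)); // 3
  }
}
```

Because readLine() treats the end of the file as a line terminator, a final line without a trailing newline is still counted.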