org.apache.mahout.df.mapreduce
Class Builder

java.lang.Object
  extended by org.apache.mahout.df.mapreduce.Builder
Direct Known Subclasses:
InMemBuilder, PartialBuilder

public abstract class Builder
extends java.lang.Object

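Base class for Mapred DecisionForest builders. Takes care of storing the parameters common to the mapred implementations.
The child classes must implement at least the two abstract methods: configureJob, to configure the job, and parseOutput, to extract the trees from the job output and pass the predictions to the callback.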
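
The relationship between build, configureJob, runJob and parseOutput can be sketched as a template method. This is only an illustration: the types below are hypothetical stand-ins rather than the Hadoop or Mahout classes, and the call order inside build (configure, run, parse) is an assumption inferred from the method descriptions on this page.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the abstract builder; not the Mahout class.
abstract class SketchBuilder {
    // Records the order in which the hooks run, for illustration only.
    protected final List<String> trace = new ArrayList<>();

    // Subclasses configure the (simulated) job here.
    protected abstract void configureJob(int nbTrees, boolean oobEstimate);

    // Subclasses turn the (simulated) job output into a result.
    protected abstract String parseOutput();

    // Default execution step; a sequential implementation could override it.
    protected boolean runJob() {
        trace.add("runJob");
        return true;
    }

    // Assumed control flow: configure, run, then parse.
    public String build(int nbTrees) {
        configureJob(nbTrees, false);
        if (!runJob()) {
            throw new IllegalStateException("job failed");
        }
        return parseOutput();
    }
}

// Hypothetical concrete builder that supplies the two required hooks.
class SketchForestBuilder extends SketchBuilder {
    private int nbTrees;

    @Override
    protected void configureJob(int nbTrees, boolean oobEstimate) {
        this.nbTrees = nbTrees;
        trace.add("configureJob");
    }

    @Override
    protected String parseOutput() {
        trace.add("parseOutput");
        return "forest with " + nbTrees + " trees";
    }
}
```

A caller only ever invokes build; the subclass decides how the job is configured and how its output is read back.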


Constructor Summary
protected Builder(TreeBuilder treeBuilder, org.apache.hadoop.fs.Path dataPath, org.apache.hadoop.fs.Path datasetPath, java.lang.Long seed, org.apache.hadoop.conf.Configuration conf)
           
 
Method Summary
 DecisionForest build(int nbTrees, PredictionCallback callback)
           
protected abstract  void configureJob(org.apache.hadoop.mapreduce.Job job, int nbTrees, boolean oobEstimate)
          Used by the inheriting classes to configure the job
protected  org.apache.hadoop.fs.Path getDataPath()
           
protected  org.apache.hadoop.fs.Path getDatasetPath()
           
static org.apache.hadoop.fs.Path getDistributedCacheFile(org.apache.hadoop.conf.Configuration conf, int index)
          Helper method.
static int getNbTrees(org.apache.hadoop.conf.Configuration conf)
          Get the number of trees for the map-reduce job.
static int getNumMaps(org.apache.hadoop.conf.Configuration conf)
          Return the value of "mapred.map.tasks".
 org.apache.hadoop.fs.Path getOutputPath(org.apache.hadoop.conf.Configuration conf)
          Output Directory name
static java.lang.Long getRandomSeed(org.apache.hadoop.conf.Configuration conf)
          Returns the random seed
protected  java.lang.Long getSeed()
           
protected  TreeBuilder getTreeBuilder()
           
static TreeBuilder getTreeBuilder(org.apache.hadoop.conf.Configuration conf)
           
protected static boolean isOobEstimate(org.apache.hadoop.conf.Configuration conf)
           
protected static boolean isOutput(org.apache.hadoop.conf.Configuration conf)
          Used only for DEBUG purposes.
static Dataset loadDataset(org.apache.hadoop.conf.Configuration conf)
          Helper method.
protected abstract  DecisionForest parseOutput(org.apache.hadoop.mapreduce.Job job, PredictionCallback callback)
          Parse the output files to extract the trees and pass the predictions to the callback
protected  boolean runJob(org.apache.hadoop.mapreduce.Job job)
          Sequential implementations should override this method to simulate the job execution
static void setNbTrees(org.apache.hadoop.conf.Configuration conf, int nbTrees)
          Set the number of trees to grow for the map-reduce job
 void setOutputDirName(java.lang.String name)
          Sets the output directory name; the directory will be created in the working directory
static void sortSplits(org.apache.hadoop.mapreduce.InputSplit[] splits)
          Sorts the splits by size, so that the biggest go first.
This is the same code used by Hadoop's JobClient.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

Builder

protected Builder(TreeBuilder treeBuilder,
                  org.apache.hadoop.fs.Path dataPath,
                  org.apache.hadoop.fs.Path datasetPath,
                  java.lang.Long seed,
                  org.apache.hadoop.conf.Configuration conf)
Method Detail

getTreeBuilder

protected TreeBuilder getTreeBuilder()

getDataPath

protected org.apache.hadoop.fs.Path getDataPath()

getDatasetPath

protected org.apache.hadoop.fs.Path getDatasetPath()

getSeed

protected java.lang.Long getSeed()

getNumMaps

public static int getNumMaps(org.apache.hadoop.conf.Configuration conf)
Return the value of "mapred.map.tasks". In case the 'local' runner is detected, returns 1

Parameters:
conf - configuration
Returns:
number of map tasks
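
A minimal sketch of the behaviour described above, using a plain Map as a stand-in for Hadoop's Configuration. The way the 'local' runner is detected here (a "mapred.job.tracker" value that is missing or equal to "local") and the -1 fallback are assumptions, not something this page states.

```java
import java.util.Map;

class NumMapsSketch {
    // A plain Map stands in for Hadoop's Configuration (assumption for this sketch).
    static int getNumMaps(Map<String, String> conf) {
        // Assumed detection rule: the job tracker setting is missing or "local".
        String tracker = conf.getOrDefault("mapred.job.tracker", "local");
        if ("local".equals(tracker)) {
            return 1; // the local runner is treated as a single map task
        }
        // Returning -1 when the key is absent is also an assumption.
        String value = conf.get("mapred.map.tasks");
        return value == null ? -1 : Integer.parseInt(value);
    }
}
```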

isOutput

protected static boolean isOutput(org.apache.hadoop.conf.Configuration conf)
Used only for DEBUG purposes. If false, the mappers don't output anything, so the builder has nothing to process

Parameters:
conf - configuration
Returns:
true if the builder has to return output, false otherwise

isOobEstimate

protected static boolean isOobEstimate(org.apache.hadoop.conf.Configuration conf)

getRandomSeed

public static java.lang.Long getRandomSeed(org.apache.hadoop.conf.Configuration conf)
Returns the random seed

Parameters:
conf - configuration
Returns:
null if no seed is available

getTreeBuilder

public static TreeBuilder getTreeBuilder(org.apache.hadoop.conf.Configuration conf)

getNbTrees

public static int getNbTrees(org.apache.hadoop.conf.Configuration conf)
Get the number of trees for the map-reduce job.

Parameters:
conf - configuration
Returns:
number of trees to build

setNbTrees

public static void setNbTrees(org.apache.hadoop.conf.Configuration conf,
                              int nbTrees)
Set the number of trees to grow for the map-reduce job

Parameters:
conf - configuration
nbTrees - number of trees to build
Throws:
java.lang.IllegalArgumentException - if (nbTrees <= 0)
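
The setter/getter pair and its argument check can be sketched as follows. A Map stands in for the Configuration, and the key name "sketch.nbtrees" is hypothetical; this page does not give the real property name or the value returned when nothing was set.

```java
import java.util.Map;

class NbTreesSketch {
    // Hypothetical key name; the real property name is not given on this page.
    static final String NB_TREES_KEY = "sketch.nbtrees";

    static void setNbTrees(Map<String, String> conf, int nbTrees) {
        // Mirrors the documented contract: reject nbTrees <= 0.
        if (nbTrees <= 0) {
            throw new IllegalArgumentException("nbTrees should be greater than 0");
        }
        conf.put(NB_TREES_KEY, Integer.toString(nbTrees));
    }

    static int getNbTrees(Map<String, String> conf) {
        // Returning -1 for an unset value is an assumption of this sketch.
        String value = conf.get(NB_TREES_KEY);
        return value == null ? -1 : Integer.parseInt(value);
    }
}
```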

setOutputDirName

public void setOutputDirName(java.lang.String name)
Sets the output directory name; the directory will be created in the working directory

Parameters:
name - output dir. name

getOutputPath

public org.apache.hadoop.fs.Path getOutputPath(org.apache.hadoop.conf.Configuration conf)
                                        throws java.io.IOException
Output Directory name

Parameters:
conf - configuration
Returns:
output dir. path (%WORKING_DIRECTORY%/%OUTPUT_DIR_NAME%)
Throws:
java.io.IOException - if we cannot get the default FileSystem
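
How the output path is composed from the working directory and the configured name can be illustrated with java.nio.file.Path. This is a sketch only: the real method works with Hadoop's Path and FileSystem, and the default directory name used below is an assumption.

```java
import java.nio.file.Path;

class OutputPathSketch {
    // The default name is an assumption; the page does not state one.
    private String outputDirName = "output";

    void setOutputDirName(String name) {
        this.outputDirName = name;
    }

    // Resolves the output directory name against the working directory.
    Path getOutputPath(Path workingDirectory) {
        return workingDirectory.resolve(outputDirName);
    }
}
```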

getDistributedCacheFile

public static org.apache.hadoop.fs.Path getDistributedCacheFile(org.apache.hadoop.conf.Configuration conf,
                                                                int index)
                                                         throws java.io.IOException
Helper method. Get a path from the DistributedCache

Parameters:
conf - configuration
index - index of the path in the DistributedCache files
Returns:
path from the DistributedCache
Throws:
java.io.IOException - if no path is found

loadDataset

public static Dataset loadDataset(org.apache.hadoop.conf.Configuration conf)
                           throws java.io.IOException
Helper method. Load a Dataset stored in the DistributedCache

Parameters:
conf - configuration
Returns:
loaded Dataset
Throws:
java.io.IOException - if we cannot retrieve the Dataset path from the DistributedCache, or the Dataset could not be loaded

configureJob

protected abstract void configureJob(org.apache.hadoop.mapreduce.Job job,
                                     int nbTrees,
                                     boolean oobEstimate)
                              throws java.io.IOException
Used by the inheriting classes to configure the job

Parameters:
job - Hadoop's Job
nbTrees - number of trees to grow
oobEstimate - true, if oob error should be estimated
Throws:
java.io.IOException - if anything goes wrong while configuring the job

runJob

protected boolean runJob(org.apache.hadoop.mapreduce.Job job)
                  throws java.lang.ClassNotFoundException,
                         java.io.IOException,
                         java.lang.InterruptedException
Sequential implementations should override this method to simulate the job execution

Parameters:
job - Hadoop's job
Returns:
true if the job succeeded
Throws:
java.lang.ClassNotFoundException
java.io.IOException
java.lang.InterruptedException

parseOutput

protected abstract DecisionForest parseOutput(org.apache.hadoop.mapreduce.Job job,
                                              PredictionCallback callback)
                                       throws java.io.IOException,
                                              java.lang.ClassNotFoundException,
                                              java.lang.InterruptedException
Parse the output files to extract the trees and pass the predictions to the callback

Parameters:
job - Hadoop's job
callback - can be null
Returns:
Built DecisionForest
Throws:
java.io.IOException - if anything goes wrong while parsing the output
java.lang.ClassNotFoundException
java.lang.InterruptedException

build

public DecisionForest build(int nbTrees,
                            PredictionCallback callback)
                     throws java.io.IOException,
                            java.lang.ClassNotFoundException,
                            java.lang.InterruptedException
Throws:
java.io.IOException
java.lang.ClassNotFoundException
java.lang.InterruptedException

sortSplits

public static void sortSplits(org.apache.hadoop.mapreduce.InputSplit[] splits)
Sorts the splits by size, so that the biggest go first.
This is the same code used by Hadoop's JobClient.

Parameters:
splits - input splits
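
The ordering described above (largest split first) can be sketched with a descending comparator. The Split class is a hypothetical stand-in carrying only a name and a length; the real method operates on Hadoop InputSplit objects.

```java
import java.util.Arrays;
import java.util.Comparator;

class SortSplitsSketch {
    // Hypothetical stand-in for an input split: a name and a length in bytes.
    static final class Split {
        final String name;
        final long length;

        Split(String name, long length) {
            this.name = name;
            this.length = length;
        }
    }

    // Sorts the array in place, largest split first.
    static void sortSplits(Split[] splits) {
        Arrays.sort(splits, new Comparator<Split>() {
            @Override
            public int compare(Split a, Split b) {
                return Long.compare(b.length, a.length); // descending by size
            }
        });
    }
}
```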


Copyright © 2008-2010 The Apache Software Foundation. All Rights Reserved.