org.apache.mahout.clustering.dirichlet
Class DirichletDriver

java.lang.Object
  extended by org.apache.hadoop.conf.Configured
      extended by org.apache.mahout.common.AbstractJob
          extended by org.apache.mahout.clustering.dirichlet.DirichletDriver
All Implemented Interfaces:
org.apache.hadoop.conf.Configurable, org.apache.hadoop.util.Tool

public class DirichletDriver
extends AbstractJob


Field Summary
static String ALPHA_0_KEY
           
static String ALPHA_OPTION
           
static String EMIT_MOST_LIKELY_KEY
           
static String MODEL_DISTRIBUTION_CLASS_OPTION
           
static String MODEL_DISTRIBUTION_KEY
           
static String MODEL_PROTOTYPE_CLASS_OPTION
           
static String NUM_CLUSTERS_KEY
           
static String STATE_IN_KEY
           
static String THRESHOLD_KEY
           
 
Constructor Summary
DirichletDriver()
           
 
Method Summary
static org.apache.hadoop.fs.Path buildClusters(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.Path input, org.apache.hadoop.fs.Path output, DistributionDescription description, int numClusters, int maxIterations, double alpha0, boolean runSequential)
          Iterate over the input vectors to produce cluster directories for each iteration
static void clusterData(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.Path input, org.apache.hadoop.fs.Path stateIn, org.apache.hadoop.fs.Path output, boolean emitMostLikely, double threshold, boolean runSequential)
          Run the job using supplied arguments
static void main(String[] args)
           
static int readPrototypeSize(org.apache.hadoop.fs.Path input)
          Read the first input vector to determine the prototype size for the modelPrototype
static void run(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.Path input, org.apache.hadoop.fs.Path output, DistributionDescription description, int numModels, int maxIterations, double alpha0, boolean runClustering, boolean emitMostLikely, double threshold, boolean runSequential)
          Iterate over the input vectors to produce clusters and, if requested, use the results of the final iteration to cluster the input vectors.
static void run(org.apache.hadoop.fs.Path input, org.apache.hadoop.fs.Path output, DistributionDescription description, int numClusters, int maxIterations, double alpha0, boolean runClustering, boolean emitMostLikely, double threshold, boolean runSequential)
          Convenience method that provides a default Configuration. Iterate over the input vectors to produce clusters and, if requested, use the results of the final iteration to cluster the input vectors.
 int run(String[] args)
           
 
Methods inherited from class org.apache.mahout.common.AbstractJob
addFlag, addInputOption, addOption, addOption, addOption, addOption, addOutputOption, buildOption, getAnalyzerClassFromOption, getCLIOption, getCombinedTempPath, getGroup, getInputPath, getOption, getOption, getOutputPath, getOutputPath, getTempPath, getTempPath, hasOption, keyFor, maybePut, parseArguments, parseDirectories, prepareJob, prepareJob, prepareJob, setS3SafeCombinedInputPath, shouldRunNextPhase
 
Methods inherited from class org.apache.hadoop.conf.Configured
getConf, setConf
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf, setConf
 

Field Detail

STATE_IN_KEY

public static final String STATE_IN_KEY
See Also:
Constant Field Values

MODEL_DISTRIBUTION_KEY

public static final String MODEL_DISTRIBUTION_KEY
See Also:
Constant Field Values

NUM_CLUSTERS_KEY

public static final String NUM_CLUSTERS_KEY
See Also:
Constant Field Values

ALPHA_0_KEY

public static final String ALPHA_0_KEY
See Also:
Constant Field Values

EMIT_MOST_LIKELY_KEY

public static final String EMIT_MOST_LIKELY_KEY
See Also:
Constant Field Values

THRESHOLD_KEY

public static final String THRESHOLD_KEY
See Also:
Constant Field Values

MODEL_PROTOTYPE_CLASS_OPTION

public static final String MODEL_PROTOTYPE_CLASS_OPTION
See Also:
Constant Field Values

MODEL_DISTRIBUTION_CLASS_OPTION

public static final String MODEL_DISTRIBUTION_CLASS_OPTION
See Also:
Constant Field Values

ALPHA_OPTION

public static final String ALPHA_OPTION
See Also:
Constant Field Values
Constructor Detail

DirichletDriver

public DirichletDriver()
Method Detail

main

public static void main(String[] args)
                 throws Exception
Throws:
Exception

run

public int run(String[] args)
        throws Exception
Throws:
Exception

run

public static void run(org.apache.hadoop.conf.Configuration conf,
                       org.apache.hadoop.fs.Path input,
                       org.apache.hadoop.fs.Path output,
                       DistributionDescription description,
                       int numModels,
                       int maxIterations,
                       double alpha0,
                       boolean runClustering,
                       boolean emitMostLikely,
                       double threshold,
                       boolean runSequential)
                throws IOException,
                       ClassNotFoundException,
                       InterruptedException
Iterate over the input vectors to produce clusters and, if requested, use the results of the final iteration to cluster the input vectors.

Parameters:
conf - the Configuration to use
input - the directory Path for input points
output - the directory Path for output points
description - model distribution parameters
numModels - the number of models to iterate over
maxIterations - the maximum number of iterations
alpha0 - the alpha_0 value for the DirichletDistribution
runClustering - if true, cluster the input points after the iterations complete
emitMostLikely - if true, emit only the most likely cluster for each point
threshold - when emitMostLikely is false, emit every cluster whose pdf for the point is greater than this value
runSequential - execute sequentially if true
Throws:
IOException
ClassNotFoundException
InterruptedException
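
The two phases described above (build the clusters, then optionally cluster the points with the final state) can be sketched as follows. This is a minimal, self-contained illustration and not the driver's implementation: the ClusterBuilder and PointClusterer interfaces, the String paths, the "clusteredPoints" subdirectory name and the returned value are all assumptions made for the example (the real method returns void and works on Hadoop Path objects).

```java
final class TwoPhaseRunSketch {

  /** Hypothetical stand-in for the cluster-building phase; returns the final clusters directory. */
  interface ClusterBuilder {
    String buildClusters(String input, String output, int maxIterations);
  }

  /** Hypothetical stand-in for the point-assignment phase. */
  interface PointClusterer {
    void clusterData(String input, String stateIn, String output);
  }

  /**
   * Build clusters first, then cluster the input points only if requested.
   * Returns the final clusters directory so the caller can inspect it.
   */
  static String run(String input, String output, int maxIterations, boolean runClustering,
                    ClusterBuilder builder, PointClusterer clusterer) {
    String finalClusters = builder.buildClusters(input, output, maxIterations);
    if (runClustering) {
      // The final iteration's state is the input state for point assignment.
      // "clusteredPoints" is an assumed output subdirectory name.
      clusterer.clusterData(input, finalClusters, output + "/clusteredPoints");
    }
    return finalClusters;
  }

  public static void main(String[] args) {
    String result = run("in", "out", 5, true,
        (in, out, n) -> out + "/clusters-" + n,
        (in, state, out) -> System.out.println("assign " + in + " using " + state + " into " + out));
    System.out.println(result);
  }
}
```

With runClustering set to false, only the first phase runs and no point assignments are written.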

run

public static void run(org.apache.hadoop.fs.Path input,
                       org.apache.hadoop.fs.Path output,
                       DistributionDescription description,
                       int numClusters,
                       int maxIterations,
                       double alpha0,
                       boolean runClustering,
                       boolean emitMostLikely,
                       double threshold,
                       boolean runSequential)
                throws IOException,
                       ClassNotFoundException,
                       InterruptedException
Convenience method that provides a default Configuration. Iterate over the input vectors to produce clusters and, if requested, use the results of the final iteration to cluster the input vectors.

Parameters:
input - the directory Path for input points
output - the directory Path for output points
description - model distribution parameters
numClusters - the number of models to iterate over
maxIterations - the maximum number of iterations
alpha0 - the alpha_0 value for the DirichletDistribution
runClustering - if true, cluster the input points after the iterations complete
emitMostLikely - if true, emit only the most likely cluster for each point
threshold - when emitMostLikely is false, emit every cluster whose pdf for the point is greater than this value
runSequential - execute sequentially if true
Throws:
IOException
ClassNotFoundException
InterruptedException

readPrototypeSize

public static int readPrototypeSize(org.apache.hadoop.fs.Path input)
                             throws IOException
Read the first input vector to determine the prototype size for the modelPrototype

Throws:
IOException
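
The idea of taking the prototype size from the first input vector can be sketched without Hadoop as follows. This is an illustration only: the firstVectorSize name, the use of double[] for vectors and the choice to throw IOException on empty input are assumptions of the sketch, not documented behavior of readPrototypeSize.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

final class PrototypeSizeSketch {

  /** Returns the dimensionality of the first vector; later vectors are never read. */
  static int firstVectorSize(Iterable<double[]> vectors) throws IOException {
    Iterator<double[]> it = vectors.iterator();
    if (!it.hasNext()) {
      // Assumed behavior for this sketch: empty input is reported as an error.
      throw new IOException("no input vectors found");
    }
    return it.next().length;
  }

  public static void main(String[] args) throws IOException {
    List<double[]> input = Arrays.asList(new double[] {1.0, 2.0, 3.0}, new double[] {4.0, 5.0, 6.0});
    System.out.println(firstVectorSize(input)); // prints 3
  }
}
```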

buildClusters

public static org.apache.hadoop.fs.Path buildClusters(org.apache.hadoop.conf.Configuration conf,
                                                      org.apache.hadoop.fs.Path input,
                                                      org.apache.hadoop.fs.Path output,
                                                      DistributionDescription description,
                                                      int numClusters,
                                                      int maxIterations,
                                                      double alpha0,
                                                      boolean runSequential)
                                               throws IOException,
                                                      ClassNotFoundException,
                                                      InterruptedException
Iterate over the input vectors to produce cluster directories for each iteration

Parameters:
conf - the Configuration to use
input - the directory Path for input points
output - the directory Path for output points
description - model distribution parameters
numClusters - the number of models to iterate over
maxIterations - the maximum number of iterations
alpha0 - the alpha_0 value for the DirichletDistribution
runSequential - execute sequentially if true
Returns:
the Path of the final clusters directory
Throws:
IOException
ClassNotFoundException
InterruptedException
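
The per-iteration directory chaining described above, where each iteration reads the previous state and writes a new cluster directory, can be sketched as follows. The "clusters-N" directory names, the IterationStep interface and the String paths are assumptions made for this example; the sketch models no early termination and is not the driver's implementation.

```java
final class IterationDirsSketch {

  /** Hypothetical stand-in for one pass over the input vectors. */
  interface IterationStep {
    void run(String stateIn, String stateOut);
  }

  /**
   * Runs maxIterations passes, feeding each pass the directory written by the
   * previous one, and returns the directory written by the last pass.
   */
  static String buildClusters(String output, int maxIterations, IterationStep step) {
    String stateIn = output + "/clusters-0"; // assumed name of the initial state directory
    for (int i = 1; i <= maxIterations; i++) {
      String stateOut = output + "/clusters-" + i; // assumed per-iteration naming
      step.run(stateIn, stateOut);
      stateIn = stateOut; // the next pass reads what this pass wrote
    }
    return stateIn;
  }

  public static void main(String[] args) {
    String last = buildClusters("out", 3,
        (in, out) -> System.out.println(in + " -> " + out));
    System.out.println("final: " + last);
  }
}
```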

clusterData

public static void clusterData(org.apache.hadoop.conf.Configuration conf,
                               org.apache.hadoop.fs.Path input,
                               org.apache.hadoop.fs.Path stateIn,
                               org.apache.hadoop.fs.Path output,
                               boolean emitMostLikely,
                               double threshold,
                               boolean runSequential)
                        throws IOException,
                               InterruptedException,
                               ClassNotFoundException
Run the job using supplied arguments

Parameters:
conf - the Configuration to use
input - the directory Path for input points
stateIn - the directory Path for input state
output - the directory Path for output points
emitMostLikely - if true, emit only the most likely cluster for each point
threshold - when emitMostLikely is false, emit every cluster whose pdf for the point is greater than this value
runSequential - execute sequentially if true
Throws:
IOException
InterruptedException
ClassNotFoundException
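
How emitMostLikely and threshold interact can be illustrated with a small, self-contained sketch. It assumes the per-cluster pdf values for a point are already available as a double[] and that clusters are identified by array index; the selectClusters helper is hypothetical and only mirrors the parameter descriptions above, not the job's actual code.

```java
import java.util.ArrayList;
import java.util.List;

final class EmitSelectionSketch {

  /**
   * Returns the indices of the clusters to emit for one point.
   * pdfs[i] is the (assumed precomputed) pdf of the point under cluster i.
   */
  static List<Integer> selectClusters(double[] pdfs, boolean emitMostLikely, double threshold) {
    List<Integer> selected = new ArrayList<>();
    if (pdfs.length == 0) {
      return selected;
    }
    if (emitMostLikely) {
      // Emit only the single cluster with the highest pdf; threshold is ignored.
      int best = 0;
      for (int i = 1; i < pdfs.length; i++) {
        if (pdfs[i] > pdfs[best]) {
          best = i;
        }
      }
      selected.add(best);
    } else {
      // Emit every cluster whose pdf is strictly greater than the threshold.
      for (int i = 0; i < pdfs.length; i++) {
        if (pdfs[i] > threshold) {
          selected.add(i);
        }
      }
    }
    return selected;
  }

  public static void main(String[] args) {
    double[] pdfs = {0.10, 0.65, 0.25};
    System.out.println(selectClusters(pdfs, true, 0.0));  // prints [1]
    System.out.println(selectClusters(pdfs, false, 0.2)); // prints [1, 2]
  }
}
```

In this sketch a point can be emitted for several clusters, or for none, when emitMostLikely is false, depending on how the threshold compares with its pdf values.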


Copyright © 2008-2012 The Apache Software Foundation. All Rights Reserved.