org.apache.mahout.clustering
Class ClusterClassifier

java.lang.Object
  extended by org.apache.mahout.classifier.AbstractVectorClassifier
      extended by org.apache.mahout.clustering.ClusterClassifier
All Implemented Interfaces:
Closeable, org.apache.hadoop.io.Writable, OnlineLearner

public class ClusterClassifier
extends AbstractVectorClassifier
implements OnlineLearner, org.apache.hadoop.io.Writable

This classifier works with any Cluster implementation. It is initialized with a list of compatible clusters and can thereafter classify any new Vector into one or more of the clusters, based on the pdf() function that each cluster supports. It is also an OnlineLearner and can be trained: training asks the model at index actual to observe the vector, and closing the classifier causes every model to computeParameters.
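
The classify/train/close cycle described above can be illustrated with a minimal, self-contained sketch. ToyCluster and ToyClusterClassifier are hypothetical stand-ins that work on plain doubles; they are not Mahout's Cluster or ClusterClassifier, and the unit-variance density and mean-based update are assumptions chosen only to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a cluster model; not Mahout's Cluster interface. */
final class ToyCluster {
    private double center; // current 1-D center used by pdf()
    private double sum;    // sum of values observed since the last computeParameters()
    private int count;     // number of values observed since the last computeParameters()

    ToyCluster(double center) { this.center = center; }

    /** Unnormalized Gaussian-shaped density with unit variance (an assumption). */
    double pdf(double x) {
        double d = x - center;
        return Math.exp(-0.5 * d * d);
    }

    void observe(double x) { sum += x; count++; }

    /** Moves the center to the mean of the observations, then clears them. */
    void computeParameters() {
        if (count > 0) {
            center = sum / count;
        }
        sum = 0.0;
        count = 0;
    }

    double center() { return center; }
}

/** Hypothetical classifier over ToyCluster models. */
final class ToyClusterClassifier {
    private final List<ToyCluster> models;

    ToyClusterClassifier(List<ToyCluster> models) { this.models = new ArrayList<>(models); }

    /** Returns one score per model: each pdf divided by the sum of all pdfs. */
    double[] classifyFull(double x) {
        double[] scores = new double[models.size()];
        double total = 0.0;
        for (int i = 0; i < scores.length; i++) {
            scores[i] = models.get(i).pdf(x);
            total += scores[i];
        }
        for (int i = 0; i < scores.length; i++) {
            // Fall back to uniform scores if every pdf underflowed to zero.
            scores[i] = total > 0.0 ? scores[i] / total : 1.0 / scores.length;
        }
        return scores;
    }

    /** Training asks the model at index actual to observe the value. */
    void train(int actual, double x) { models.get(actual).observe(x); }

    /** Closing asks every model to recompute its parameters. */
    void close() {
        for (ToyCluster m : models) {
            m.computeParameters();
        }
    }
}
```

In this sketch the classifier can keep accepting train calls after close, which matches the OnlineLearner contract quoted under close below.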


Constructor Summary
ClusterClassifier()
           
ClusterClassifier(List<Cluster> models)
          The public constructor accepts a list of clusters to become the models
 
Method Summary
 Vector classify(Vector instance)
          Classify a vector returning a vector of numCategories-1 scores.
 double classifyScalar(Vector instance)
          Classifies a vector in the special case of a binary classifier where AbstractVectorClassifier.classify(Vector) would return a vector with only one element.
 void close()
          Prepares the classifier for classification and deallocates any temporary data structures.
 List<Cluster> getModels()
           
 int numCategories()
          Returns the number of categories for the target variable.
 void readFields(DataInput in)
           
 void train(int actual, Vector instance)
          Updates the model using a particular target variable value and a feature vector.
 void train(int actual, Vector data, double weight)
          Train the models given an additional weight.
 void train(long trackingKey, int actual, Vector instance)
          Updates the model using a particular target variable value and a feature vector.
 void train(long trackingKey, String groupKey, int actual, Vector instance)
          Updates the model using a particular target variable value and a feature vector.
 void write(DataOutput out)
           
 
Methods inherited from class org.apache.mahout.classifier.AbstractVectorClassifier
classify, classifyFull, classifyFull, classifyFull, classifyNoLink, classifyScalar, logLikelihood
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

ClusterClassifier

public ClusterClassifier(List<Cluster> models)
The public constructor accepts a list of clusters to become the models

Parameters:
models - a List of Cluster instances to use as the models

ClusterClassifier

public ClusterClassifier()

Method Detail

classify

public Vector classify(Vector instance)
Description copied from class: AbstractVectorClassifier
Classify a vector returning a vector of numCategories-1 scores. It is assumed that the score for the missing category is one minus the sum of the scores that are returned. Note that the missing score is the 0-th score.

Specified by:
classify in class AbstractVectorClassifier
Parameters:
instance - A feature vector to be classified.
Returns:
A vector of probabilities in 1 of n-1 encoding.
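
As an illustration of the encoding described above, the following sketch restores the omitted 0-th score from the n-1 returned scores. It uses plain double arrays rather than Mahout's Vector, and the class and method names (ScoreEncoding, expand) are hypothetical.

```java
/** Hypothetical helper illustrating the n-1 score encoding; not part of Mahout. */
final class ScoreEncoding {
    private ScoreEncoding() {}

    /**
     * Expands n-1 scores (categories 1..n-1) into n scores by restoring
     * category 0 as one minus the sum of the others.
     */
    static double[] expand(double[] partial) {
        double[] full = new double[partial.length + 1];
        double sum = 0.0;
        for (int i = 0; i < partial.length; i++) {
            full[i + 1] = partial[i]; // shift right: slot 0 is the missing category
            sum += partial[i];
        }
        full[0] = 1.0 - sum;
        return full;
    }
}
```

For example, expanding the two scores 0.2 and 0.5 gives three scores whose first element is approximately 0.3.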

classifyScalar

public double classifyScalar(Vector instance)
Description copied from class: AbstractVectorClassifier
Classifies a vector in the special case of a binary classifier where AbstractVectorClassifier.classify(Vector) would return a vector with only one element. As such, using this method can avoid the allocation of a vector.

Specified by:
classifyScalar in class AbstractVectorClassifier
Parameters:
instance - The feature vector to be classified.
Returns:
The score for category 1.
See Also:
AbstractVectorClassifier.classify(Vector)

numCategories

public int numCategories()
Description copied from class: AbstractVectorClassifier
Returns the number of categories for the target variable. A vector classifier will encode its output using a zero-based 1 of numCategories encoding.

Specified by:
numCategories in class AbstractVectorClassifier
Returns:
The number of categories.

write

public void write(DataOutput out)
           throws IOException
Specified by:
write in interface org.apache.hadoop.io.Writable
Throws:
IOException

readFields

public void readFields(DataInput in)
                throws IOException
Specified by:
readFields in interface org.apache.hadoop.io.Writable
Throws:
IOException
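
This page gives no description for write and readFields. Under Hadoop's Writable contract, write serializes an object's fields to a DataOutput and readFields restores them from a DataInput. The sketch below shows that round trip using only java.io; ToyModelState and its centers field are hypothetical, and the byte layout is an assumption that says nothing about how ClusterClassifier actually serializes its models.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical model state mirroring the Writable method shapes without Hadoop. */
final class ToyModelState {
    double[] centers = new double[0];

    void write(DataOutput out) throws IOException {
        out.writeInt(centers.length); // length prefix so readFields knows how much to read
        for (double c : centers) {
            out.writeDouble(c);
        }
    }

    void readFields(DataInput in) throws IOException {
        int n = in.readInt();
        double[] read = new double[n];
        for (int i = 0; i < n; i++) {
            read[i] = in.readDouble();
        }
        centers = read; // replace any existing state entirely
    }

    /** Serializes the given state to bytes and reads it back into a new instance. */
    static ToyModelState roundTrip(ToyModelState original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                original.write(out);
            }
            ToyModelState copy = new ToyModelState();
            try (DataInputStream in =
                     new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                copy.readFields(in);
            }
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A no-argument constructor followed by readFields is the usual way such objects are rebuilt, which is why the sketch creates an empty ToyModelState before reading.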

train

public void train(int actual,
                  Vector instance)
Description copied from interface: OnlineLearner
Updates the model using a particular target variable value and a feature vector.

There may be an assumption that if multiple passes through the training data are necessary, then the training examples will be presented in the same order. This is because the order of training examples may be used to assign records to different data splits for evaluation by cross-validation. If the order changes between passes, a record might be assigned to both training and test splits, and error estimates could be seriously affected.

If re-ordering is necessary, use the alternative API, which allows a tracking key to be attached to each training example.

Specified by:
train in interface OnlineLearner
Parameters:
actual - The value of the target variable. This value should be in the half-open interval [0..n) where n is the number of target categories.
instance - The feature vector for this example.

train

public void train(int actual,
                  Vector data,
                  double weight)
Train the models given an additional weight. This method is unique to ClusterClassifier.

Parameters:
actual - the int index of a model
data - a data Vector
weight - a double weighting factor
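
This page does not say how the weight is applied. One common interpretation, shown here purely as an assumption, is that each observation contributes to the model's running statistics in proportion to its weight. WeightedMean is a hypothetical class over plain doubles, not a Mahout type.

```java
/** Hypothetical weighted running mean; an assumed reading of "weighting factor". */
final class WeightedMean {
    private double weightSum;   // total weight observed so far
    private double weightedSum; // sum of weight * value

    /** Records one value, counted weight times. */
    void observe(double value, double weight) {
        weightSum += weight;
        weightedSum += weight * value;
    }

    /** Returns the weighted mean of everything observed so far. */
    double mean() {
        if (weightSum == 0.0) {
            throw new IllegalStateException("no observations");
        }
        return weightedSum / weightSum;
    }
}
```

Under this reading, observing a value with weight 2.0 has the same effect on the mean as observing it twice with weight 1.0.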

train

public void train(long trackingKey,
                  String groupKey,
                  int actual,
                  Vector instance)
Description copied from interface: OnlineLearner
Updates the model using a particular target variable value and a feature vector.

There may be an assumption that, if multiple passes through the training data are necessary, the tracking key for a record will be the same on each pass, that there will be a relatively large number of distinct tracking keys, and that the low-order bits of the tracking keys will not correlate with any of the input variables. This tracking key is used to assign training examples to different test/training splits.

Examples of useful tracking keys include ID numbers for the training records, derived from a database ID for the base table from which the record is derived, or the offset of the original data record in a data file.

Specified by:
train in interface OnlineLearner
Parameters:
trackingKey - The tracking key for this training example.
groupKey - An optional value that allows examples to be grouped in the computation of the update to the model.
actual - The value of the target variable. This value should be in the half-open interval [0..n) where n is the number of target categories.
instance - The feature vector for this example.
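
The description above says the tracking key assigns examples to test/training splits but does not give the rule. The sketch below assumes a simple modulo over the key, which is one scheme that depends on the key's low-order bits; SplitAssigner, foldFor, and isHeldOut are hypothetical names, not part of the OnlineLearner API.

```java
/** Hypothetical split assignment by tracking key; the modulo rule is an assumption. */
final class SplitAssigner {
    private SplitAssigner() {}

    /** Maps a tracking key to a fold in [0, folds); the same key always gives the same fold. */
    static int foldFor(long trackingKey, int folds) {
        if (folds <= 0) {
            throw new IllegalArgumentException("folds must be positive");
        }
        // floorMod keeps the result non-negative even for negative keys.
        return (int) Math.floorMod(trackingKey, (long) folds);
    }

    /** True if the example with this key belongs to the held-out (test) fold. */
    static boolean isHeldOut(long trackingKey, int folds, int heldOutFold) {
        return foldFor(trackingKey, folds) == heldOutFold;
    }
}
```

Because the fold depends only on the key, a record stays in the same split across passes even if the training data is re-ordered. It also shows why keys whose low-order bits correlate with an input variable would bias the splits.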

train

public void train(long trackingKey,
                  int actual,
                  Vector instance)
Description copied from interface: OnlineLearner
Updates the model using a particular target variable value and a feature vector.

There may be an assumption that, if multiple passes through the training data are necessary, the tracking key for a record will be the same on each pass, that there will be a relatively large number of distinct tracking keys, and that the low-order bits of the tracking keys will not correlate with any of the input variables. This tracking key is used to assign training examples to different test/training splits.

Examples of useful tracking keys include ID numbers for the training records, derived from a database ID for the base table from which the record is derived, or the offset of the original data record in a data file.

Specified by:
train in interface OnlineLearner
Parameters:
trackingKey - The tracking key for this training example.
actual - The value of the target variable. This value should be in the half-open interval [0..n) where n is the number of target categories.
instance - The feature vector for this example.

close

public void close()
Description copied from interface: OnlineLearner
Prepares the classifier for classification and deallocates any temporary data structures. An online classifier should be able to accept more training after being closed, but closing the classifier may make classification more efficient.

Specified by:
close in interface Closeable
Specified by:
close in interface OnlineLearner

getModels

public List<Cluster> getModels()


Copyright © 2008-2012 The Apache Software Foundation. All Rights Reserved.