org.apache.mahout.classifier
Class AbstractVectorClassifier

java.lang.Object
  extended by org.apache.mahout.classifier.AbstractVectorClassifier
Direct Known Subclasses:
AbstractNaiveBayesClassifier, AbstractOnlineLogisticRegression, CrossFoldLearner, VectorModelClassifier

public abstract class AbstractVectorClassifier
extends java.lang.Object

Defines the interface for classifiers that take input as a vector. This is implemented as an abstract class so that it can implement a number of handy convenience methods related to classification of vectors.


Constructor Summary
AbstractVectorClassifier()
           
 
Method Summary
 Matrix classify(Matrix data)
          Returns n-1 probabilities, one for each category but the 0-th, for each row of a matrix.
abstract  Vector classify(Vector instance)
          Classify a vector returning a vector of numCategories-1 scores.
 Matrix classifyFull(Matrix data)
          Returns n probabilities, one for each category, for each row of a matrix.
 Vector classifyFull(Vector instance)
          Returns n probabilities, one for each category.
 Vector classifyFull(Vector r, Vector instance)
          Returns n probabilities, one for each category, into a pre-allocated vector.
 Vector classifyNoLink(Vector features)
          Classify a vector, but don't apply the inverse link function.
 Vector classifyScalar(Matrix data)
          Returns a vector of probabilities of the first category, one for each row of a matrix.
abstract  double classifyScalar(Vector instance)
          Classifies a vector in the special case of a binary classifier where classify(Vector) would return a vector with only one element.
 double logLikelihood(int actual, Vector data)
          Returns a measure of how good the classification for a particular example actually is.
abstract  int numCategories()
          Returns the number of categories for the target variable.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

AbstractVectorClassifier

public AbstractVectorClassifier()
Method Detail

numCategories

public abstract int numCategories()
Returns the number of categories for the target variable. A vector classifier will encode its output using a zero-based 1 of numCategories encoding.

Returns:
The number of categories.

classify

public abstract Vector classify(Vector instance)
Classify a vector returning a vector of numCategories-1 scores. It is assumed that the score for the missing category is one minus the sum of the scores that are returned. Note that the missing score is the 0-th score.

Parameters:
instance - A feature vector to be classified.
Returns:
A vector of probabilities in 1 of n-1 encoding.

classifyNoLink

public Vector classifyNoLink(Vector features)
Classify a vector, but don't apply the inverse link function. For logistic regression and other generalized linear models, this is just the linear part of the classification.

Parameters:
features - A feature vector to be classified.
Returns:
A vector of scores. If transformed by the inverse link function, these will become probabilities.
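
The relationship between the linear part and the final probability can be sketched as follows, assuming a binary logistic regression model. The class name, the coefficient array beta, and both method names are hypothetical; plain arrays stand in for Mahout's Vector, and Mahout's own subclasses may compute this differently.

```java
// Illustrative sketch only; not part of the Mahout API.
final class InverseLinkSketch {
    private InverseLinkSketch() {}

    /** Linear part of the model: dot product of coefficients and features, no link applied. */
    static double linearScore(double[] beta, double[] features) {
        if (beta.length != features.length) {
            throw new IllegalArgumentException("beta and features must have the same length");
        }
        double sum = 0.0;
        for (int i = 0; i < beta.length; i++) {
            sum += beta[i] * features[i];
        }
        return sum;
    }

    /** Logistic inverse link (assumed here): maps a real-valued score into (0, 1). */
    static double logistic(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }
}
```

In this sketch, linearScore plays the role of the no-link result, and logistic(linearScore(beta, features)) plays the role of the probability a binary classifier would report.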

classifyScalar

public abstract double classifyScalar(Vector instance)
Classifies a vector in the special case of a binary classifier where classify(Vector) would return a vector with only one element. As such, using this method can avoid the allocation of a vector.

Parameters:
instance - The feature vector to be classified.
Returns:
The score for category 1.
See Also:
classify(Vector)

classifyFull

public Vector classifyFull(Vector instance)
Returns n probabilities, one for each category. If you can use an n-1 coding, and are touchy about allocation performance, then the classify method is probably better to use. The 0-th element of the score vector returned by this method is the missing score as computed by the classify method.

Parameters:
instance - A vector of features to be classified.
Returns:
A vector of probabilities, one for each category.
See Also:
classify(Vector), classifyFull(Vector r, Vector instance)

classifyFull

public Vector classifyFull(Vector r,
                           Vector instance)
Returns n probabilities, one for each category, into a pre-allocated vector. One vector allocation is still done in the process of multiplying by the coefficient matrix, but that is hard to avoid. The cost of such an ephemeral allocation is very small in any case compared to the multiplication itself.

Parameters:
r - Where to put the results.
instance - A vector of features to be classified.
Returns:
A vector of probabilities, one for each category.
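
A minimal sketch of how n-1 scores might be expanded into a caller-supplied result of length n, with the 0-th slot holding the missing score. Plain double[] arrays stand in for Mahout's Vector, and the class and method names are hypothetical.

```java
// Illustrative sketch only; not part of the Mahout API.
final class FullScoresSketch {
    private FullScoresSketch() {}

    /**
     * Writes n scores into the pre-allocated array r and returns r.
     * r[0] receives one minus the sum of the supplied scores;
     * r[1..n-1] receive the supplied n-1 scores in order.
     */
    static double[] fillFull(double[] r, double[] scores) {
        if (r.length != scores.length + 1) {
            throw new IllegalArgumentException("r must have exactly one more element than scores");
        }
        double sum = 0.0;
        for (int i = 0; i < scores.length; i++) {
            r[i + 1] = scores[i];
            sum += scores[i];
        }
        r[0] = 1.0 - sum;
        return r;
    }
}
```

Because the same array is returned, a caller can allocate r once and reuse it across many instances.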

classify

public Matrix classify(Matrix data)
Returns n-1 probabilities, one for each category but the 0-th, for each row of a matrix. The probability of the missing 0-th category is 1 - rowSum(this result).

Parameters:
data - The matrix whose rows are vectors to classify.
Returns:
A matrix of scores, one row per row of the input matrix, one column for each category but the 0-th.
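
The 1 - rowSum relationship for the matrix form can be sketched row by row. A double[][] stands in for Mahout's Matrix here, and the class and method names are hypothetical.

```java
// Illustrative sketch only; not part of the Mahout API.
final class RowwiseMissingSketch {
    private RowwiseMissingSketch() {}

    /** For each row of n-1 scores, returns 1 - rowSum: the implied probability of category 0. */
    static double[] missingPerRow(double[][] scores) {
        double[] missing = new double[scores.length];
        for (int row = 0; row < scores.length; row++) {
            double sum = 0.0;
            for (double s : scores[row]) {
                sum += s;
            }
            missing[row] = 1.0 - sum;
        }
        return missing;
    }
}
```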

classifyFull

public Matrix classifyFull(Matrix data)
Returns n probabilities, one for each category, for each row of a matrix.

Parameters:
data - The matrix whose rows are vectors to classify
Returns:
A matrix of scores, one row per row of the input matrix, one column for each category.

classifyScalar

public Vector classifyScalar(Matrix data)
Returns a vector of probabilities of the first category, one for each row of a matrix. This only makes sense if there are exactly two categories, but calling this method in that case can save a number of vector allocations.

Parameters:
data - The matrix whose rows are vectors to classify
Returns:
A vector of scores, with one value per row of the input matrix.

logLikelihood

public double logLikelihood(int actual,
                            Vector data)
Returns a measure of how good the classification for a particular example actually is.

Parameters:
actual - The correct category for the example.
data - The vector to be classified.
Returns:
The log likelihood of the correct answer as estimated by the current model. This will always be <= 0, and a larger value (closer to 0) indicates better accuracy. To simplify code that maintains running averages, this value is bounded below at -100.
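
One way to picture the bounded value is the sketch below. It assumes the measure is the natural log of the probability assigned to the correct category, taken from a full vector of n probabilities. The class, constant, and method names are hypothetical, and a plain array stands in for the classifier's output.

```java
// Illustrative sketch only; not part of the Mahout API.
final class LogLikelihoodSketch {
    private LogLikelihoodSketch() {}

    /** Lower bound on the returned value, as stated in the description above. */
    static final double MIN_LOG_LIKELIHOOD = -100.0;

    /**
     * fullScores holds n probabilities, one per category;
     * actual is the zero-based index of the correct category.
     */
    static double logLikelihood(int actual, double[] fullScores) {
        // Math.log(0) is negative infinity, so the bound keeps the result finite.
        return Math.max(MIN_LOG_LIKELIHOOD, Math.log(fullScores[actual]));
    }
}
```

With the bound in place, a single example that the model considers impossible contributes -100 to a running average instead of negative infinity.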


Copyright © 2008-2010 The Apache Software Foundation. All Rights Reserved.