Package org.apache.mahout.clustering

This package provides several clustering algorithm implementations.


Interface Summary
Cluster Implementations of this interface have a printable representation and certain attributes that are common across all clustering implementations.
GaussianAccumulator  
Model<O> A model is a probability distribution over observed data points and allows the probability of any data point to be computed.
ModelDistribution<O> A model distribution allows us to sample a model from its prior distribution.
 

Class Summary
AbstractCluster  
ClusterObservations  
DistanceMeasureCluster  
JsonDistanceMeasureAdapter  
JsonModelAdapter  
JsonModelDistributionAdapter  
OnlineGaussianAccumulator An online Gaussian statistics accumulator based on Knuth (who cites Welford), which is reported to be numerically stable.
RunningSumsGaussianAccumulator An online Gaussian accumulator that uses a running power sums approach, as described at http://en.wikipedia.org/wiki/Standard_deviation. Suffers from overflow, underflow, and roundoff error but has minimal observe-time overhead.
VectorModelClassifier This classifier works with any of the clustering Models.
WeightedVectorWritable  
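
The two accumulator descriptions above contrast a Welford-style online update with running power sums. The following is a minimal sketch of both ideas for scalar, unweighted observations. The class name WelfordSketch and all of its methods are hypothetical illustrations, not the API of OnlineGaussianAccumulator or RunningSumsGaussianAccumulator, whose signatures may differ.

```java
// Hypothetical illustration only; not a Mahout class.
class WelfordSketch {
    private long n;      // number of observations seen so far
    private double mean; // running mean
    private double m2;   // running sum of squared deviations from the mean

    /** Folds one observation into the running statistics (Welford's update). */
    void observe(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean); // uses the already-updated mean
    }

    long getN() { return n; }

    double getMean() { return mean; }

    /** Sample variance (n - 1 denominator); 0 with fewer than two observations. */
    double getVariance() {
        return n > 1 ? m2 / (n - 1) : 0.0;
    }

    /**
     * Running power sums variant: keeps s0 = count, s1 = sum of x, s2 = sum of x^2.
     * Cheap per observation, but s2 - s1*s1/s0 can lose precision when the
     * mean is large relative to the spread of the data.
     */
    static double runningSumsVariance(double[] xs) {
        double s0 = 0, s1 = 0, s2 = 0;
        for (double x : xs) {
            s0 += 1;
            s1 += x;
            s2 += x * x;
        }
        return s0 > 1 ? (s2 - s1 * s1 / s0) / (s0 - 1) : 0.0;
    }

    public static void main(String[] args) {
        double[] data = {2, 4, 4, 4, 5, 5, 7, 9};
        WelfordSketch acc = new WelfordSketch();
        for (double x : data) {
            acc.observe(x);
        }
        System.out.println("online mean = " + acc.getMean());
        System.out.println("online variance = " + acc.getVariance());
        System.out.println("running sums variance = " + runningSumsVariance(data));
    }
}
```

On small, well-scaled inputs like the one in main, both approaches should agree closely; the difference is expected to show up on data with a large mean and a small variance.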
 

Package org.apache.mahout.clustering Description

This package provides several clustering algorithm implementations. Clustering groups a set of objects so that similar items end up in the same group. The definition of similarity is usually up to you: for text documents, cosine distance (or cosine similarity) is recommended. Mahout also provides other distance measures, such as Euclidean distance.
The input to each clustering algorithm is a set of vectors representing your items. For text, these are typically TF-IDF or bag-of-words representations of the documents.
The output of each clustering algorithm is either a hard assignment (each item belongs to exactly one cluster) or a soft assignment (each item has a degree of membership in each cluster).
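
To make the terms above concrete, here is a minimal sketch of cosine and Euclidean distance on plain double arrays, together with a hard and a soft assignment of one item to a set of cluster centers. Everything in it is an assumption made for illustration: the class AssignmentSketch and its methods are hypothetical and do not use Mahout's vector or distance measure classes, and the exp(-distance) weighting in softAssign is just one simple way to produce membership weights, not the scheme any particular Mahout algorithm uses.

```java
// Hypothetical illustration only; not a Mahout class.
class AssignmentSketch {

    /** Euclidean distance between two vectors of equal length. */
    static double euclideanDistance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Cosine distance: 1 - cos(angle between a and b). */
    static double cosineDistance(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 1.0; // assumption: treat a zero vector as maximally dissimilar
        }
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Hard assignment: index of the center with the smallest cosine distance. */
    static int hardAssign(double[] item, double[][] centers) {
        int best = 0;
        for (int k = 1; k < centers.length; k++) {
            if (cosineDistance(item, centers[k]) < cosineDistance(item, centers[best])) {
                best = k;
            }
        }
        return best;
    }

    /** Soft assignment: weights proportional to exp(-distance), normalized to sum to 1. */
    static double[] softAssign(double[] item, double[][] centers) {
        double[] weights = new double[centers.length];
        double total = 0;
        for (int k = 0; k < centers.length; k++) {
            weights[k] = Math.exp(-cosineDistance(item, centers[k]));
            total += weights[k];
        }
        for (int k = 0; k < centers.length; k++) {
            weights[k] /= total;
        }
        return weights;
    }

    public static void main(String[] args) {
        double[] item = {1.0, 0.0};
        double[][] centers = {{1.0, 0.1}, {0.0, 1.0}};
        System.out.println("hard assignment: cluster " + hardAssign(item, centers));
        System.out.println("soft assignment: " + java.util.Arrays.toString(softAssign(item, centers)));
    }
}
```

Cosine distance ignores vector length, which is one reason it tends to suit term-weight vectors for documents of different sizes, while Euclidean distance is sensitive to magnitude.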



Copyright © 2008-2010 The Apache Software Foundation. All Rights Reserved.