org.apache.mahout.clustering.dirichlet
Class DirichletClusterer<O>

java.lang.Object
  extended by org.apache.mahout.clustering.dirichlet.DirichletClusterer<O>

public class DirichletClusterer<O>
extends java.lang.Object

Performs Bayesian mixture modeling.

The idea is to explain some observed data with a probabilistic mixture of a number of models. Each observed data point is assumed to have come from one of the models in the mixture, but we don't know which. We handle this with a so-called latent parameter that specifies which model each data point came from.

In addition, since this is a Bayesian clustering algorithm, we don't want to actually commit to any single explanation, but rather to sample from the distribution of models and latent assignments of data points to models given the observed data and the prior distributions of model parameters.

This sampling process is initialized by taking models at random from the prior distribution for models.

Then, we iteratively assign points to the different models using the mixture probabilities and the degree of fit between the point and each model expressed as a probability that the point was generated by that model.
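The assignment step can be illustrated with a minimal sketch. It is not this class's implementation: the one-dimensional normal model with a fixed standard deviation and all names below (AssignmentSketch, normalPdf, assignmentProbabilities, sample) are assumptions chosen for illustration. Each model's weight for a point is its mixture probability times the probability that the model generated the point, and the model is then drawn from the normalized weights.

```java
import java.util.Random;

// Hypothetical illustration only; not part of the Mahout API.
class AssignmentSketch {
    /** Density of a normal distribution, used here as a stand-in model. */
    static double normalPdf(double x, double mean, double sd) {
        double z = (x - mean) / sd;
        return Math.exp(-0.5 * z * z) / (sd * Math.sqrt(2.0 * Math.PI));
    }

    /** Returns mixture[k] * p(x | model k), normalized to sum to 1. */
    static double[] assignmentProbabilities(double x, double[] mixture,
                                            double[] means, double sd) {
        double[] p = new double[mixture.length];
        double total = 0.0;
        for (int k = 0; k < p.length; k++) {
            p[k] = mixture[k] * normalPdf(x, means[k], sd);
            total += p[k];
        }
        // Assumes at least one model gives the point a non-zero density.
        for (int k = 0; k < p.length; k++) {
            p[k] /= total;
        }
        return p;
    }

    /** Draws one index k with probability p[k]. */
    static int sample(double[] p, Random rng) {
        double u = rng.nextDouble();
        double cumulative = 0.0;
        for (int k = 0; k < p.length; k++) {
            cumulative += p[k];
            if (u < cumulative) {
                return k;
            }
        }
        return p.length - 1; // guard against floating-point rounding
    }

    public static void main(String[] args) {
        double[] mixture = {0.5, 0.5};
        double[] means = {0.0, 10.0};
        double[] p = assignmentProbabilities(0.1, mixture, means, 1.0);
        int z = sample(p, new Random(42));
        System.out.println("assigned to model " + z);
    }
}
```

Because the assignment is sampled rather than taken as the most probable model, a point that fits two models about equally well will move between them from one iteration to the next.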

After points are assigned, new parameters for each model are sampled from the posterior distribution for the model parameters considering all of the observed data points that were assigned to the model. Models without any data points are also sampled, but since they have no points assigned, the new samples are effectively taken from the prior distribution for model parameters.
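To make the "empty models are sampled from the prior" remark concrete, here is a sketch under an assumed conjugate setup: a normal model with known data variance and a normal prior on its mean. The class and method names are hypothetical and the actual models used with this class may differ. With no assigned points, the posterior reduces exactly to the prior.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical illustration only; not part of the Mahout API.
class PosteriorSketch {
    /**
     * Posterior over a normal mean with known data variance and a normal prior.
     * Returns {posteriorMean, posteriorVariance}.
     */
    static double[] posterior(List<Double> assigned, double priorMean,
                              double priorVar, double dataVar) {
        double sum = 0.0;
        for (double x : assigned) {
            sum += x;
        }
        // Precisions (inverse variances) add; an empty list leaves only the prior term.
        double precision = 1.0 / priorVar + assigned.size() / dataVar;
        double mean = (priorMean / priorVar + sum / dataVar) / precision;
        return new double[] {mean, 1.0 / precision};
    }

    /** Draws a new mean parameter from the posterior. */
    static double sampleMean(List<Double> assigned, double priorMean,
                             double priorVar, double dataVar, Random rng) {
        double[] post = posterior(assigned, priorMean, priorVar, dataVar);
        return post[0] + Math.sqrt(post[1]) * rng.nextGaussian();
    }

    public static void main(String[] args) {
        Random rng = new Random(7);
        List<Double> points = Arrays.asList(2.0, 2.1, 1.9, 2.0);
        List<Double> none = Collections.<Double>emptyList();
        System.out.println("model with points: " + sampleMean(points, 0.0, 4.0, 1.0, rng));
        System.out.println("empty model:       " + sampleMean(none, 0.0, 4.0, 1.0, rng));
    }
}
```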

The result is a number of samples that represent mixing probabilities, models and assignment of points to models. If the total number of possible models is substantially larger than the number that ever have points assigned to them, then this algorithm provides a (nearly) non-parametric clustering algorithm.

These samples can give us interesting information that is lacking from a normal clustering that consists of a single assignment of points to clusters. First, by counting the models in each sample that actually have points assigned to them, we can learn how many models (clusters) the data support.
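Counting occupied models per sample might look like the sketch below. It assumes each sample is available as an array of latent assignments (the model index of every point), which is a representation chosen for illustration and not the return type of cluster(). All names are hypothetical.

```java
// Hypothetical illustration only; not part of the Mahout API.
class OccupiedModelsSketch {
    /** Number of distinct models that have at least one point in this sample. */
    static int occupiedModels(int[] assignments, int numClusters) {
        boolean[] used = new boolean[numClusters];
        int count = 0;
        for (int z : assignments) {
            if (!used[z]) {
                used[z] = true;
                count++;
            }
        }
        return count;
    }

    /** histogram[c] is the number of samples with exactly c occupied models. */
    static int[] histogram(int[][] samples, int numClusters) {
        int[] histogram = new int[numClusters + 1];
        for (int[] sample : samples) {
            histogram[occupiedModels(sample, numClusters)]++;
        }
        return histogram;
    }

    public static void main(String[] args) {
        int[][] samples = {{0, 0, 1}, {2, 2, 2}, {0, 1, 1}};
        int[] h = histogram(samples, 4);
        for (int c = 0; c < h.length; c++) {
            System.out.println(c + " occupied models: " + h[c] + " sample(s)");
        }
    }
}
```

A histogram that concentrates on one count suggests the data support that many clusters; a spread-out histogram suggests the number is not well determined.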

Moreover, by examining how often two points are assigned to the same model, we can get an approximate measure of how likely these points are to be explained by the same model. Such soft membership information is difficult to come by with conventional clustering methods.
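The co-assignment frequency described above can be sketched as follows, again assuming (for illustration only) that each sample is an array holding the model index of every point. The class and method names are hypothetical.

```java
// Hypothetical illustration only; not part of the Mahout API.
class CoAssignmentSketch {
    /** Fraction of samples in which points a and b share the same model. */
    static double coAssignment(int[][] samples, int a, int b) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("at least one sample is required");
        }
        int together = 0;
        for (int[] assignments : samples) {
            if (assignments[a] == assignments[b]) {
                together++;
            }
        }
        return (double) together / samples.length;
    }

    public static void main(String[] args) {
        int[][] samples = {{0, 0, 1}, {1, 1, 1}, {0, 1, 1}, {2, 2, 0}};
        System.out.println("points 0 and 1: " + coAssignment(samples, 0, 1));
        System.out.println("points 0 and 2: " + coAssignment(samples, 0, 2));
    }
}
```

The measure compares assignments only within a sample, so it does not depend on how models are labeled from one sample to the next.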

Finally, we can get an idea of how stable the description of the data is. Typically, aspects of the data backed by many observations wind up with stable descriptions, while at the edges there are phenomena for which we can't commit to a solid description, even though it is clear that the well-supported explanations are insufficient to explain them.

One difficulty with these samples is that we can't always establish a correspondence between the models in different samples. Probably the best way to do this is to look for overlap in the assignments of data observations to the different models.
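One way to look for such overlap, offered only as a sketch, is to score every pair of models from two samples by the Jaccard overlap of the points assigned to them and keep the best-scoring counterpart. The choice of Jaccard overlap, the greedy matching, and all names below are assumptions for illustration. Samples are again represented as arrays of model indices per point.

```java
// Hypothetical illustration only; not part of the Mahout API.
class ModelMatchingSketch {
    /** Jaccard overlap between the points of model i in sample a and model j in sample b. */
    static double jaccard(int[] a, int i, int[] b, int j) {
        int both = 0;
        int either = 0;
        for (int p = 0; p < a.length; p++) {
            boolean inA = a[p] == i;
            boolean inB = b[p] == j;
            if (inA && inB) {
                both++;
            }
            if (inA || inB) {
                either++;
            }
        }
        return either == 0 ? 0.0 : (double) both / either;
    }

    /** For each model in sample a, the index of the best-overlapping model in sample b, or -1. */
    static int[] bestMatches(int[] a, int[] b, int numClusters) {
        int[] match = new int[numClusters];
        for (int i = 0; i < numClusters; i++) {
            int best = -1;
            double bestScore = 0.0;
            for (int j = 0; j < numClusters; j++) {
                double score = jaccard(a, i, b, j);
                if (score > bestScore) {
                    best = j;
                    bestScore = score;
                }
            }
            match[i] = best; // stays -1 when model i shares no points with any model in b
        }
        return match;
    }

    public static void main(String[] args) {
        int[] first = {0, 0, 1, 1, 2};
        int[] second = {1, 1, 0, 0, 0};
        int[] match = bestMatches(first, second, 3);
        for (int i = 0; i < match.length; i++) {
            System.out.println("model " + i + " in first sample ~ model " + match[i] + " in second");
        }
    }
}
```

The matching is greedy and not one-to-one: two models in the first sample may map to the same model in the second, which itself signals that a cluster was split or merged between samples.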

    \theta_i ~ prior()
    \lambda ~ Dirichlet(\alpha_0)
    z_j ~ Multinomial( \lambda )
    x_j ~ model(\theta_{z_j})
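The generative process above can be simulated with a short sketch. Several choices here are assumptions made for illustration: the prior is a wide normal, each model is a unit-variance normal, the Dirichlet is symmetric with every component equal to alpha, and the gamma draws use the Marsaglia-Tsang method. None of the names come from this class.

```java
import java.util.Random;

// Hypothetical illustration only; not part of the Mahout API.
class GenerativeSketch {
    /** Draws from Gamma(shape, 1) using the Marsaglia-Tsang method. */
    static double gamma(double shape, Random rng) {
        if (shape < 1.0) {
            // Boosting step: Gamma(a) = Gamma(a + 1) * U^(1/a)
            return gamma(shape + 1.0, rng) * Math.pow(rng.nextDouble(), 1.0 / shape);
        }
        double d = shape - 1.0 / 3.0;
        double c = 1.0 / Math.sqrt(9.0 * d);
        while (true) {
            double x = rng.nextGaussian();
            double v = 1.0 + c * x;
            if (v <= 0.0) {
                continue;
            }
            v = v * v * v;
            double u = rng.nextDouble();
            if (Math.log(u) < 0.5 * x * x + d - d * v + d * Math.log(v)) {
                return d * v;
            }
        }
    }

    /** lambda ~ Dirichlet(alpha, ..., alpha): normalized independent gamma draws. */
    static double[] dirichlet(int k, double alpha, Random rng) {
        double[] lambda = new double[k];
        double total = 0.0;
        for (int i = 0; i < k; i++) {
            lambda[i] = gamma(alpha, rng);
            total += lambda[i];
        }
        for (int i = 0; i < k; i++) {
            lambda[i] /= total;
        }
        return lambda;
    }

    /** z ~ Multinomial(lambda) for a single draw: index i with probability lambda[i]. */
    static int multinomial(double[] lambda, Random rng) {
        double u = rng.nextDouble();
        double cumulative = 0.0;
        for (int i = 0; i < lambda.length; i++) {
            cumulative += lambda[i];
            if (u < cumulative) {
                return i;
            }
        }
        return lambda.length - 1; // guard against floating-point rounding
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        int k = 3;
        double[] theta = new double[k];
        for (int i = 0; i < k; i++) {
            theta[i] = 10.0 * rng.nextGaussian();       // theta_i ~ prior()
        }
        double[] lambda = dirichlet(k, 1.0, rng);       // lambda ~ Dirichlet(alpha_0)
        for (int j = 0; j < 5; j++) {
            int z = multinomial(lambda, rng);           // z_j ~ Multinomial(lambda)
            double x = theta[z] + rng.nextGaussian();   // x_j ~ model(theta_{z_j})
            System.out.println("z=" + z + " x=" + x);
        }
    }
}
```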
 


Constructor Summary
DirichletClusterer(java.util.List<O> sampleData, ModelDistribution<O> modelFactory, double alpha_0, int numClusters, int thin, int burnin)
          Create a new instance on the sample data with the given additional parameters
 
Method Summary
 java.util.List<Model<O>[]> cluster(int numIterations)
          Iterate over the sample data, obtaining cluster samples periodically and returning them.
static java.util.List<Model<Vector>[]> clusterPoints(java.util.List<Vector> points, ModelDistribution<Vector> modelFactory, double alpha_0, int numClusters, int thin, int burnin, int numIterations)
          Cluster the given points with the given parameters and return the sampled models.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

DirichletClusterer

public DirichletClusterer(java.util.List<O> sampleData,
                          ModelDistribution<O> modelFactory,
                          double alpha_0,
                          int numClusters,
                          int thin,
                          int burnin)
Create a new instance on the sample data with the given additional parameters

Parameters:
sampleData - the observed data to be clustered
modelFactory - the ModelDistribution to use
alpha_0 - the double concentration parameter \alpha_0 of the Dirichlet prior on the mixture probabilities
numClusters - the int number of clusters
thin - the int thinning interval, used to report every n iterations
burnin - the int burnin interval, used to suppress early iterations

Method Detail

cluster

public java.util.List<Model<O>[]> cluster(int numIterations)
Iterate over the sample data, obtaining cluster samples periodically and returning them.

Parameters:
numIterations - the int number of iterations to perform
Returns:
a List<Model<O>[]> of the observed models
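How thin and burnin might interact with numIterations can be sketched as follows. This page does not state the exact rule, so the condition used here (keep iteration i when i >= burnin and i is a multiple of thin) is an assumption, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of the Mahout API.
class ThinningSketch {
    /** Indices of the iterations whose models would be kept as samples. */
    static List<Integer> retainedIterations(int numIterations, int thin, int burnin) {
        if (thin <= 0) {
            throw new IllegalArgumentException("thin must be positive");
        }
        List<Integer> kept = new ArrayList<Integer>();
        for (int i = 0; i < numIterations; i++) {
            // Assumed rule: skip the burn-in period, then keep every thin-th iteration.
            if (i >= burnin && i % thin == 0) {
                kept.add(i);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(retainedIterations(10, 2, 3));
    }
}
```

Under this assumed rule, the length of the returned list depends on all three values, so a large burnin or thin relative to numIterations can leave few samples or none.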

clusterPoints

public static java.util.List<Model<Vector>[]> clusterPoints(java.util.List<Vector> points,
                                                            ModelDistribution<Vector> modelFactory,
                                                            double alpha_0,
                                                            int numClusters,
                                                            int thin,
                                                            int burnin,
                                                            int numIterations)
Cluster the given points with the given parameters and return the sampled models.

Parameters:
points - the observed data to be clustered
modelFactory - the ModelDistribution to use
alpha_0 - the double concentration parameter \alpha_0 of the Dirichlet prior on the mixture probabilities
numClusters - the int number of clusters
thin - the int thinning interval, used to report every n iterations
burnin - the int burnin interval, used to suppress early iterations
numIterations - number of iterations to be performed


Copyright © 2008-2010 The Apache Software Foundation. All Rights Reserved.