org.apache.mahout.clustering.dirichlet
Class DirichletClusterer

java.lang.Object
  extended by org.apache.mahout.clustering.dirichlet.DirichletClusterer

public class DirichletClusterer
extends Object

Performs Bayesian mixture modeling.

The idea is to explain some observed data with a probabilistic mixture of a number of models. Each observed data point is assumed to have come from one of the models in the mixture, but we don't know which. We deal with that by using a so-called latent parameter which specifies which model each data point came from.

In addition, since this is a Bayesian clustering algorithm, we don't want to actually commit to any single explanation, but rather to sample from the distribution of models and latent assignments of data points to models given the observed data and the prior distributions of model parameters.

This sampling process is initialized by taking models at random from the prior distribution for models.

Then, we iteratively assign points to the different models using the mixture probabilities and the degree of fit between the point and each model expressed as a probability that the point was generated by that model.
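The assignment step can be sketched in isolation. The class and method names below are hypothetical and are not part of Mahout; the sketch only assumes that we already have the current mixture probabilities and, for one point, the probability of that point under each model.

```java
import java.util.Arrays;

/** Hypothetical helper, not part of Mahout: one assignment step for a single point. */
final class AssignmentSketch {

  /** Returns mixture[k] * likelihood[k], normalized so that the entries sum to 1. */
  static double[] assignmentProbabilities(double[] mixture, double[] likelihood) {
    double[] p = new double[mixture.length];
    double total = 0.0;
    for (int k = 0; k < p.length; k++) {
      p[k] = mixture[k] * likelihood[k];
      total += p[k];
    }
    if (total <= 0.0) {
      // No model explains the point at all: fall back to a uniform choice.
      Arrays.fill(p, 1.0 / p.length);
      return p;
    }
    for (int k = 0; k < p.length; k++) {
      p[k] /= total;
    }
    return p;
  }

  /** Picks a model index from the discrete distribution p, given a uniform draw u in [0, 1). */
  static int sampleIndex(double[] p, double u) {
    double cumulative = 0.0;
    for (int k = 0; k < p.length; k++) {
      cumulative += p[k];
      if (u < cumulative) {
        return k;
      }
    }
    return p.length - 1; // guards against rounding error in the cumulative sum
  }
}
```

In a sampler the uniform draw would come from a random number generator, for example `sampleIndex(p, new java.util.Random().nextDouble())`. It is passed in explicitly here so that the behaviour is reproducible.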

After points are assigned, new parameters for each model are sampled from the posterior distribution for the model parameters considering all of the observed data points that were assigned to the model. Models without any data points are also sampled, but since they have no points assigned, the new samples are effectively taken from the prior distribution for model parameters.
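To see why a model with no assigned points is effectively resampled from the prior, consider a deliberately simple case that is an assumption of this sketch, not something Mahout prescribes: a one-dimensional Normal model with known noise variance and a Normal prior on its mean. All names below are hypothetical.

```java
import java.util.Random;

/** Hypothetical illustration, not Mahout code: conjugate update for a Normal mean. */
final class PosteriorSketch {

  /**
   * Returns {posteriorMean, posteriorVariance} for the mean of a Normal model
   * with known noise variance, given a Normal(priorMean, priorVar) prior and
   * the points currently assigned to the model.
   */
  static double[] posterior(double priorMean, double priorVar, double noiseVar, double[] assigned) {
    double sum = 0.0;
    for (double x : assigned) {
      sum += x;
    }
    // Precisions (inverse variances) add: one term from the prior, one per observation.
    double precision = 1.0 / priorVar + assigned.length / noiseVar;
    double mean = (priorMean / priorVar + sum / noiseVar) / precision;
    return new double[] {mean, 1.0 / precision};
  }

  /** Draws a new value for the model mean from the posterior returned above. */
  static double sampleMean(double[] posterior, Random rng) {
    return posterior[0] + Math.sqrt(posterior[1]) * rng.nextGaussian();
  }
}
```

With an empty `assigned` array the data terms vanish and the returned mean and variance are exactly the prior's, which is the behaviour described above.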

The result is a number of samples that represent mixing probabilities, models and assignment of points to models. If the total number of possible models is substantially larger than the number that ever have points assigned to them, then this algorithm provides a (nearly) non-parametric clustering algorithm.

These samples can give us interesting information that is lacking from a normal clustering that consists of a single assignment of points to clusters. Firstly, by examining the number of models in each sample that actually have any points assigned to them, we can get information about how many models (clusters) the data support.
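A minimal sketch of that count follows. It assumes, purely for illustration, that each sample has been reduced to an int array holding the model index assigned to each point; this representation and the names below are hypothetical and are not what cluster(int) returns.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper, not part of Mahout: how many models are occupied in each sample. */
final class OccupancySketch {

  /** Number of distinct model indices that have at least one point assigned. */
  static int occupiedModels(int[] assignments) {
    Set<Integer> seen = new HashSet<>();
    for (int z : assignments) {
      seen.add(z);
    }
    return seen.size();
  }

  /** histogram[c] is the number of samples in which exactly c models were occupied. */
  static int[] occupancyHistogram(int[][] samples, int numModels) {
    int[] histogram = new int[numModels + 1];
    for (int[] sample : samples) {
      histogram[occupiedModels(sample)]++;
    }
    return histogram;
  }
}
```

A histogram concentrated on a few values of c suggests that the data support about that many clusters.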

Moreover, by examining how often two points are assigned to the same model, we can get an approximate measure of how likely these points are to be explained by the same model. Such soft membership information is difficult to come by with conventional clustering methods.
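The co-assignment frequency can be sketched as below, again assuming the hypothetical representation in which each sample is an int array of per-point model indices. The names are illustrative only.

```java
/** Hypothetical helper, not part of Mahout: soft co-membership of two points. */
final class CoAssignmentSketch {

  /** Fraction of samples in which points a and b were assigned to the same model. */
  static double coAssignment(int[][] samples, int a, int b) {
    if (samples.length == 0) {
      throw new IllegalArgumentException("at least one sample is required");
    }
    int together = 0;
    for (int[] sample : samples) {
      if (sample[a] == sample[b]) {
        together++;
      }
    }
    return (double) together / samples.length;
  }
}
```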

Finally, we can get an idea of how stable the description of the data is. Typically, aspects of the data with lots of data available wind up with stable descriptions, while at the edges there are phenomena that we can't really commit to a solid description of, even though it is still clear that the well-supported explanations are insufficient to explain these additional aspects.

One thing that can be difficult about these samples is that we can't always establish a correspondence between the models in the different samples. Probably the best way to do this is to look for overlap in the assignments of data observations to the different models.
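One possible way to measure that overlap is the Jaccard index between the sets of points assigned to two models. This is a choice made for this sketch and is not something the class provides. Each sample is again assumed to be an int array of per-point model indices, and all names are hypothetical.

```java
/** Hypothetical helper, not part of Mahout: matching models between two samples. */
final class ModelMatchingSketch {

  /** Jaccard overlap between the points of modelA in sample a and of modelB in sample b. */
  static double jaccard(int[] a, int modelA, int[] b, int modelB) {
    int both = 0;
    int either = 0;
    for (int p = 0; p < a.length; p++) {
      boolean inA = a[p] == modelA;
      boolean inB = b[p] == modelB;
      if (inA && inB) {
        both++;
      }
      if (inA || inB) {
        either++;
      }
    }
    return either == 0 ? 0.0 : (double) both / either;
  }

  /** For each model in sample a, the index of the best-overlapping model in sample b, or -1. */
  static int[] bestMatches(int[] a, int[] b, int numModels) {
    int[] match = new int[numModels];
    for (int i = 0; i < numModels; i++) {
      int best = -1;
      double bestScore = 0.0;
      for (int j = 0; j < numModels; j++) {
        double score = jaccard(a, i, b, j);
        if (score > bestScore) {
          bestScore = score;
          best = j;
        }
      }
      match[i] = best;
    }
    return match;
  }
}
```

The matching is greedy and not necessarily one-to-one: two models in the first sample may both pick the same model in the second.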

    \theta_i ~ prior()
    \lambda ~ Dirichlet(\alpha_0)
    z_j ~ Multinomial(\lambda)
    x_j ~ model(\theta_{z_j})
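The generative process above can be simulated directly. The sketch below makes several assumptions that are not stated in this documentation: the Dirichlet is symmetric with the same concentration for every component, the prior on each model parameter is Normal(0, 10^2), and each model is a one-dimensional Normal with unit variance centred on its parameter. Gamma variates are drawn with the Marsaglia-Tsang method because the JDK has no built-in Gamma sampler. All names are hypothetical.

```java
import java.util.Random;

/** Hypothetical simulation of the generative model; not Mahout code. */
final class GenerativeSketch {

  /** Gamma(shape, 1) draw using the Marsaglia-Tsang method, with the usual boost for shape < 1. */
  static double sampleGamma(double shape, Random rng) {
    if (shape < 1.0) {
      double u = rng.nextDouble();
      return sampleGamma(shape + 1.0, rng) * Math.pow(u, 1.0 / shape);
    }
    double d = shape - 1.0 / 3.0;
    double c = 1.0 / Math.sqrt(9.0 * d);
    while (true) {
      double x = rng.nextGaussian();
      double v = 1.0 + c * x;
      if (v <= 0.0) {
        continue;
      }
      v = v * v * v;
      double u = rng.nextDouble();
      if (Math.log(u) < 0.5 * x * x + d - d * v + d * Math.log(v)) {
        return d * v;
      }
    }
  }

  /** lambda ~ symmetric Dirichlet: normalized independent Gamma draws. */
  static double[] sampleDirichlet(int k, double concentration, Random rng) {
    double[] lambda = new double[k];
    double total = 0.0;
    for (int i = 0; i < k; i++) {
      lambda[i] = sampleGamma(concentration, rng);
      total += lambda[i];
    }
    for (int i = 0; i < k; i++) {
      lambda[i] /= total;
    }
    return lambda;
  }

  /** z ~ Multinomial(lambda) for a single draw, returned as an index. */
  static int sampleCategorical(double[] p, Random rng) {
    double u = rng.nextDouble();
    double cumulative = 0.0;
    for (int i = 0; i < p.length; i++) {
      cumulative += p[i];
      if (u < cumulative) {
        return i;
      }
    }
    return p.length - 1;
  }

  /** Generates n observations from a k-component mixture. */
  static double[] generate(int n, int k, double concentration, Random rng) {
    double[] theta = new double[k];
    for (int i = 0; i < k; i++) {
      theta[i] = 10.0 * rng.nextGaussian(); // theta_i ~ prior(), assumed Normal(0, 10^2)
    }
    double[] lambda = sampleDirichlet(k, concentration, rng);
    double[] x = new double[n];
    for (int j = 0; j < n; j++) {
      int z = sampleCategorical(lambda, rng); // latent assignment z_j
      x[j] = theta[z] + rng.nextGaussian();   // x_j ~ model(theta_{z_j}), assumed unit variance
    }
    return x;
  }
}
```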
 


Constructor Summary
DirichletClusterer(boolean emitMostLikely, double threshold)
          This constructor is only used by DirichletClusterMapper for setting up clustering params
DirichletClusterer(DirichletState state)
          This constructor is used by DirichletMapper and DirichletReducer for setting up their clusterer
DirichletClusterer(List<VectorWritable> sampleData, ModelDistribution<VectorWritable> modelFactory, double alpha0, int numClusters, int thin, int burnin)
          Create a new instance on the sample data with the given additional parameters
 
Method Summary
protected  int assignToModel(VectorWritable observation)
          Assign the observation to one of the models based upon probabilities
 List<Cluster[]> cluster(int numIterations)
          Iterate over the sample data, obtaining cluster samples periodically and returning them.
static List<Cluster[]> clusterPoints(List<VectorWritable> points, ModelDistribution<VectorWritable> modelFactory, double alpha0, int numClusters, int thin, int burnin, int numIterations)
          Create a new instance on the sample data with the given additional parameters and run it for numIterations iterations
 void emitPointToClusters(VectorWritable vector, List<DirichletCluster> clusters, org.apache.hadoop.mapreduce.Mapper.Context context)
          Emit the point to one or more clusters depending upon clusterer state
 void emitPointToClusters(VectorWritable vector, List<DirichletCluster> clusters, org.apache.hadoop.io.SequenceFile.Writer writer)
          Emit the point to one or more clusters depending upon clusterer state
protected  void observe(Model<VectorWritable>[] newModels, VectorWritable observation)
           
protected  Model<VectorWritable>[] samplePosteriorModels()
           
protected  DirichletCluster updateCluster(Cluster model, int k)
           
protected  void updateModels(Cluster[] newModels)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

DirichletClusterer

public DirichletClusterer(List<VectorWritable> sampleData,
                          ModelDistribution<VectorWritable> modelFactory,
                          double alpha0,
                          int numClusters,
                          int thin,
                          int burnin)
Create a new instance on the sample data with the given additional parameters

Parameters:
sampleData - the observed data to be clustered
modelFactory - the ModelDistribution to use
alpha0 - the double value for the beta distributions
numClusters - the int number of clusters
thin - the int thinning interval, used to report every n iterations
burnin - the int burnin interval, used to suppress early iterations
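
The interaction of thin and burnin can be sketched as follows. The exact condition is an assumption of this sketch (zero-based iterations, a sample kept when the iteration index is at least burnin and divisible by thin) and may differ from what the class actually does; the names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of thinning and burn-in; not Mahout code. */
final class ThinningSketch {

  /** Zero-based iteration indices whose samples would be reported. */
  static List<Integer> retainedIterations(int numIterations, int thin, int burnin) {
    List<Integer> kept = new ArrayList<>();
    for (int i = 0; i < numIterations; i++) {
      // Assumed rule: skip the first burnin iterations, then keep every thin-th one.
      if (i >= burnin && i % thin == 0) {
        kept.add(i);
      }
    }
    return kept;
  }
}
```

Under this rule, 10 iterations with thin = 3 and burnin = 4 would report the samples from iterations 6 and 9 only.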

DirichletClusterer

public DirichletClusterer(boolean emitMostLikely,
                          double threshold)
This constructor is only used by DirichletClusterMapper for setting up clustering params

Parameters:
emitMostLikely -
threshold -

DirichletClusterer

public DirichletClusterer(DirichletState state)
This constructor is used by DirichletMapper and DirichletReducer for setting up their clusterer

Parameters:
state -
Method Detail

clusterPoints

public static List<Cluster[]> clusterPoints(List<VectorWritable> points,
                                            ModelDistribution<VectorWritable> modelFactory,
                                            double alpha0,
                                            int numClusters,
                                            int thin,
                                            int burnin,
                                            int numIterations)
Create a new instance on the sample data with the given additional parameters and run it for numIterations iterations

Parameters:
points - the observed data to be clustered
modelFactory - the ModelDistribution to use
alpha0 - the double value for the beta distributions
numClusters - the int number of clusters
thin - the int thinning interval, used to report every n iterations
burnin - the int burnin interval, used to suppress early iterations
numIterations - number of iterations to be performed

cluster

public List<Cluster[]> cluster(int numIterations)
Iterate over the sample data, obtaining cluster samples periodically and returning them.

Parameters:
numIterations - the int number of iterations to perform
Returns:
a List<Cluster[]> of the observed models

observe

protected void observe(Model<VectorWritable>[] newModels,
                       VectorWritable observation)
Parameters:
newModels -
observation -

assignToModel

protected int assignToModel(VectorWritable observation)
Assign the observation to one of the models based upon probabilities

Parameters:
observation -
Returns:
the assigned model's index

updateModels

protected void updateModels(Cluster[] newModels)

samplePosteriorModels

protected Model<VectorWritable>[] samplePosteriorModels()

updateCluster

protected DirichletCluster updateCluster(Cluster model,
                                         int k)

emitPointToClusters

public void emitPointToClusters(VectorWritable vector,
                                List<DirichletCluster> clusters,
                                org.apache.hadoop.mapreduce.Mapper.Context context)
                         throws IOException,
                                InterruptedException
Emit the point to one or more clusters depending upon clusterer state

Parameters:
vector - a VectorWritable holding the Vector
clusters - a List of DirichletClusters
context - a Mapper.Context to emit to
Throws:
IOException
InterruptedException

emitPointToClusters

public void emitPointToClusters(VectorWritable vector,
                                List<DirichletCluster> clusters,
                                org.apache.hadoop.io.SequenceFile.Writer writer)
                         throws IOException
Emit the point to one or more clusters depending upon clusterer state

Parameters:
vector - a VectorWritable holding the Vector
clusters - a List of DirichletClusters
writer - a SequenceFile.Writer to emit to
Throws:
IOException
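
How the clusterer state might select the clusters to emit to can be sketched as below. The rule is an assumption based only on the constructor parameters emitMostLikely and threshold: either the single most likely cluster is chosen, or every cluster whose normalized probability exceeds the threshold. The names are hypothetical, and the real method writes to a Mapper.Context or SequenceFile.Writer rather than returning indices.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of the emit decision; not Mahout code. */
final class EmitSketch {

  /** Indices of the clusters a point would be emitted to, given its pdf under each cluster. */
  static List<Integer> clustersToEmit(double[] pdf, boolean emitMostLikely, double threshold) {
    List<Integer> out = new ArrayList<>();
    if (emitMostLikely) {
      int best = 0;
      for (int k = 1; k < pdf.length; k++) {
        if (pdf[k] > pdf[best]) {
          best = k;
        }
      }
      out.add(best);
      return out;
    }
    double total = 0.0;
    for (double d : pdf) {
      total += d;
    }
    for (int k = 0; k < pdf.length; k++) {
      // Assumed rule: compare the normalized probability against the threshold.
      if (total > 0.0 && pdf[k] / total > threshold) {
        out.add(k);
      }
    }
    return out;
  }
}
```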


Copyright © 2008-2012 The Apache Software Foundation. All Rights Reserved.