org.apache.mahout.clustering.kmeans
Class KMeansClusterer

java.lang.Object
  extended by org.apache.mahout.clustering.kmeans.KMeansClusterer

public class KMeansClusterer
extends Object

Implements the k-means clustering algorithm, using Cluster as the cluster representation. The class can be used as part of a clustering job that is started as a map/reduce job.


Constructor Summary
KMeansClusterer(DistanceMeasure measure)
          Initializes the k-means clusterer with the distance measure to use for comparison.
 
Method Summary
protected  void addPointToNearestCluster(Vector point, Iterable<Cluster> clusters)
          Sequential implementation that adds a point to the nearest cluster.
static List<List<Cluster>> clusterPoints(Iterable<Vector> points, List<Cluster> clusters, DistanceMeasure measure, int maxIter, double distanceThreshold)
          This is the reference k-means implementation.
 boolean computeConvergence(Cluster cluster, double distanceThreshold)
           
 void emitPointToNearestCluster(Vector point, Iterable<Cluster> clusters, org.apache.hadoop.mapreduce.Mapper.Context context)
          Iterates over all clusters and identifies the one closest to the given point.
protected  void emitPointToNearestCluster(Vector point, Iterable<Cluster> clusters, org.apache.hadoop.io.SequenceFile.Writer writer)
          Iterates over all clusters and identifies the one closest to the given point.
 void outputPointWithClusterInfo(Vector vector, Iterable<Cluster> clusters, org.apache.hadoop.mapreduce.Mapper.Context context)
           
protected static boolean runKMeansIteration(Iterable<Vector> points, Iterable<Cluster> clusters, DistanceMeasure measure, double distanceThreshold)
          Performs a single iteration over the points and clusters, assigning points to clusters and returning whether the iterations are complete.
protected  boolean testConvergence(Iterable<Cluster> clusters, double distanceThreshold)
          Sequential implementation to test convergence and update cluster centers
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

KMeansClusterer

public KMeansClusterer(DistanceMeasure measure)
Initializes the k-means clusterer with the distance measure to use for comparison.

Parameters:
measure - The distance measure to use for comparing clusters against points.
Method Detail

emitPointToNearestCluster

public void emitPointToNearestCluster(Vector point,
                                      Iterable<Cluster> clusters,
                                      org.apache.hadoop.mapreduce.Mapper.Context context)
                               throws IOException,
                                      InterruptedException
Iterates over all clusters and identifies the one closest to the given point. The distance measure used is the one configured at construction time.

Parameters:
point - a point to find a cluster for.
clusters - the clusters to test.
context - the Mapper.Context to which the point is emitted.
Throws:
IOException
InterruptedException
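
The nearest-cluster search described above can be sketched without Hadoop or Mahout types. This is a minimal illustration, not the Mahout implementation: the class NearestClusterSketch, its methods, and the use of squared Euclidean distance in place of a configurable DistanceMeasure are all assumptions made for the example.

```java
import java.util.List;

/** Hypothetical stand-in for illustration; not part of the Mahout API. */
final class NearestClusterSketch {

    /** Squared Euclidean distance, assumed here in place of a configurable DistanceMeasure. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Returns the index of the center closest to the point, or -1 if there are no centers. */
    static int nearestIndex(double[] point, List<double[]> centers) {
        int best = -1;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int i = 0; i < centers.size(); i++) {
            double distance = squaredDistance(point, centers.get(i));
            // Strict comparison: on a tie the earlier center wins.
            if (distance < bestDistance) {
                bestDistance = distance;
                best = i;
            }
        }
        return best;
    }
}
```

In a map/reduce setting the index (or cluster identifier) found this way would become the key under which the point is emitted.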

addPointToNearestCluster

protected void addPointToNearestCluster(Vector point,
                                        Iterable<Cluster> clusters)
Sequential implementation that adds a point to the nearest cluster.

Parameters:
point - the point to add.
clusters - the candidate clusters.

testConvergence

protected boolean testConvergence(Iterable<Cluster> clusters,
                                  double distanceThreshold)
Sequential implementation to test convergence and update cluster centers
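
One plausible reading of "test convergence and update cluster centers" is sketched below. It assumes, without confirmation from this page, that a cluster converges when its center moves by no more than distanceThreshold after being reset to the mean of its assigned points. ConvergenceSketch and SketchCluster are hypothetical names, and Euclidean distance is assumed.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for illustration; not part of the Mahout API. */
final class ConvergenceSketch {

    /** Assumed cluster state: the current center plus running totals of assigned points. */
    static final class SketchCluster {
        double[] center;
        final double[] sum;
        int count;

        SketchCluster(double[] center) {
            this.center = center.clone();
            this.sum = new double[center.length];
        }

        /** Records a point as belonging to this cluster. */
        void observe(double[] point) {
            for (int i = 0; i < sum.length; i++) {
                sum[i] += point[i];
            }
            count++;
        }

        /** Moves the center to the mean of the observed points; true if it moved at most threshold. */
        boolean updateAndCheck(double threshold) {
            if (count == 0) {
                return true; // nothing assigned, so the center stays where it is
            }
            double[] centroid = new double[sum.length];
            double squared = 0.0;
            for (int i = 0; i < sum.length; i++) {
                centroid[i] = sum[i] / count;
                double d = centroid[i] - center[i];
                squared += d * d;
            }
            center = centroid;
            Arrays.fill(sum, 0.0); // reset totals for the next iteration
            count = 0;
            return Math.sqrt(squared) <= threshold;
        }
    }

    /** Updates every cluster and reports whether all of them converged. */
    static boolean testConvergence(List<SketchCluster> clusters, double threshold) {
        boolean converged = true;
        for (SketchCluster cluster : clusters) {
            // "&=" does not short-circuit, so every center is updated even after one fails.
            converged &= cluster.updateAndCheck(threshold);
        }
        return converged;
    }
}
```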


outputPointWithClusterInfo

public void outputPointWithClusterInfo(Vector vector,
                                       Iterable<Cluster> clusters,
                                       org.apache.hadoop.mapreduce.Mapper.Context context)
                                throws IOException,
                                       InterruptedException
Throws:
IOException
InterruptedException

emitPointToNearestCluster

protected void emitPointToNearestCluster(Vector point,
                                         Iterable<Cluster> clusters,
                                         org.apache.hadoop.io.SequenceFile.Writer writer)
                                  throws IOException
Iterates over all clusters and identifies the one closest to the given point. The distance measure used is the one configured at construction time.

Parameters:
point - a point to find a cluster for.
clusters - the clusters to test.
writer - the SequenceFile.Writer to which the point is written.
Throws:
IOException

clusterPoints

public static List<List<Cluster>> clusterPoints(Iterable<Vector> points,
                                                List<Cluster> clusters,
                                                DistanceMeasure measure,
                                                int maxIter,
                                                double distanceThreshold)
This is the reference k-means implementation. Given its inputs, it iterates over the points and clusters until the cluster centers converge or the maximum number of iterations is reached.

Parameters:
points - the input points
clusters - the List of initial clusters
measure - the DistanceMeasure to use
maxIter - the maximum number of iterations
distanceThreshold - the distance threshold used to test for convergence
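
The outer loop of such a reference implementation might look like the sketch below. It assumes that the nested List return type holds one list of clusters per iteration, with the initial clusters first; this page does not state that. ReferenceLoopSketch and its step parameter are hypothetical, and plain double[] centers stand in for Cluster objects.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical stand-in for illustration; not part of the Mahout API. */
final class ReferenceLoopSketch {

    /**
     * Runs step on a fresh copy of the latest centers until it reports convergence
     * or maxIter passes have been made. Returns the centers after each pass,
     * with the initial centers at index 0.
     */
    static List<List<double[]>> iterate(List<double[]> initial,
                                        int maxIter,
                                        Predicate<List<double[]>> step) {
        List<List<double[]>> history = new ArrayList<>();
        history.add(initial);
        boolean converged = false;
        for (int iteration = 0; !converged && iteration < maxIter; iteration++) {
            // Copy the previous centers so earlier entries in the history stay unchanged.
            List<double[]> next = new ArrayList<>();
            for (double[] center : history.get(iteration)) {
                next.add(center.clone());
            }
            history.add(next);
            converged = step.test(next); // the step updates next in place
        }
        return history;
    }
}
```

Under this reading the final clusters are the last element of the returned list, and the list length minus one is the number of iterations actually run.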

runKMeansIteration

protected static boolean runKMeansIteration(Iterable<Vector> points,
                                            Iterable<Cluster> clusters,
                                            DistanceMeasure measure,
                                            double distanceThreshold)
Performs a single iteration over the points and clusters, assigning points to clusters and returning whether the iterations are complete.

Parameters:
points - the input points
clusters - the clusters
measure - the DistanceMeasure to use
distanceThreshold - the distance threshold used to test for convergence
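
A single k-means pass, assignment followed by center update, can be sketched on plain arrays as follows. This is an illustrative sketch under assumptions, not the Mahout code: SingleIterationSketch is a hypothetical name, Euclidean distance replaces the DistanceMeasure parameter, and "completed" is taken to mean that no center moved by more than the threshold.

```java
/** Hypothetical stand-in for illustration; not part of the Mahout API. */
final class SingleIterationSketch {

    /**
     * Assigns each point to its nearest center, then moves each center to the mean
     * of its assigned points. Centers are updated in place.
     * Returns true if no center moved by more than threshold.
     */
    static boolean runIteration(double[][] points, double[][] centers, double threshold) {
        int dims = centers[0].length;
        double[][] sums = new double[centers.length][dims];
        int[] counts = new int[centers.length];

        // Assignment step: accumulate each point into its nearest center's totals.
        for (double[] point : points) {
            int best = 0;
            double bestDistance = Double.POSITIVE_INFINITY;
            for (int k = 0; k < centers.length; k++) {
                double distance = 0.0;
                for (int i = 0; i < dims; i++) {
                    double d = point[i] - centers[k][i];
                    distance += d * d;
                }
                if (distance < bestDistance) {
                    bestDistance = distance;
                    best = k;
                }
            }
            for (int i = 0; i < dims; i++) {
                sums[best][i] += point[i];
            }
            counts[best]++;
        }

        // Update step: move each non-empty center to its mean and measure the shift.
        boolean converged = true;
        for (int k = 0; k < centers.length; k++) {
            if (counts[k] == 0) {
                continue; // an empty cluster keeps its center
            }
            double moved = 0.0;
            for (int i = 0; i < dims; i++) {
                double mean = sums[k][i] / counts[k];
                double d = mean - centers[k][i];
                moved += d * d;
                centers[k][i] = mean;
            }
            if (Math.sqrt(moved) > threshold) {
                converged = false;
            }
        }
        return converged;
    }
}
```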

computeConvergence

public boolean computeConvergence(Cluster cluster,
                                  double distanceThreshold)


Copyright © 2008-2012 The Apache Software Foundation. All Rights Reserved.