org.apache.mahout.clustering.kmeans
Class KMeansClusterer

java.lang.Object
  extended by org.apache.mahout.clustering.kmeans.KMeansClusterer

public class KMeansClusterer
extends java.lang.Object

This class implements the k-means clustering algorithm, using Cluster as the cluster representation. It can be used as part of a clustering job that runs as a map/reduce job.


Constructor Summary
KMeansClusterer(DistanceMeasure measure)
          Initializes the k-means clusterer with the distance measure to use for comparison.
 
Method Summary
static java.util.List<java.util.List<Cluster>> clusterPoints(java.util.List<Vector> points, java.util.List<Cluster> clusters, DistanceMeasure measure, int maxIter, double distanceThreshold)
          This is the reference k-means implementation.
 void emitPointToNearestCluster(Vector point, java.util.List<Cluster> clusters, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,KMeansInfo> output)
          Iterates over all clusters and identifies the one closest to the given point.
 void outputPointWithClusterInfo(Vector point, java.util.List<Cluster> clusters, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,org.apache.hadoop.io.Text> output)
           
static boolean runKMeansIteration(java.util.List<Vector> points, java.util.List<Cluster> clusters, DistanceMeasure measure, double distanceThreshold)
          Performs a single iteration over the points and clusters, assigning points to clusters and returning whether the iterations are completed.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

KMeansClusterer

public KMeansClusterer(DistanceMeasure measure)
Initializes the k-means clusterer with the distance measure to use for comparison.

Parameters:
measure - The distance measure to use for comparing clusters against points.

Method Detail

emitPointToNearestCluster

public void emitPointToNearestCluster(Vector point,
                                      java.util.List<Cluster> clusters,
                                      org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,KMeansInfo> output)
                               throws java.io.IOException
Iterates over all clusters and identifies the one closest to the given point. The distance measure used is the one configured when this KMeansClusterer was created.

Parameters:
point - a point to find a cluster for.
clusters - a List of clusters to test.
Throws:
java.io.IOException
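
The nearest-cluster search described above can be sketched in plain Java. This is an illustration only, not Mahout's implementation: the class NearestClusterSketch, its methods, the use of double[] in place of Vector and Cluster, and the choice of Euclidean distance in place of a configurable DistanceMeasure are all assumptions made for the example. It also leaves out the Hadoop OutputCollector step.

```java
import java.util.List;

// Hypothetical helper class; not part of Mahout.
class NearestClusterSketch {

    /** Euclidean distance between two vectors of equal length (assumed measure). */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** Returns the index of the centre closest to the point, or -1 if the list is empty. */
    static int nearest(double[] point, List<double[]> centers) {
        int best = -1;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int i = 0; i < centers.size(); i++) {
            double d = distance(point, centers.get(i));
            if (d < bestDistance) {
                bestDistance = d;
                best = i;
            }
        }
        return best;
    }
}
```

In the map/reduce setting, the index found this way would presumably identify the cluster under which the point is emitted to the output collector.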

outputPointWithClusterInfo

public void outputPointWithClusterInfo(Vector point,
                                       java.util.List<Cluster> clusters,
                                       org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,org.apache.hadoop.io.Text> output)
                                throws java.io.IOException
Throws:
java.io.IOException

clusterPoints

public static java.util.List<java.util.List<Cluster>> clusterPoints(java.util.List<Vector> points,
                                                                    java.util.List<Cluster> clusters,
                                                                    DistanceMeasure measure,
                                                                    int maxIter,
                                                                    double distanceThreshold)
This is the reference k-means implementation. Given its inputs it iterates over the points and clusters until their centers converge or until the maximum number of iterations is exceeded.

Parameters:
points - the input List of points
clusters - the List of initial clusters
measure - the DistanceMeasure to use
maxIter - the maximum number of iterations
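
The stopping rule described above (repeat until convergence or until the iteration limit is hit) can be sketched as a small driver loop. The class IterationLoopSketch and its method are hypothetical names, and the single iteration is passed in as a BooleanSupplier purely to keep the example self-contained. The real method works on points, clusters and a DistanceMeasure, and its exact loop may differ.

```java
import java.util.function.BooleanSupplier;

// Hypothetical driver; not part of Mahout.
class IterationLoopSketch {

    /**
     * Runs singleIteration repeatedly until it reports convergence (returns true)
     * or maxIter iterations have been run. Returns the number of iterations run.
     */
    static int iterateUntilConverged(BooleanSupplier singleIteration, int maxIter) {
        int iterations = 0;
        boolean converged = false;
        while (!converged && iterations < maxIter) {
            converged = singleIteration.getAsBoolean();
            iterations++;
        }
        return iterations;
    }
}
```

With a step that converges on its third call and a limit of 10, the loop stops after three iterations. With a step that never converges, it stops at the limit.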

runKMeansIteration

public static boolean runKMeansIteration(java.util.List<Vector> points,
                                         java.util.List<Cluster> clusters,
                                         DistanceMeasure measure,
                                         double distanceThreshold)
Performs a single iteration over the points and clusters, assigning points to clusters and returning whether the iterations are completed.

Parameters:
points - the List having the input points
clusters - the List of clusters
measure - a DistanceMeasure to use
Returns:
true if the iterations are completed, false otherwise
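
A single k-means iteration of the kind described here can be sketched in plain Java as follows. This is a minimal illustration under stated assumptions, not Mahout's code: the class SingleIterationSketch is hypothetical, double[] stands in for Vector and Cluster, Euclidean distance stands in for the DistanceMeasure, and convergence is assumed to mean that no centre moved farther than distanceThreshold.

```java
import java.util.List;

// Hypothetical helper class; not part of Mahout.
class SingleIterationSketch {

    /** Euclidean distance between two vectors of equal length (assumed measure). */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /**
     * Assigns each point to its nearest centre, replaces each centre with the mean
     * of its assigned points, and returns true if no centre moved farther than
     * distanceThreshold. The centers list is updated in place.
     */
    static boolean runIteration(List<double[]> points, List<double[]> centers,
                                double distanceThreshold) {
        int k = centers.size();
        int dim = centers.get(0).length;
        double[][] sums = new double[k][dim];
        int[] counts = new int[k];

        // Assignment step: accumulate each point into its nearest centre.
        for (double[] p : points) {
            int best = 0;
            for (int c = 1; c < k; c++) {
                if (distance(p, centers.get(c)) < distance(p, centers.get(best))) {
                    best = c;
                }
            }
            counts[best]++;
            for (int d = 0; d < dim; d++) {
                sums[best][d] += p[d];
            }
        }

        // Update step: move centres and check how far each one moved.
        boolean converged = true;
        for (int c = 0; c < k; c++) {
            if (counts[c] == 0) {
                continue; // an empty cluster keeps its current centre
            }
            double[] mean = new double[dim];
            for (int d = 0; d < dim; d++) {
                mean[d] = sums[c][d] / counts[c];
            }
            if (distance(mean, centers.get(c)) > distanceThreshold) {
                converged = false;
            }
            centers.set(c, mean);
        }
        return converged;
    }
}
```

Calling runIteration repeatedly on the same centers list until it returns true mirrors the role this method plays inside clusterPoints.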


Copyright © 2008-2010 The Apache Software Foundation. All Rights Reserved.