org.apache.mahout.clustering.canopy
Class CanopyClusterer

java.lang.Object
  extended by org.apache.mahout.clustering.canopy.CanopyClusterer

public class CanopyClusterer
extends java.lang.Object


Constructor Summary
CanopyClusterer(DistanceMeasure measure, double t1, double t2)
           
CanopyClusterer(org.apache.hadoop.mapred.JobConf job)
           
 
Method Summary
 void addPointToCanopies(Vector point, java.util.List<Canopy> canopies, org.apache.hadoop.mapred.Reporter reporter)
          This is the same algorithm as the reference but inverted to iterate over existing canopies instead of the points.
static java.util.List<Vector> calculateCentroids(java.util.List<Canopy> canopies)
          Iterate through the canopies, adding their centroids to a list
 boolean canopyCovers(Canopy canopy, Vector point)
          Returns true if the point is covered by the canopy
 void config(DistanceMeasure aMeasure, double aT1, double aT2)
          Configure the Canopy for unit tests
 void configure(org.apache.hadoop.mapred.JobConf job)
          Configure the Canopy and its distance measure
static java.util.List<Canopy> createCanopies(java.util.List<Vector> points, DistanceMeasure measure, double t1, double t2)
          Iterate through the points, adding new canopies.
 void emitPointToExistingCanopies(Vector point, java.util.List<Canopy> canopies, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,VectorWritable> collector, org.apache.hadoop.mapred.Reporter reporter)
          This method is used by the CanopyMapper to perform canopy inclusion tests and to emit the point keyed by its covering canopies to the output.
 void emitPointToNewCanopies(Vector point, java.util.List<Canopy> canopies, org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,Vector> collector)
          This method is used by the CanopyMapper to perform canopy inclusion tests and to emit the point and its covering canopies to the output.
static void updateCentroids(java.util.List<Canopy> canopies)
          Iterate through the canopies, resetting their center to their centroids
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

CanopyClusterer

public CanopyClusterer(DistanceMeasure measure,
                       double t1,
                       double t2)

CanopyClusterer

public CanopyClusterer(org.apache.hadoop.mapred.JobConf job)
Method Detail

configure

public void configure(org.apache.hadoop.mapred.JobConf job)
Configure the Canopy and its distance measure

Parameters:
job - the JobConf for this job

config

public void config(DistanceMeasure aMeasure,
                   double aT1,
                   double aT2)
Configure the Canopy for unit tests


addPointToCanopies

public void addPointToCanopies(Vector point,
                               java.util.List<Canopy> canopies,
                               org.apache.hadoop.mapred.Reporter reporter)
This is the same algorithm as the reference, but inverted to iterate over the existing canopies instead of the points. Because of this, it does not need to store the points themselves. Each canopy stores only a total points vector and the number of points, from which a centroid can be computed.

This method is used by the CanopyReducer.

Parameters:
point - the point to be added
canopies - the List<Canopy> to be appended to
reporter - Object to report status to the MR interface
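
The inverted, single-pass form described above can be illustrated with a minimal standalone sketch. It does not use the Mahout classes: `SimpleCanopy`, `euclidean` and the `double[]` point representation are hypothetical stand-ins, and the exact threshold comparisons (strictly less than T1 to join a canopy, strictly less than T2 to suppress a new one) are assumptions rather than documented behavior.

```java
import java.util.ArrayList;
import java.util.List;

class StreamingCanopySketch {

    /** Hypothetical stand-in for Canopy: a fixed center plus running totals. */
    static final class SimpleCanopy {
        final double[] center;
        final double[] pointTotal; // component-wise sum of all member points
        int numPoints;

        SimpleCanopy(double[] foundingPoint) {
            this.center = foundingPoint.clone();
            // Assumption: the founding point counts as the first member.
            this.pointTotal = foundingPoint.clone();
            this.numPoints = 1;
        }

        void addPoint(double[] point) {
            for (int i = 0; i < point.length; i++) {
                pointTotal[i] += point[i];
            }
            numPoints++;
        }

        /** The centroid is recovered from the totals; no points are stored. */
        double[] centroid() {
            double[] c = new double[pointTotal.length];
            for (int i = 0; i < c.length; i++) {
                c[i] = pointTotal[i] / numPoints;
            }
            return c;
        }
    }

    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Visits every existing canopy once for the given point. */
    static void addPointToCanopies(double[] point, List<SimpleCanopy> canopies,
                                   double t1, double t2) {
        boolean stronglyBound = false;
        for (SimpleCanopy canopy : canopies) {
            double dist = euclidean(canopy.center, point);
            if (dist < t1) {
                canopy.addPoint(point); // loosely covered: update the totals
            }
            if (dist < t2) {
                stronglyBound = true;   // close enough that no new canopy is needed
            }
        }
        if (!stronglyBound) {
            canopies.add(new SimpleCanopy(point));
        }
    }
}
```

In this sketch, feeding the points (0,0), (1,0) and (10,10) with T1 = 3 and T2 = 2 yields two canopies: the first absorbs (1,0) into its totals, and the third point founds a canopy of its own.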

emitPointToNewCanopies

public void emitPointToNewCanopies(Vector point,
                                   java.util.List<Canopy> canopies,
                                   org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,Vector> collector)
                            throws java.io.IOException
This method is used by the CanopyMapper to perform canopy inclusion tests and to emit the point and its covering canopies to the output. The CanopyCombiner will then sum the canopy points and produce the centroids.

Parameters:
point - the point to be added
canopies - the List<Canopy> to be appended to
collector - an OutputCollector in which to emit the point
Throws:
java.io.IOException

emitPointToExistingCanopies

public void emitPointToExistingCanopies(Vector point,
                                        java.util.List<Canopy> canopies,
                                        org.apache.hadoop.mapred.OutputCollector<org.apache.hadoop.io.Text,VectorWritable> collector,
                                        org.apache.hadoop.mapred.Reporter reporter)
                                 throws java.io.IOException
This method is used by the CanopyMapper to perform canopy inclusion tests and to emit the point keyed by its covering canopies to the output. If the point is not covered by any canopy (due to canopy centroid clustering), the point is emitted to the closest canopy instead.

Parameters:
point - the point to be added
canopies - the List to be appended
collector - an OutputCollector in which to emit the point
reporter - to report status of the job
Throws:
java.io.IOException
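
The emit-with-fallback behavior described above can be sketched without Hadoop or Mahout on the classpath. Everything here is a hypothetical stand-in: a `Map` from canopy id to point list plays the role of the OutputCollector, canopies are reduced to an id and a center, and "covered" is assumed to mean a Euclidean distance below T1.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class EmitToExistingSketch {

    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /**
     * Emits the point under the id of every canopy that covers it.
     * If no canopy covers it, emits it under the id of the closest canopy.
     *
     * @param centers canopy id -> canopy center (stand-in for List<Canopy>)
     * @param output  canopy id -> emitted points (stand-in for OutputCollector)
     */
    static void emitPoint(double[] point, Map<String, double[]> centers,
                          double t1, Map<String, List<double[]>> output) {
        boolean covered = false;
        String closestId = null;
        double minDist = Double.MAX_VALUE;
        for (Map.Entry<String, double[]> entry : centers.entrySet()) {
            double dist = euclidean(entry.getValue(), point);
            if (dist < t1) {
                covered = true;
                output.computeIfAbsent(entry.getKey(), k -> new ArrayList<>()).add(point);
            }
            if (dist < minDist) {
                minDist = dist;
                closestId = entry.getKey();
            }
        }
        // Fallback: the point lies outside every canopy, so use the nearest one.
        if (!covered && closestId != null) {
            output.computeIfAbsent(closestId, k -> new ArrayList<>()).add(point);
        }
    }
}
```

A point may therefore appear under several keys when canopies overlap, but always appears under at least one key as long as the canopy list is non-empty.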

canopyCovers

public boolean canopyCovers(Canopy canopy,
                            Vector point)
Returns true if the point is covered by the canopy.

Parameters:
canopy - a canopy
point - a point
Returns:
true if the point is covered by the canopy

createCanopies

public static java.util.List<Canopy> createCanopies(java.util.List<Vector> points,
                                                    DistanceMeasure measure,
                                                    double t1,
                                                    double t2)
Iterate through the points, adding new canopies. Return the canopies.

Parameters:
points - a list defining the points to be clustered
measure - a DistanceMeasure to use
t1 - the T1 distance threshold
t2 - the T2 distance threshold
Returns:
the List<Canopy> created
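
As an illustration of iterating through the points and adding new canopies, the following standalone sketch follows the usual canopy clustering procedure. It is not the Mahout implementation: points are plain `double[]` arrays, each canopy is represented as a list of its member points, the distance is assumed to be Euclidean, and T1 is assumed to be larger than T2.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

class ReferenceCanopySketch {

    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Each returned canopy lists its member points; element 0 is its center. */
    static List<List<double[]>> createCanopies(List<double[]> points,
                                               double t1, double t2) {
        // Work on a copy so the caller's list is left untouched.
        List<double[]> remaining = new ArrayList<>(points);
        List<List<double[]>> canopies = new ArrayList<>();
        while (!remaining.isEmpty()) {
            Iterator<double[]> it = remaining.iterator();
            double[] center = it.next();
            it.remove(); // the center founds a new canopy
            List<double[]> canopy = new ArrayList<>();
            canopy.add(center);
            canopies.add(canopy);
            while (it.hasNext()) {
                double[] candidate = it.next();
                double dist = euclidean(center, candidate);
                if (dist < t1) {
                    canopy.add(candidate); // within T1: member of this canopy
                }
                if (dist < t2) {
                    it.remove();           // within T2: cannot found another canopy
                }
            }
        }
        return canopies;
    }
}
```

Points that fall between T2 and T1 of a center join that canopy but stay in the working list, so they can still found or join other canopies. This is what allows canopies to overlap.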

calculateCentroids

public static java.util.List<Vector> calculateCentroids(java.util.List<Canopy> canopies)
Iterate through the canopies, adding their centroids to a list

Parameters:
canopies - a List<Canopy>
Returns:
the List<Vector> of centroids

updateCentroids

public static void updateCentroids(java.util.List<Canopy> canopies)
Iterate through the canopies, resetting their center to their centroids

Parameters:
canopies - a List<Canopy>
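
The difference between calculateCentroids and updateCentroids can be shown with a small standalone sketch. The `SimpleCanopy` type below is a hypothetical stand-in that holds a center, a total points vector and a point count. It assumes the centroid is the total divided by the count, and it is not the Mahout Canopy class.

```java
import java.util.ArrayList;
import java.util.List;

class CentroidSketch {

    /** Hypothetical stand-in for Canopy. */
    static final class SimpleCanopy {
        double[] center;
        final double[] pointTotal;
        final int numPoints;

        SimpleCanopy(double[] center, double[] pointTotal, int numPoints) {
            this.center = center;
            this.pointTotal = pointTotal;
            this.numPoints = numPoints;
        }

        double[] computeCentroid() {
            double[] c = new double[pointTotal.length];
            for (int i = 0; i < c.length; i++) {
                c[i] = pointTotal[i] / numPoints;
            }
            return c;
        }
    }

    /** Collects the centroids into a new list; the canopies are not changed. */
    static List<double[]> calculateCentroids(List<SimpleCanopy> canopies) {
        List<double[]> result = new ArrayList<>();
        for (SimpleCanopy canopy : canopies) {
            result.add(canopy.computeCentroid());
        }
        return result;
    }

    /** Moves each canopy's center to its centroid, in place. */
    static void updateCentroids(List<SimpleCanopy> canopies) {
        for (SimpleCanopy canopy : canopies) {
            canopy.center = canopy.computeCentroid();
        }
    }
}
```

The first helper is a read-only query, while the second mutates the canopies, which matches the void return type in the signature above.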


Copyright © 2008-2010 The Apache Software Foundation. All Rights Reserved.