org.apache.mahout.math.hadoop.similarity
Class RowSimilarityJob

java.lang.Object
  extended by org.apache.hadoop.conf.Configured
      extended by org.apache.mahout.common.AbstractJob
          extended by org.apache.mahout.math.hadoop.similarity.RowSimilarityJob
All Implemented Interfaces:
org.apache.hadoop.conf.Configurable, org.apache.hadoop.util.Tool

public class RowSimilarityJob
extends AbstractJob

Runs a fully distributed computation of the pairwise similarity of the row vectors of a DistributedRowMatrix as a series of MapReduce jobs.

The algorithm is a slight modification of the one described in T. Elsayed et al., "Pairwise document similarity in large collections with MapReduce" (http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf).
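The column-wise idea behind this kind of algorithm can be illustrated on a single machine. The following is a minimal sketch, not Mahout's implementation: the class name PairwiseRowSimilaritySketch, the Cell helper, the "a,b" string keys, and the choice of cosine similarity are all assumptions made for illustration. Its three passes loosely correspond to weighting each row, grouping entries by column, and summing the contributions of co-occurring rows.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Single-machine sketch of column-wise pairwise row similarity. Hypothetical; not Mahout code. */
final class PairwiseRowSimilaritySketch {

    /** One non-zero cell of a column: the row it came from and its value. */
    static final class Cell {
        final int row;
        final double value;
        Cell(int row, double value) { this.row = row; this.value = value; }
    }

    /**
     * Rows are sparse vectors (column index -> value). Returns the cosine similarity
     * of every pair of rows sharing at least one column, keyed "a,b" with a < b.
     */
    static Map<String, Double> cosineSimilarities(List<Map<Integer, Double>> rows) {
        double[] norms = new double[rows.size()];
        Map<Integer, List<Cell>> columns = new HashMap<>();

        // Pass 1: compute a per-row weight (here the Euclidean norm) and regroup cells by column.
        for (int r = 0; r < rows.size(); r++) {
            double sumOfSquares = 0.0;
            for (Map.Entry<Integer, Double> e : rows.get(r).entrySet()) {
                double v = e.getValue();
                if (v == 0.0) continue; // explicit zeros contribute nothing
                sumOfSquares += v * v;
                columns.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(new Cell(r, v));
            }
            norms[r] = Math.sqrt(sumOfSquares);
        }

        // Pass 2: within each column, every pair of rows co-occurs; sum the products per pair.
        Map<String, Double> dotProducts = new HashMap<>();
        for (List<Cell> column : columns.values()) {
            for (int i = 0; i < column.size(); i++) {
                for (int j = i + 1; j < column.size(); j++) {
                    Cell a = column.get(i);
                    Cell b = column.get(j); // a.row < b.row because rows were visited in order
                    dotProducts.merge(a.row + "," + b.row, a.value * b.value, Double::sum);
                }
            }
        }

        // Pass 3: turn each accumulated dot product into a cosine similarity.
        Map<String, Double> similarities = new HashMap<>();
        for (Map.Entry<String, Double> e : dotProducts.entrySet()) {
            String[] pair = e.getKey().split(",");
            double denominator = norms[Integer.parseInt(pair[0])] * norms[Integer.parseInt(pair[1])];
            similarities.put(e.getKey(), e.getValue() / denominator);
        }
        return similarities;
    }

    public static void main(String[] args) {
        List<Map<Integer, Double>> rows = List.of(
            Map.of(0, 1.0, 1, 1.0),
            Map.of(0, 1.0),
            Map.of(2, 5.0));
        System.out.println(cosineSimilarities(rows));
    }
}
```

Rows that share no column never meet in pass 2, so no similarity is computed for them. Skipping those pairs is what makes the column-wise formulation attractive for sparse matrices.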

Command line arguments specific to this class are:

  1. -Dmapred.input.dir=(path): Directory containing a DistributedRowMatrix as a SequenceFile
  2. -Dmapred.output.dir=(path): output path where the computation's output should go (a DistributedRowMatrix stored as a SequenceFile)
  3. --numberOfColumns (integer): the number of columns in the input matrix
  4. --similarityClassname (classname): an implementation of DistributedVectorSimilarity used to compute the similarity
  5. --maxSimilaritiesPerRow (integer): cap the number of similar rows per row to this number (default: 100)
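The effect of a per-row cap such as --maxSimilaritiesPerRow can be sketched with a bounded min-heap. This is an illustration only: the names TopSimilaritiesSketch, Similar, and keepTop are hypothetical, and the assumption that the highest-scoring rows are the ones retained is not stated on this page.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Sketch of capping the similar rows kept for one row. Hypothetical; not Mahout code. */
final class TopSimilaritiesSketch {

    /** A candidate: the index of another row and its similarity to the current row. */
    static final class Similar {
        final int row;
        final double similarity;
        Similar(int row, double similarity) { this.row = row; this.similarity = similarity; }
    }

    /** Keeps at most max candidates with the highest similarity, most similar first. */
    static List<Similar> keepTop(List<Similar> candidates, int max) {
        if (max <= 0) {
            return new ArrayList<>();
        }
        Comparator<Similar> bySimilarity = Comparator.comparingDouble(s -> s.similarity);
        // Min-heap: the weakest retained candidate sits at the head and is evicted first.
        PriorityQueue<Similar> heap = new PriorityQueue<>(bySimilarity);
        for (Similar candidate : candidates) {
            heap.add(candidate);
            if (heap.size() > max) {
                heap.poll(); // drop the currently weakest candidate
            }
        }
        List<Similar> result = new ArrayList<>(heap);
        result.sort(bySimilarity.reversed());
        return result;
    }

    public static void main(String[] args) {
        List<Similar> candidates = new ArrayList<>();
        candidates.add(new Similar(1, 0.2));
        candidates.add(new Similar(2, 0.9));
        candidates.add(new Similar(3, 0.5));
        for (Similar s : keepTop(candidates, 2)) {
            System.out.println(s.row + " " + s.similarity);
        }
    }
}
```

The heap never holds more than max + 1 elements, so memory per row stays bounded regardless of how many candidates are seen.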

General command line options are documented in AbstractJob.

Please consider supplying a --tempDir parameter for this job, as it needs to write some intermediate files.

Note that because of how Hadoop parses arguments, all "-D" arguments must appear before all other arguments.
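The ordering rule can be enforced when building the argument array programmatically. The sketch below uses only the JDK; the class RowSimilarityArgsSketch, its buildArgs method, the example paths, and the class name com.example.MySimilarity are all hypothetical placeholders. With Hadoop and Mahout on the classpath, the resulting array would typically be handed to ToolRunner.run(new RowSimilarityJob(), args), which is not done here so that the example stays self-contained.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch: assemble arguments so that every -D property precedes the job options. Hypothetical helper. */
final class RowSimilarityArgsSketch {

    /**
     * @param hadoopProperties properties emitted as -Dkey=value, always placed first
     * @param jobOptions       options emitted as --name value, placed after all -D arguments
     */
    static String[] buildArgs(Map<String, String> hadoopProperties, Map<String, String> jobOptions) {
        List<String> args = new ArrayList<>();
        for (Map.Entry<String, String> property : hadoopProperties.entrySet()) {
            args.add("-D" + property.getKey() + "=" + property.getValue());
        }
        for (Map.Entry<String, String> option : jobOptions.entrySet()) {
            args.add("--" + option.getKey());
            args.add(option.getValue());
        }
        return args.toArray(new String[0]);
    }

    public static void main(String[] unused) {
        // LinkedHashMap keeps insertion order, so the printed command line is predictable.
        Map<String, String> properties = new LinkedHashMap<>();
        properties.put("mapred.input.dir", "/tmp/matrix");           // hypothetical path
        properties.put("mapred.output.dir", "/tmp/similarities");    // hypothetical path

        Map<String, String> options = new LinkedHashMap<>();
        options.put("numberOfColumns", "1000");
        options.put("similarityClassname", "com.example.MySimilarity"); // placeholder class name
        options.put("maxSimilaritiesPerRow", "50");
        options.put("tempDir", "/tmp/rowsimilarity-temp");           // hypothetical path

        System.out.println(String.join(" ", buildArgs(properties, options)));
    }
}
```

Keeping the two groups in separate maps makes it impossible to interleave a -D property after a job option by accident.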


Nested Class Summary
static class RowSimilarityJob.CooccurrencesMapper
          maps all pairs of weighted entries of a column vector
static class RowSimilarityJob.Counter
           
static class RowSimilarityJob.EntriesToVectorsReducer
          collects all MatrixEntryWritable for each column and creates a VectorWritable
static class RowSimilarityJob.RowWeightMapper
          applies DistributedVectorSimilarity.weight(Vector) to each row of the input matrix
static class RowSimilarityJob.SimilarityReducer
          computes the pairwise similarities
static class RowSimilarityJob.WeightedOccurrencesPerColumnReducer
          collects all WeightedOccurrences for a column and writes them to a WeightedOccurrenceArray
 
Field Summary
static String DISTRIBUTED_SIMILARITY_CLASSNAME
           
static String MAX_SIMILARITIES_PER_ROW
           
static String NUMBER_OF_COLUMNS
           
 
Constructor Summary
RowSimilarityJob()
           
 
Method Summary
static void main(String[] args)
           
 int run(String[] args)
           
 
Methods inherited from class org.apache.mahout.common.AbstractJob
addFlag, addInputOption, addOption, addOption, addOption, addOption, addOutputOption, buildOption, getInputPath, getOption, getOutputPath, hasOption, keyFor, maybePut, parseArguments, parseDirectories, prepareJob, shouldRunNextPhase
 
Methods inherited from class org.apache.hadoop.conf.Configured
getConf, setConf
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf, setConf
 

Field Detail

DISTRIBUTED_SIMILARITY_CLASSNAME

public static final String DISTRIBUTED_SIMILARITY_CLASSNAME

NUMBER_OF_COLUMNS

public static final String NUMBER_OF_COLUMNS

MAX_SIMILARITIES_PER_ROW

public static final String MAX_SIMILARITIES_PER_ROW
Constructor Detail

RowSimilarityJob

public RowSimilarityJob()
Method Detail

main

public static void main(String[] args)
                 throws Exception
Throws:
Exception

run

public int run(String[] args)
        throws IOException,
               ClassNotFoundException,
               InterruptedException
Throws:
IOException
ClassNotFoundException
InterruptedException


Copyright © 2008-2011 The Apache Software Foundation. All Rights Reserved.