org.apache.mahout.math.hadoop.similarity
Class RowSimilarityJob

java.lang.Object
  extended by org.apache.hadoop.conf.Configured
      extended by org.apache.mahout.common.AbstractJob
          extended by org.apache.mahout.math.hadoop.similarity.RowSimilarityJob
All Implemented Interfaces:
org.apache.hadoop.conf.Configurable, org.apache.hadoop.util.Tool

public class RowSimilarityJob
extends AbstractJob

Runs a completely distributed computation of the pairwise similarity of the row vectors of a DistributedRowMatrix as a series of MapReduce jobs.

The algorithm is a slight modification of the one described in T. Elsayed et al., "Pairwise document similarity in large collections with MapReduce" (http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf).
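
The column-wise idea behind this family of algorithms can be illustrated in memory. The following is a minimal sketch, not the job's actual implementation: it assumes cosine similarity as the measure (the real job delegates to a DistributedVectorSimilarity), keeps everything in plain Java maps instead of MapReduce phases, and uses hypothetical names throughout (PairwiseRowSimilaritySketch, Cell, cosineSimilarities). Each row is normalized, the matrix is regrouped by column, and every pair of rows that share a column contributes one partial product to that pair's similarity.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory illustration only; every name here is hypothetical, not Mahout API. */
final class PairwiseRowSimilaritySketch {

  /** One non-zero cell seen from its column: the owning row and the length-normalized value. */
  static final class Cell {
    final int row;
    final double value;

    Cell(int row, double value) {
      this.row = row;
      this.value = value;
    }
  }

  /**
   * @param rows row index -> (column index -> value), a sparse matrix
   * @return "i,j" with i < j -> cosine similarity, only for rows sharing at least one column
   */
  static Map<String, Double> cosineSimilarities(Map<Integer, Map<Integer, Double>> rows) {
    // Normalize each row by its Euclidean length and regroup the cells by column.
    Map<Integer, List<Cell>> columns = new HashMap<>();
    for (Map.Entry<Integer, Map<Integer, Double>> row : rows.entrySet()) {
      double sumOfSquares = 0.0;
      for (double v : row.getValue().values()) {
        sumOfSquares += v * v;
      }
      double norm = Math.sqrt(sumOfSquares);
      if (norm == 0.0) {
        continue; // an all-zero row is similar to nothing
      }
      for (Map.Entry<Integer, Double> cell : row.getValue().entrySet()) {
        columns.computeIfAbsent(cell.getKey(), k -> new ArrayList<>())
            .add(new Cell(row.getKey(), cell.getValue() / norm));
      }
    }

    // For every column, each pair of rows present in it adds one partial product.
    Map<String, Double> similarities = new TreeMap<>();
    for (List<Cell> column : columns.values()) {
      for (int a = 0; a < column.size(); a++) {
        for (int b = a + 1; b < column.size(); b++) {
          Cell x = column.get(a);
          Cell y = column.get(b);
          String pair = Math.min(x.row, y.row) + "," + Math.max(x.row, y.row);
          similarities.merge(pair, x.value * y.value, Double::sum);
        }
      }
    }
    return similarities;
  }
}
```

Rows that never share a column produce no entry at all, which is what makes the column-wise formulation attractive for sparse matrices: only co-occurring pairs are ever touched.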

Command line arguments specific to this class are:

  1. -Dmapred.input.dir=(path): directory containing a DistributedRowMatrix stored as a SequenceFile
  2. -Dmapred.output.dir=(path): path where the computation's output should be written (a DistributedRowMatrix stored as a SequenceFile)
  3. --numberOfColumns (integer): the number of columns in the input matrix
  4. --similarityClassname (classname): an implementation of DistributedVectorSimilarity used to compute the similarity
  5. --maxSimilaritiesPerRow (integer): cap the number of similar rows per row to this number (default: 100)
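
The effect of a per-row cap such as --maxSimilaritiesPerRow can be sketched with a bounded min-heap. This is an illustration under an assumption: it supposes the cap keeps the highest-scoring rows, which this page does not state explicitly. The class and method names (TopSimilaritiesSketch, topSimilarRows) are hypothetical and not part of Mahout.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

/** Hypothetical helper; assumes the cap retains the most similar rows. */
final class TopSimilaritiesSketch {

  /**
   * @param similarities other row index -> similarity to the current row
   * @param maxSimilaritiesPerRow maximum number of rows to keep
   * @return the kept row indices, most similar first
   */
  static List<Integer> topSimilarRows(Map<Integer, Double> similarities, int maxSimilaritiesPerRow) {
    // Min-heap by similarity: the head is always the weakest candidate kept so far.
    PriorityQueue<Map.Entry<Integer, Double>> heap = new PriorityQueue<>(
        Comparator.comparingDouble((Map.Entry<Integer, Double> e) -> e.getValue()));
    for (Map.Entry<Integer, Double> candidate : similarities.entrySet()) {
      heap.add(candidate);
      if (heap.size() > maxSimilaritiesPerRow) {
        heap.poll(); // evict the weakest so the heap never exceeds the cap
      }
    }
    // Draining a min-heap yields ascending order, so reverse it afterwards.
    List<Integer> result = new ArrayList<>();
    while (!heap.isEmpty()) {
      result.add(heap.poll().getKey());
    }
    Collections.reverse(result);
    return result;
  }
}
```

A bounded heap keeps memory proportional to the cap rather than to the number of candidate rows, which matters when a row co-occurs with many others.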

General command line options are documented in AbstractJob.

Please consider supplying a --tempDir parameter for this job, as it needs to write some intermediate files.

Note that because of how Hadoop parses arguments, all "-D" arguments must appear before all other arguments.
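
When the argument array is assembled programmatically, the ordering rule can be enforced with a stable partition. The sketch below is a hypothetical helper (RowSimilarityArgsSketch and orderArguments are not part of Mahout or Hadoop) and assumes every "-D" option is written as a single token of the form -Dkey=value; all other tokens keep their relative order so that each option stays next to its value.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; assumes "-D" options are single tokens like -Dkey=value. */
final class RowSimilarityArgsSketch {

  /** Moves all -D tokens to the front while preserving relative order within each group. */
  static String[] orderArguments(String... args) {
    List<String> genericOptions = new ArrayList<>();
    List<String> jobOptions = new ArrayList<>();
    for (String arg : args) {
      if (arg.startsWith("-D")) {
        genericOptions.add(arg);
      } else {
        jobOptions.add(arg);
      }
    }
    List<String> ordered = new ArrayList<>(genericOptions);
    ordered.addAll(jobOptions);
    return ordered.toArray(new String[0]);
  }

  public static void main(String[] unused) {
    // The paths and values below are placeholders chosen for illustration.
    String[] ordered = orderArguments(
        "--numberOfColumns", "5",
        "-Dmapred.input.dir=/example/input",
        "--tempDir", "/example/tmp",
        "-Dmapred.output.dir=/example/output");
    System.out.println(String.join(" ", ordered));
  }
}
```

Because RowSimilarityJob implements org.apache.hadoop.util.Tool, an array ordered this way would typically be handed to Hadoop's ToolRunner.run(new RowSimilarityJob(), args).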


Nested Class Summary
static class RowSimilarityJob.CooccurrencesMapper
          maps all pairs of weighted entries of a column vector
static class RowSimilarityJob.EntriesToVectorsReducer
          collects all DistributedRowMatrix.MatrixEntryWritable for each column and creates a VectorWritable
static class RowSimilarityJob.RowWeightMapper
          applies DistributedVectorSimilarity.weight(Vector) to each row of the input matrix
static class RowSimilarityJob.SimilarityReducer
          computes the pairwise similarities
static class RowSimilarityJob.WeightedOccurrencesPerColumnReducer
          collects all WeightedOccurrences for a column and writes them to a WeightedOccurrenceArray
 
Field Summary
static java.lang.String DISTRIBUTED_SIMILARITY_CLASSNAME
           
static java.lang.String MAX_SIMILARITIES_PER_ROW
           
static java.lang.String NUMBER_OF_COLUMNS
           
 
Constructor Summary
RowSimilarityJob()
           
 
Method Summary
static void main(java.lang.String[] args)
           
 int run(java.lang.String[] args)
           
 
Methods inherited from class org.apache.mahout.common.AbstractJob
addFlag, addInputOption, addOption, addOption, addOption, addOption, addOutputOption, getInputPath, getOption, getOutputPath, hasOption, keyFor, maybePut, parseArguments, parseDirectories, prepareJob, shouldRunNextPhase
 
Methods inherited from class org.apache.hadoop.conf.Configured
getConf, setConf
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf, setConf
 

Field Detail

DISTRIBUTED_SIMILARITY_CLASSNAME

public static final java.lang.String DISTRIBUTED_SIMILARITY_CLASSNAME

NUMBER_OF_COLUMNS

public static final java.lang.String NUMBER_OF_COLUMNS

MAX_SIMILARITIES_PER_ROW

public static final java.lang.String MAX_SIMILARITIES_PER_ROW
Constructor Detail

RowSimilarityJob

public RowSimilarityJob()
Method Detail

main

public static void main(java.lang.String[] args)
                 throws java.lang.Exception
Throws:
java.lang.Exception

run

public int run(java.lang.String[] args)
        throws java.io.IOException,
               java.lang.ClassNotFoundException,
               java.lang.InterruptedException
Throws:
java.io.IOException
java.lang.ClassNotFoundException
java.lang.InterruptedException


Copyright © 2008-2010 The Apache Software Foundation. All Rights Reserved.