org.apache.mahout.math.hadoop.similarity
Class RowSimilarityJob
java.lang.Object
org.apache.hadoop.conf.Configured
org.apache.mahout.common.AbstractJob
org.apache.mahout.math.hadoop.similarity.RowSimilarityJob
- All Implemented Interfaces:
- org.apache.hadoop.conf.Configurable, org.apache.hadoop.util.Tool
public class RowSimilarityJob
- extends AbstractJob
Runs a completely distributed computation of the pairwise similarity of the row vectors of a DistributedRowMatrix as a series of MapReduce jobs.
The algorithm used is a slight modification of the algorithm described in
T. Elsayed et al: "Pairwise document similarity in large collections with MapReduce"
(http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf)
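The general idea in the cited paper can be illustrated on a single machine: invert the matrix so that each column lists the rows with a nonzero entry in it, then let every column contribute one partial product to each pair of rows it connects. The sketch below is an illustration under assumptions, not the code of this job: it uses a plain dot product as the similarity, keeps everything in memory, and the names PairwiseRowSimilaritySketch, Posting, sparseRow and pairwiseDotProducts are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class PairwiseRowSimilaritySketch {

    /** One nonzero cell as seen from its column: the owning row and the value. */
    static final class Posting {
        final int row;
        final double value;

        Posting(int row, double value) {
            this.row = row;
            this.value = value;
        }
    }

    /** Builds a sparse row (column index -> value) from parallel arrays. */
    static Map<Integer, Double> sparseRow(int[] columns, double[] values) {
        Map<Integer, Double> row = new HashMap<>();
        for (int i = 0; i < columns.length; i++) {
            row.put(columns[i], values[i]);
        }
        return row;
    }

    /**
     * Returns the dot product of every pair of rows that share at least one
     * nonzero column. Keys have the form "i,j" with i < j.
     */
    static Map<String, Double> pairwiseDotProducts(List<Map<Integer, Double>> rows) {
        // Step 1 (the "map" side): group nonzero cells by column.
        Map<Integer, List<Posting>> columns = new TreeMap<>();
        for (int r = 0; r < rows.size(); r++) {
            for (Map.Entry<Integer, Double> cell : rows.get(r).entrySet()) {
                columns.computeIfAbsent(cell.getKey(), k -> new ArrayList<>())
                       .add(new Posting(r, cell.getValue()));
            }
        }
        // Step 2 (the "reduce" side): each column adds one partial product
        // to every pair of rows that co-occur in it.
        Map<String, Double> similarities = new TreeMap<>();
        for (List<Posting> postings : columns.values()) {
            for (int a = 0; a < postings.size(); a++) {
                for (int b = a + 1; b < postings.size(); b++) {
                    Posting x = postings.get(a);
                    Posting y = postings.get(b);
                    // Rows were visited in increasing order, so x.row < y.row.
                    String pair = x.row + "," + y.row;
                    similarities.merge(pair, x.value * y.value, Double::sum);
                }
            }
        }
        return similarities;
    }

    public static void main(String[] args) {
        List<Map<Integer, Double>> rows = new ArrayList<>();
        rows.add(sparseRow(new int[] {0, 1}, new double[] {1.0, 2.0}));
        rows.add(sparseRow(new int[] {1, 2}, new double[] {3.0, 4.0}));
        rows.add(sparseRow(new int[] {0, 2}, new double[] {5.0, 6.0}));
        System.out.println(pairwiseDotProducts(rows));
    }
}
```

Pairs of rows with no column in common never appear in the result, which is what makes this column-wise formulation attractive for sparse matrices.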
Command line arguments specific to this class are:
- -Dmapred.input.dir=(path): directory containing a DistributedRowMatrix stored as a SequenceFile
- -Dmapred.output.dir=(path): path where the job's output should go (a DistributedRowMatrix stored as a SequenceFile)
- --numberOfColumns: the number of columns in the input matrix
- --similarityClassname (classname): an implementation of DistributedVectorSimilarity used to compute the similarity
- --maxSimilaritiesPerRow (integer): cap the number of similar rows kept per row at this number (default: 100)
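The effect of --maxSimilaritiesPerRow can be pictured as a per-row top-N selection. The following is a minimal sketch of one common way to do that with a bounded min-heap; it is an assumption about the general technique, not a description of how this job implements the cap, and the names TopSimilaritiesSketch, SimilarRow and keepTop are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class TopSimilaritiesSketch {

    /** A candidate neighbour of some fixed row, with its similarity score. */
    static final class SimilarRow {
        final int row;
        final double similarity;

        SimilarRow(int row, double similarity) {
            this.row = row;
            this.similarity = similarity;
        }
    }

    /**
     * Keeps at most maxSimilaritiesPerRow candidates with the highest
     * similarity, returned in descending order of similarity.
     */
    static List<SimilarRow> keepTop(List<SimilarRow> candidates, int maxSimilaritiesPerRow) {
        if (maxSimilaritiesPerRow <= 0) {
            return new ArrayList<>();
        }
        Comparator<SimilarRow> bySimilarity = Comparator.comparingDouble(s -> s.similarity);
        // Min-heap: the weakest candidate kept so far sits at the head.
        PriorityQueue<SimilarRow> heap = new PriorityQueue<>(bySimilarity);
        for (SimilarRow candidate : candidates) {
            if (heap.size() < maxSimilaritiesPerRow) {
                heap.add(candidate);
            } else if (candidate.similarity > heap.peek().similarity) {
                heap.poll();          // evict the weakest
                heap.add(candidate);  // admit the stronger candidate
            }
        }
        List<SimilarRow> top = new ArrayList<>(heap);
        top.sort(bySimilarity.reversed());
        return top;
    }

    public static void main(String[] args) {
        List<SimilarRow> candidates = new ArrayList<>();
        candidates.add(new SimilarRow(1, 0.2));
        candidates.add(new SimilarRow(2, 0.9));
        candidates.add(new SimilarRow(3, 0.5));
        candidates.add(new SimilarRow(4, 0.7));
        for (SimilarRow s : keepTop(candidates, 2)) {
            System.out.println(s.row + " " + s.similarity);
        }
    }
}
```

The heap never holds more than the cap, so memory per row stays bounded regardless of how many candidate pairs are seen.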
General command line options are documented in AbstractJob.
Please consider supplying a --tempDir parameter for this job, as it needs to write some intermediate files.
Note that because of how Hadoop parses arguments, all "-D" arguments must appear before all other
arguments.
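When the argument array is assembled programmatically, the ordering rule above can be enforced with a small helper that moves every "-D" argument to the front while preserving the relative order of everything else. This is a sketch under assumptions: the class and method names are hypothetical, the paths and the similarity class name in main are placeholders, and the resulting array would then be handed to the job (for example through main(java.lang.String[] args)).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class JobArgumentsSketch {

    /**
     * Returns a copy of args in which all "-D" arguments come first.
     * Handles both the fused form "-Dkey=value" and the split form
     * "-D key=value"; the order within each group is preserved.
     */
    static String[] genericOptionsFirst(String[] args) {
        List<String> generic = new ArrayList<>();
        List<String> other = new ArrayList<>();
        for (int i = 0; i < args.length; i++) {
            String arg = args[i];
            if (arg.equals("-D") && i + 1 < args.length) {
                generic.add(arg);
                generic.add(args[++i]); // keep the key=value token with its -D
            } else if (arg.startsWith("-D")) {
                generic.add(arg);
            } else {
                other.add(arg);
            }
        }
        generic.addAll(other);
        return generic.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // All values below are placeholders chosen for illustration.
        String[] unordered = {
            "--numberOfColumns", "1000",
            "-Dmapred.input.dir=/path/to/input",
            "--similarityClassname", "com.example.MySimilarity",
            "--tempDir", "/path/to/temp",
            "-Dmapred.output.dir=/path/to/output"
        };
        System.out.println(Arrays.toString(genericOptionsFirst(unordered)));
    }
}
```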
Method Summary
static void main(java.lang.String[] args)
int run(java.lang.String[] args)
Methods inherited from class org.apache.mahout.common.AbstractJob
addFlag, addInputOption, addOption, addOption, addOption, addOption, addOutputOption, getInputPath, getOption, getOutputPath, hasOption, keyFor, maybePut, parseArguments, parseDirectories, prepareJob, shouldRunNextPhase
Methods inherited from class org.apache.hadoop.conf.Configured
getConf, setConf
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf, setConf
DISTRIBUTED_SIMILARITY_CLASSNAME
public static final java.lang.String DISTRIBUTED_SIMILARITY_CLASSNAME
NUMBER_OF_COLUMNS
public static final java.lang.String NUMBER_OF_COLUMNS
MAX_SIMILARITIES_PER_ROW
public static final java.lang.String MAX_SIMILARITIES_PER_ROW
RowSimilarityJob
public RowSimilarityJob()
main
public static void main(java.lang.String[] args)
throws java.lang.Exception
- Throws:
java.lang.Exception
run
public int run(java.lang.String[] args)
throws java.io.IOException,
java.lang.ClassNotFoundException,
java.lang.InterruptedException
- Throws:
java.io.IOException
java.lang.ClassNotFoundException
java.lang.InterruptedException
Copyright © 2008-2010 The Apache Software Foundation. All Rights Reserved.