org.apache.mahout.math.hadoop.stochasticsvd
Class SSVDSolver

java.lang.Object
  extended by org.apache.mahout.math.hadoop.stochasticsvd.SSVDSolver

public class SSVDSolver
extends Object

Stochastic SVD solver (API class).

Implementation details are in my working notes in MAHOUT-376 (https://issues.apache.org/jira/browse/MAHOUT-376).

As of this writing, I don't have benchmarks comparing this method with other methods. However, the non-Hadoop differentiating characteristics of this method are thought to be:

  • "faster": precision is traded off in favor of speed. However, there is a lever in the form of the "oversampling parameter" p. Higher values of p produce better precision at the cost of speed (and a higher minimum RAM requirement). This also means that this method is almost guaranteed to be less precise than Lanczos unless a full-rank SVD decomposition is sought.
  • "more scale": can presumably take on larger problems than the Lanczos one (not confirmed by benchmarks at this time).

    Specifically in regards to this implementation, I think a couple of other differentiating points are:

  • no need to specify the input matrix height or width on the command line; it is whatever it turns out to be.
  • supports any Writable as DRM row keys and copies them to the corresponding rows of the U matrix.
  • can request U, V, Uσ = U·Σ^0.5 or Vσ = V·Σ^0.5, none of which requires a pass over the input A; these jobs are parallel map-only jobs.
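    As an illustration of the Uσ and Vσ outputs above, here is a minimal sketch of scaling each column of a small in-memory matrix by the square root of the matching singular value. It is not the solver's own code (which does this in distributed map-only jobs); the class and method names are hypothetical and the input values are toy data. The half-sigma forms are convenient because Uσ·Vσ' = U·Σ·V', so the two factors together reproduce the low-rank approximation of A.

```java
import java.util.Arrays;

// Illustrative only: HalfSigmaScaling is a hypothetical helper, not part of Mahout.
final class HalfSigmaScaling {

    /** Returns U * diag(sqrt(sigma)), i.e. column j of u scaled by sqrt(sigma[j]). */
    static double[][] timesSqrtSigma(double[][] u, double[] sigma) {
        double[][] scaled = new double[u.length][];
        for (int i = 0; i < u.length; i++) {
            if (u[i].length > sigma.length) {
                throw new IllegalArgumentException("row " + i + " has more columns than sigma");
            }
            scaled[i] = new double[u[i].length];
            for (int j = 0; j < u[i].length; j++) {
                scaled[i][j] = u[i][j] * Math.sqrt(sigma[j]);
            }
        }
        return scaled;
    }

    public static void main(String[] args) {
        double[][] u = {{1.0, 2.0}, {3.0, 4.0}}; // toy values, not an orthonormal U
        double[] sigma = {4.0, 9.0};             // toy singular values
        for (double[] row : timesSqrtSigma(u, sigma)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```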

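    This class is the central public API for the SSVD solver. The usage pattern is as follows: create the solver through the constructor, supplying the required computation parameters; set any optional parameters through the setter methods; call run(); then retrieve results with getSingularValues(), getUPath() and getVPath().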


    Constructor Summary
    SSVDSolver(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.Path[] inputPath, org.apache.hadoop.fs.Path outputPath, int ablockRows, int k, int p, int reduceTasks)
              Creates a new SSVD solver.
     
    Method Summary
     int getAbtBlockHeight()
               
     int getOuterBlockHeight()
               
     int getQ()
               
     double[] getSingularValues()
              Contains the k+p singular values resulting from the solver run.
     String getUPath()
              Returns the U path (if the computation was requested and successful).
     String getVPath()
              Returns the V path (if the computation was requested and successful).
     boolean isBroadcast()
               
     boolean isOverwrite()
               
    static UpperTriangular loadAndSumUpperTriangularMatrices(org.apache.hadoop.fs.FileSystem fs, org.apache.hadoop.fs.Path glob, org.apache.hadoop.conf.Configuration conf)
              Loads multiple upper triangular matrices and sums them up.
    static double[][] loadDistributedRowMatrix(org.apache.hadoop.fs.FileSystem fs, org.apache.hadoop.fs.Path glob, org.apache.hadoop.conf.Configuration conf)
              Helper capability to load distributed row matrices into a dense matrix (mainly to support tests).
    static UpperTriangular loadUpperTriangularMatrix(org.apache.hadoop.fs.FileSystem fs, org.apache.hadoop.fs.Path glob, org.apache.hadoop.conf.Configuration conf)
              Loads exactly one upper triangular matrix and reports an error if more than one is found.
     void run()
              run all SSVD jobs.
     void setAbtBlockHeight(int abtBlockHeight)
              The block height of Y_i during power iterations.
     void setBroadcast(boolean broadcast)
              If this property is true, use the DistributedCache mechanism to broadcast some of the data the jobs need.
     void setComputeU(boolean val)
              Controls whether to compute the U matrix of the low-rank SSVD.
     void setComputeV(boolean val)
              Controls whether to compute the V matrix of the low-rank SSVD.
     void setcUHalfSigma(boolean cUHat)
               
     void setcVHalfSigma(boolean cVHat)
               
     void setMinSplitSize(int size)
              If the requested A blocks become larger than a split, this may be needed to ensure that at least k+p rows of A get into a split.
     void setOuterBlockHeight(int outerBlockHeight)
              The height of outer blocks during Q'A multiplication.
     void setOverwrite(boolean overwrite)
              If true, the driver cleans the output folder first if it exists.
     void setQ(int q)
              Sets q, the number of additional power iterations used to increase precision (0..2).
     
    Methods inherited from class java.lang.Object
    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
     

    Constructor Detail

    SSVDSolver

    public SSVDSolver(org.apache.hadoop.conf.Configuration conf,
                      org.apache.hadoop.fs.Path[] inputPath,
                      org.apache.hadoop.fs.Path outputPath,
                      int ablockRows,
                      int k,
                      int p,
                      int reduceTasks)
    Creates a new SSVD solver. Required parameters are passed to the constructor to ensure they are set. Optional parameters can be set using setters.

    Parameters:
    conf - hadoop configuration
    inputPath - Input path (should be compatible with DistributedRowMatrix as of the time of this writing).
    outputPath - Output path containing U, V and singular values vector files.
    ablockRows - The vertical height of a q-block (bigger values require more memory in mappers and perhaps larger minSplitSize values)
    k - desired rank
    p - SSVD oversampling parameter
    reduceTasks - Number of reduce tasks (where applicable)
    Throws:
    IOException - when an I/O error occurs.
    Method Detail

    setcUHalfSigma

    public void setcUHalfSigma(boolean cUHat)

    setcVHalfSigma

    public void setcVHalfSigma(boolean cVHat)

    getQ

    public int getQ()

    setQ

    public void setQ(int q)
    Sets q, the number of additional power iterations used to increase precision (0..2). Defaults to 0.

    Parameters:
    q - the number of additional power iterations

    setComputeU

    public void setComputeU(boolean val)
    Controls whether to compute the U matrix of the low-rank SSVD.


    setComputeV

    public void setComputeV(boolean val)
    Controls whether to compute the V matrix of the low-rank SSVD.

    Parameters:
    val - true if we want to output V matrix. Default is true.

    setMinSplitSize

    public void setMinSplitSize(int size)
    If the requested A blocks become larger than a split, this setting may be needed to ensure that at least k+p rows of A get into a split. This is a requirement for obtaining orthonormalized Q blocks in SSVD.

    Parameters:
    size - the minimum split size to use
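    To make the k+p rows requirement concrete, the following back-of-envelope sketch estimates a lower bound for the split size. It assumes a hypothetical, user-measured average serialized row size in bytes; that figure and the helper name are assumptions for illustration and are not something this class exposes. Since setMinSplitSize takes an int, the result would also have to fit in an int before being passed on.

```java
// Illustrative only: MinSplitEstimate is a hypothetical helper, not part of Mahout.
final class MinSplitEstimate {

    /**
     * Lower bound on the split size, in bytes, so that k+p rows fit into one split,
     * assuming every row takes about avgRowBytes when serialized (an assumption).
     */
    static long minSplitBytes(int k, int p, long avgRowBytes) {
        if (k < 0 || p < 0 || avgRowBytes < 0) {
            throw new IllegalArgumentException("arguments must be non-negative");
        }
        // multiplyExact throws ArithmeticException instead of silently overflowing.
        return Math.multiplyExact((long) k + p, avgRowBytes);
    }

    public static void main(String[] args) {
        // Hypothetical numbers: rank 50, oversampling 15, roughly 4 KiB per row.
        System.out.println(minSplitBytes(50, 15, 4096L));
    }
}
```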

    getSingularValues

    public double[] getSingularValues()
    Contains the k+p singular values resulting from the solver run.

    Returns:
    singular values (largest to smallest)
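    Because the returned array holds k+p values ordered from largest to smallest, a caller who only wants the k values for the requested rank can keep the leading k entries. A minimal sketch with a hypothetical helper, operating on a plain array rather than on a live solver:

```java
import java.util.Arrays;

// Illustrative only: LeadingSingularValues is a hypothetical helper, not part of Mahout.
final class LeadingSingularValues {

    /** Returns the first k entries of an array sorted largest to smallest. */
    static double[] leading(double[] allValues, int k) {
        if (k < 0 || k > allValues.length) {
            throw new IllegalArgumentException("k must be between 0 and " + allValues.length);
        }
        return Arrays.copyOf(allValues, k);
    }

    public static void main(String[] args) {
        // Toy stand-in for a getSingularValues() result with k = 3 and p = 2.
        double[] kPlusP = {5.0, 3.0, 2.0, 1.0, 0.5};
        System.out.println(Arrays.toString(leading(kPlusP, 3)));
    }
}
```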

    getUPath

    public String getUPath()
    Returns the U path (if the computation was requested and successful).

    Returns:
    U output hdfs path, or null if computation was not completed for whatever reason.

    getVPath

    public String getVPath()
    Returns the V path (if the computation was requested and successful).

    Returns:
    V output hdfs path, or null if computation was not completed for whatever reason.

    isOverwrite

    public boolean isOverwrite()

    setOverwrite

    public void setOverwrite(boolean overwrite)
    If true, the driver cleans the output folder first if it exists.

    Parameters:
    overwrite - true to clean an existing output folder before running

    getOuterBlockHeight

    public int getOuterBlockHeight()

    setOuterBlockHeight

    public void setOuterBlockHeight(int outerBlockHeight)
    The height of outer blocks during Q'A multiplication. Higher values produce fewer keys for combining, shuffle and sort, which somewhat improves running time, but they require larger blocks to be formed in RAM (so setting this too high can lead to OOM).

    Parameters:
    outerBlockHeight - the outer block height

    getAbtBlockHeight

    public int getAbtBlockHeight()

    setAbtBlockHeight

    public void setAbtBlockHeight(int abtBlockHeight)
    The block height of Y_i during power iterations. It is probably important to set it higher than the default of 200,000 for extremely sparse inputs and when more RAM is available. A Y_i block in the ABt job would occupy approximately abtBlockHeight x (k+p) x sizeof(double) bytes (as dense).

    Parameters:
    abtBlockHeight - the Y_i block height
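    The memory estimate above is simple arithmetic. The sketch below works it through for the documented default block height of 200,000 with hypothetical values k = 50 and p = 15; the helper name is made up for illustration, and the figure covers only the dense block itself, not any other task overhead.

```java
// Illustrative only: AbtBlockMemory is a hypothetical helper, not part of Mahout.
final class AbtBlockMemory {

    /** Approximate size in bytes of one dense block: abtBlockHeight x (k+p) x sizeof(double). */
    static long denseBlockBytes(int abtBlockHeight, int k, int p) {
        // Widen to long before multiplying so large heights do not overflow an int.
        return (long) abtBlockHeight * (k + p) * Double.BYTES;
    }

    public static void main(String[] args) {
        long bytes = denseBlockBytes(200_000, 50, 15);
        // 200,000 x 65 x 8 = 104,000,000 bytes, about 99 MiB.
        System.out.println(bytes + " bytes");
    }
}
```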

    isBroadcast

    public boolean isBroadcast()

    setBroadcast

    public void setBroadcast(boolean broadcast)
    If this property is true, use the DistributedCache mechanism to broadcast some of the data the jobs need. May improve efficiency. Default is false.

    Parameters:
    broadcast - true to enable broadcasting through DistributedCache

    run

    public void run()
             throws IOException
    Runs all SSVD jobs.

    Throws:
    IOException - if an I/O error occurs.

    loadDistributedRowMatrix

    public static double[][] loadDistributedRowMatrix(org.apache.hadoop.fs.FileSystem fs,
                                                      org.apache.hadoop.fs.Path glob,
                                                      org.apache.hadoop.conf.Configuration conf)
                                               throws IOException
    Helper capability to load distributed row matrices into a dense matrix (mainly to support tests).

    Parameters:
    fs - filesystem
    glob - FS glob
    conf - configuration
    Returns:
    Dense matrix array
    Throws:
    IOException - when an I/O error occurs.

    loadAndSumUpperTriangularMatrices

    public static UpperTriangular loadAndSumUpperTriangularMatrices(org.apache.hadoop.fs.FileSystem fs,
                                                                    org.apache.hadoop.fs.Path glob,
                                                                    org.apache.hadoop.conf.Configuration conf)
                                                             throws IOException
    Loads multiple upper triangular matrices and sums them up.

    Parameters:
    fs - filesystem
    glob - FS glob
    conf - configuration
    Returns:
    the sum of upper triangular inputs.
    Throws:
    IOException

    loadUpperTriangularMatrix

    public static UpperTriangular loadUpperTriangularMatrix(org.apache.hadoop.fs.FileSystem fs,
                                                            org.apache.hadoop.fs.Path glob,
                                                            org.apache.hadoop.conf.Configuration conf)
                                                     throws IOException
    Loads exactly one upper triangular matrix and reports an error if more than one is found.

    Throws:
    IOException


    Copyright © 2008-2012 The Apache Software Foundation. All Rights Reserved.