org.apache.mahout.df.data
Class Data

java.lang.Object
  extended by org.apache.mahout.df.data.Data
All Implemented Interfaces:
java.lang.Cloneable

public class Data
extends java.lang.Object
implements java.lang.Cloneable

Holds a list of vectors and their corresponding Dataset. Contains various operations that deal with the vectors (subset, count, ...).


Constructor Summary
Data(Dataset dataset, java.util.List<Instance> instances)
           
 
Method Summary
 Data bagging(java.util.Random rng)
          If the data has N cases, samples N cases at random, with replacement.
 Data bagging(java.util.Random rng, boolean[] sampled)
          If the data has N cases, samples N cases at random, with replacement.
 Data clone()
           
 boolean contains(Instance v)
          Returns true if this data contains the specified element.
 void countLabels(int[] counts)
          Counts the number of occurrences of each label value
 boolean equals(java.lang.Object obj)
           
 int[] extractLabels()
          Extracts the labels of all instances
static int[] extractLabels(Dataset dataset, org.apache.hadoop.fs.FileSystem fs, org.apache.hadoop.fs.Path path)
          Extracts the labels of all instances from a data file
 Instance get(int index)
          Returns the element at the specified position
 Dataset getDataset()
           
 int hashCode()
           
 boolean identicalLabel()
          Checks whether all the vectors have identical label values
 int indexof(Instance v)
          Returns the index of the first occurrence of the element in this data
 boolean isEmpty()
          Returns true if this data contains no elements
 boolean isIdentical()
          Checks whether all the vectors have identical attribute values
 int majorityLabel(java.util.Random rng)
          Finds the majority label, breaking ties randomly
 Data rsplit(java.util.Random rng, int subsize)
          Splits the data in two: returns one part, and this object keeps the rest of the data.
 Data rsubset(java.util.Random rng, double ratio)
          Returns a random subset without modifying the current data
 int size()
          Returns the number of elements
 Data subset(Condition condition)
          Returns the subset from this data that matches the given condition
 double[] values(int attr)
          Finds all distinct values of a given attribute
 
Methods inherited from class java.lang.Object
finalize, getClass, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

Data

public Data(Dataset dataset,
            java.util.List<Instance> instances)
Method Detail

size

public int size()
Returns the number of elements

Returns:
the number of elements in this data

isEmpty

public boolean isEmpty()
Returns true if this data contains no elements

Returns:
true if this data contains no elements

contains

public boolean contains(Instance v)
Returns true if this data contains the specified element.

Parameters:
v - element whose presence in this data is to be tested
Returns:

indexof

public int indexof(Instance v)
Returns the index of the first occurrence of the element in this data

Parameters:
v - element to search for
Returns:
the index of the first occurrence of the element, or -1 if the element is not found

get

public Instance get(int index)
Returns the element at the specified position

Parameters:
index - index of element to return
Returns:
the element at the specified position
Throws:
java.lang.IndexOutOfBoundsException - if the index is out of range

subset

public Data subset(Condition condition)
Returns the subset from this data that matches the given condition

Parameters:
condition - condition that the selected elements must match
Returns:
the subset of this data that matches the condition
rsubset

public Data rsubset(java.util.Random rng,
                    double ratio)
Returns a random subset without modifying the current data

Parameters:
rng - Random number generator
ratio - selection ratio, in the range [0,1]
Returns:
the random subset
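
The source does not say whether ratio is an exact fraction of the data or a per-element probability. The following is a minimal sketch of the second reading, written against a plain java.util.List rather than Data and Instance. The class name RandomSubsetSketch and the method randomSubset are hypothetical and are not part of this API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical illustration; not the implementation of Data.rsubset.
final class RandomSubsetSketch {

  /**
   * Keeps each element independently with probability ratio (an assumption).
   * The input list is only read, never modified.
   */
  static <T> List<T> randomSubset(List<T> items, Random rng, double ratio) {
    if (ratio < 0.0 || ratio > 1.0) {
      throw new IllegalArgumentException("ratio must be in [0,1]");
    }
    List<T> subset = new ArrayList<>();
    for (T item : items) {
      // nextDouble() is in [0,1), so ratio 0 keeps nothing and ratio 1 keeps everything
      if (rng.nextDouble() < ratio) {
        subset.add(item);
      }
    }
    return subset;
  }
}
```

Under this reading the size of the result varies from call to call; only its expected value is ratio times the input size.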

bagging

public Data bagging(java.util.Random rng)
If the data has N cases, samples N cases at random, with replacement.

Parameters:
rng - random number generator
Returns:
sampled data
bagging

public Data bagging(java.util.Random rng,
                    boolean[] sampled)
If the data has N cases, samples N cases at random, with replacement.

Parameters:
rng - random number generator
sampled - flags indicating which instances have been sampled
Returns:
sampled data
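
Sampling with replacement, as described for both bagging methods, can be sketched as follows. This is an illustration on a generic java.util.List, not the code behind Data.bagging. The names BaggingSketch and bag are hypothetical, and the requirement that sampled has one entry per element is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical illustration; not the implementation of Data.bagging.
final class BaggingSketch {

  /**
   * Draws items.size() elements uniformly at random, with replacement.
   * sampled[i] is set to true when element i is drawn at least once.
   */
  static <T> List<T> bag(List<T> items, Random rng, boolean[] sampled) {
    int n = items.size();
    if (sampled.length != n) {
      // Assumption of this sketch: one flag per element
      throw new IllegalArgumentException("sampled must have one entry per element");
    }
    List<T> bag = new ArrayList<>(n);
    for (int i = 0; i < n; i++) {
      int index = rng.nextInt(n); // the same index may be drawn more than once
      bag.add(items.get(index));
      sampled[index] = true;
    }
    return bag;
  }
}
```

Because draws repeat, some elements usually never appear in the result; their flags stay false, which lets a caller identify the elements that were left out.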

rsplit

public Data rsplit(java.util.Random rng,
                   int subsize)
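Splits the data in two: returns one part, and this object keeps the rest of the data. Note that this operation is very slow.

Parameters:
rng - random number generator
subsize - size of the returned part
Returns:
the part that was split off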
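
A destructive random split of this kind can be sketched on a mutable java.util.List. The names RandomSplitSketch and randomSplit are hypothetical, and reading subsize as the number of elements moved into the returned part is an assumption, since the source leaves that parameter undocumented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical illustration; not the implementation of Data.rsplit.
final class RandomSplitSketch {

  /**
   * Removes subsize randomly chosen elements from items and returns them.
   * After the call, items holds only the remaining elements.
   */
  static <T> List<T> randomSplit(List<T> items, Random rng, int subsize) {
    if (subsize < 0 || subsize > items.size()) {
      throw new IllegalArgumentException("subsize must be between 0 and items.size()");
    }
    List<T> part = new ArrayList<>(subsize);
    for (int i = 0; i < subsize; i++) {
      // remove(int index) deletes by position and returns the removed element
      part.add(items.remove(rng.nextInt(items.size())));
    }
    return part;
  }
}
```

In this sketch every removal from an ArrayList shifts the elements that follow it, so the cost grows with both the list size and subsize. Whether that is the cause of the slowness the source warns about is not stated.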

isIdentical

public boolean isIdentical()
checks if all the vectors have identical attribute values

Returns:
true if all the vectors are identical or the data is empty; false otherwise

identicalLabel

public boolean identicalLabel()
Checks whether all the vectors have identical label values

Returns:

values

public double[] values(int attr)
Finds all distinct values of a given attribute

Parameters:
attr - index of the attribute
Returns:
the distinct values of the attribute
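
Collecting the distinct values of one attribute can be sketched as follows. The sketch assumes each instance is a plain double[] row and returns the values in first-seen order. The source does not specify any ordering, and the names DistinctValuesSketch and distinctValues are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical illustration; not the implementation of Data.values.
final class DistinctValuesSketch {

  /** Returns the distinct values found in column attr, in first-seen order. */
  static double[] distinctValues(double[][] rows, int attr) {
    // LinkedHashSet drops duplicates while remembering insertion order
    Set<Double> seen = new LinkedHashSet<>();
    for (double[] row : rows) {
      seen.add(row[attr]);
    }
    double[] result = new double[seen.size()];
    int i = 0;
    for (double value : seen) {
      result[i++] = value;
    }
    return result;
  }
}
```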

clone

public Data clone()
Overrides:
clone in class java.lang.Object

equals

public boolean equals(java.lang.Object obj)
Overrides:
equals in class java.lang.Object

hashCode

public int hashCode()
Overrides:
hashCode in class java.lang.Object

extractLabels

public int[] extractLabels()
Extracts the labels of all instances

Returns:
the labels of all instances

extractLabels

public static int[] extractLabels(Dataset dataset,
                                  org.apache.hadoop.fs.FileSystem fs,
                                  org.apache.hadoop.fs.Path path)
                           throws java.io.IOException
Extracts the labels of all instances from a data file

Parameters:
dataset - dataset describing the data
fs - file system
path - data path
Returns:
the labels of all instances in the data file
Throws:
java.io.IOException

majorityLabel

public int majorityLabel(java.util.Random rng)
Finds the majority label, breaking ties randomly

Parameters:
rng - random number generator used to break ties
Returns:
the majority label value
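
One way to find a majority label with random tie-breaking is sketched below. It assumes labels are integer codes in the range 0 to numLabels - 1. The names MajorityLabelSketch and numLabels are hypothetical, and the actual method may pick among tied labels differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical illustration; not the implementation of Data.majorityLabel.
final class MajorityLabelSketch {

  /** Returns the most frequent label; ties are resolved uniformly at random. */
  static int majorityLabel(int[] labels, int numLabels, Random rng) {
    if (numLabels <= 0) {
      throw new IllegalArgumentException("numLabels must be positive");
    }
    int[] counts = new int[numLabels];
    for (int label : labels) {
      counts[label]++;
    }
    int max = -1;
    List<Integer> tied = new ArrayList<>();
    for (int label = 0; label < numLabels; label++) {
      if (counts[label] > max) {
        // New best count: restart the list of candidates
        max = counts[label];
        tied.clear();
        tied.add(label);
      } else if (counts[label] == max) {
        tied.add(label);
      }
    }
    // With no labels at all, every candidate ties at zero
    return tied.get(rng.nextInt(tied.size()));
  }
}
```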

countLabels

public void countLabels(int[] counts)
Counts the number of occurrences of each label value

Parameters:
counts - array that will contain the results; expected to be initialized to 0
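
The reason the array is expected to start at 0 shows up in a small sketch: counting into a caller-supplied array adds to whatever it already holds. The sketch assumes labels are integer codes that index directly into counts. The names CountLabelsSketch and countLabels (as a static helper) are hypothetical.

```java
// Hypothetical illustration; not the implementation of Data.countLabels.
final class CountLabelsSketch {

  /** Adds one to counts[label] for every label; existing values are kept. */
  static void countLabels(int[] labels, int[] counts) {
    for (int label : labels) {
      counts[label]++;
    }
  }

  public static void main(String[] args) {
    int[] counts = new int[3]; // Java zero-initializes new int arrays
    countLabels(new int[] {0, 2, 2, 1, 2}, counts);
    System.out.println(java.util.Arrays.toString(counts)); // prints [1, 1, 3]
  }
}
```

Reusing the same array for a second call without resetting it would add the new counts to the old ones.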

getDataset

public Dataset getDataset()


Copyright © 2008-2010 The Apache Software Foundation. All Rights Reserved.