org.apache.lucene.util
Class FuzzySet

java.lang.Object
  extended by org.apache.lucene.util.FuzzySet

public class FuzzySet
extends Object

A class used to represent a set of many, potentially large, values (e.g. many long strings such as URLs), using a significantly smaller amount of memory.

The set is "lossy" in that it cannot definitively state that it does contain a value, but it can definitively say if a value is not in the set. It can therefore be used as a Bloom filter.
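The NO/MAYBE behaviour described above can be illustrated with a minimal sketch. It is not the FuzzySet implementation: the class name LossySetSketch, the use of java.util.BitSet, the Arrays.hashCode stand-in hash and the mask of the form 2^k - 1 are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.BitSet;

/** Hypothetical single-hash lossy set; an illustration only, not Lucene's FuzzySet. */
final class LossySetSketch {
    enum Result { NO, MAYBE }

    private final BitSet bits;
    private final int mask; // assumed to be of the form 2^k - 1

    LossySetSketch(int mask) {
        this.mask = mask;
        this.bits = new BitSet(mask + 1);
    }

    // Stand-in hash (assumption); a real filter would use a stronger hash function.
    private int position(byte[] value) {
        return Arrays.hashCode(value) & mask;
    }

    void add(byte[] value) {
        bits.set(position(value));
    }

    // A clear bit proves absence; a set bit may belong to a different value.
    Result contains(byte[] value) {
        return bits.get(position(value)) ? Result.MAYBE : Result.NO;
    }

    public static void main(String[] args) {
        LossySetSketch set = new LossySetSketch(1023);
        byte[] url = "http://example.com/a".getBytes(StandardCharsets.UTF_8);
        System.out.println(set.contains(url)); // nothing recorded yet
        set.add(url);
        System.out.println(set.contains(url)); // recorded values never answer NO
    }
}
```

Only the bit positions are stored, never the values themselves, which is why a MAYBE answer cannot be upgraded to a definite yes.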

The set can also be used for fuzzy counting, because it can estimate reasonably accurately how many unique values it contains.
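One common way to make such an estimate, assuming a single hash per value spread uniformly over m bits, is to invert the expected saturation s = 1 - e^(-n/m), which gives n = -m * ln(1 - s). The sketch below applies that formula; the class and method names are hypothetical, and it is not claimed to be the exact calculation this class performs.

```java
/** Hypothetical estimator; assumes one uniformly distributed hash per value. */
final class CardinalityEstimateSketch {

    /** Estimates how many distinct values were recorded, given the number of set bits. */
    static long estimateUniqueValues(int setSize, int numSetBits) {
        if (setSize <= 0 || numSetBits < 0 || numSetBits >= setSize) {
            throw new IllegalArgumentException("need 0 <= numSetBits < setSize");
        }
        double saturation = (double) numSetBits / setSize;
        // n = -m * ln(1 - s); log1p keeps precision when saturation is small
        return Math.round(-setSize * Math.log1p(-saturation));
    }

    public static void main(String[] args) {
        // The estimate exceeds the raw bit count because some values collide.
        System.out.println(estimateUniqueValues(1024, 512));
    }
}
```

The estimate degrades as saturation approaches 1, because almost every new value then lands on a bit that is already set.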

This class is NOT thread-safe.

Internally a Bitset is used to record values. Once a client has finished recording a stream of values, the downsize(float) method can be used to create a smaller set that is sized appropriately for the number of values recorded and the desired saturation level.
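Downsizing without re-reading the original values is possible when both sizes are masks of the form 2^k - 1, because (hash & largeMask) & smallMask equals hash & smallMask. The following sketch folds a larger java.util.BitSet into a smaller one under that assumption; the names DownsizeSketch, fold and saturation are hypothetical and the code is not taken from FuzzySet.

```java
import java.util.BitSet;

/** Hypothetical folding of a bit set; assumes both masks have the form 2^k - 1. */
final class DownsizeSketch {

    /** Maps every set bit of the source onto the smaller address space. */
    static BitSet fold(BitSet source, int newMask) {
        BitSet target = new BitSet(newMask + 1);
        for (int i = source.nextSetBit(0); i >= 0; i = source.nextSetBit(i + 1)) {
            target.set(i & newMask);
        }
        return target;
    }

    /** Fraction of bits set in a bit set addressed by the given mask. */
    static double saturation(BitSet bits, int mask) {
        return (double) bits.cardinality() / (mask + 1);
    }

    public static void main(String[] args) {
        BitSet large = new BitSet(1024);
        large.set(5);
        large.set(261);
        large.set(1000);
        BitSet small = fold(large, 255);
        System.out.println(saturation(large, 1023) + " -> " + saturation(small, 255));
    }
}
```

Each halving can at most merge pairs of bits, so saturation never decreases when folding. A caller could therefore try successively smaller masks and stop before the target saturation is exceeded.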

WARNING: This API is experimental and might change in incompatible ways in the next release.

Nested Class Summary
static class FuzzySet.ContainsResult
           
 
Field Summary
static int FUZZY_SERIALIZATION_VERSION
           
 
Method Summary
 void addValue(BytesRef value)
          Records a value in the set.
 FuzzySet.ContainsResult contains(BytesRef value)
          The main method required for a Bloom filter which, given a value, determines set membership.
static FuzzySet createSetBasedOnMaxMemory(int maxNumBytes, HashFunction hashFunction)
           
static FuzzySet createSetBasedOnQuality(int maxNumUniqueValues, float desiredMaxSaturation, HashFunction hashFunction)
           
static FuzzySet deserialize(DataInput in)
           
 FuzzySet downsize(float targetMaxSaturation)
           
static int getEstimatedNumberUniqueValuesAllowingForCollisions(int setSize, int numRecordedBits)
           
 int getEstimatedUniqueValues()
           
static int getNearestSetSize(int maxNumberOfBits)
          Rounds down required maxNumberOfBits to the nearest number that is made up of all ones as a binary number.
static int getNearestSetSize(int maxNumberOfValuesExpected, float desiredSaturation)
          Use this method to choose a set size where accuracy (low content saturation) is more important than deciding how much memory to throw at the problem.
 float getSaturation()
           
 void serialize(DataOutput out)
          Serializes the data set to file using the following format:
          FuzzySet --> FuzzySetVersion, HashFunctionName, BloomSize, NumBitSetWords, BitSetWord^NumBitSetWords
          HashFunctionName --> String. The name of a ServiceProvider registered HashFunction
          FuzzySetVersion --> Uint32. The version number of the FuzzySet class
          BloomSize --> Uint32. The modulo value used to project hashes into the field's Bitset
          NumBitSetWords --> Uint32. The number of longs (as returned from FixedBitSet.getBits())
          BitSetWord --> Long. A long from the array returned by FixedBitSet.getBits()
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

FUZZY_SERIALIZATION_VERSION

public static final int FUZZY_SERIALIZATION_VERSION
See Also:
Constant Field Values
Method Detail

getNearestSetSize

public static int getNearestSetSize(int maxNumberOfBits)
Rounds down required maxNumberOfBits to the nearest number that is made up of all ones as a binary number. Use this method where controlling memory use is paramount.
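
Rounding down to an all-ones binary number means finding the largest value of the form 2^k - 1 that does not exceed the input. A minimal sketch of that arithmetic follows; the class and method names are hypothetical, and the real method may apply additional limits (such as a minimum or maximum size) that are not shown here.

```java
/** Hypothetical helper showing the rounding only; not the FuzzySet source. */
final class NearestSetSizeSketch {

    /** Returns the largest value of the form 2^k - 1 that is <= maxNumberOfBits. */
    static int roundDownToAllOnes(int maxNumberOfBits) {
        if (maxNumberOfBits < 0) {
            throw new IllegalArgumentException("maxNumberOfBits must not be negative");
        }
        // Widen to long so that Integer.MAX_VALUE + 1 does not overflow.
        long highestPowerOfTwo = Long.highestOneBit((long) maxNumberOfBits + 1);
        return (int) (highestPowerOfTwo - 1);
    }

    public static void main(String[] args) {
        System.out.println(Integer.toBinaryString(roundDownToAllOnes(1000)));
    }
}
```

An all-ones size can be applied to a hash with a single bitwise AND, which is one plausible reason for restricting sizes to this form.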


getNearestSetSize

public static int getNearestSetSize(int maxNumberOfValuesExpected,
                                    float desiredSaturation)
Use this method to choose a set size where accuracy (low content saturation) is more important than deciding how much memory to throw at the problem.

Parameters:
maxNumberOfValuesExpected - The maximum number of values expected to be recorded
desiredSaturation - A number between 0 and 1 expressing the fraction of bits set once all values have been recorded
Returns:
The size of the set nearest to the required size
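
A quality-driven size can be derived by asking how many bits keep the expected saturation at or below the target. Assuming one uniformly distributed hash per value, n values in m bits give an expected saturation of 1 - e^(-n/m), so m must be at least -n / ln(1 - s). The sketch below then picks the smallest power-of-two bit count that satisfies this and returns it as a 2^k - 1 mask. The names are hypothetical and the selection rule is an assumption, not the documented behaviour of this method.

```java
/** Hypothetical sizing rule; assumes one uniformly distributed hash per value. */
final class QualitySizingSketch {

    /** Returns a mask 2^k - 1 addressing the fewest bits that meet the saturation target. */
    static int sizeForSaturation(int maxValues, float desiredSaturation) {
        if (maxValues < 0 || !(desiredSaturation > 0f && desiredSaturation < 1f)) {
            throw new IllegalArgumentException("need maxValues >= 0 and 0 < desiredSaturation < 1");
        }
        // Bits needed so that 1 - e^(-n/m) stays at or below the target.
        double requiredBits = -maxValues / Math.log1p(-desiredSaturation);
        for (int k = 1; k <= 31; k++) {
            long numBits = 1L << k;
            if (numBits >= requiredBits) {
                return (int) (numBits - 1);
            }
        }
        throw new IllegalArgumentException("no int-sized set is large enough");
    }

    public static void main(String[] args) {
        System.out.println(sizeForSaturation(1000, 0.1f));
    }
}
```

Lower target saturations demand disproportionately more bits: under this model, 1000 values need roughly 1,443 bits at 50% saturation but roughly 9,491 bits at 10%.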

createSetBasedOnMaxMemory

public static FuzzySet createSetBasedOnMaxMemory(int maxNumBytes,
                                                 HashFunction hashFunction)

createSetBasedOnQuality

public static FuzzySet createSetBasedOnQuality(int maxNumUniqueValues,
                                               float desiredMaxSaturation,
                                               HashFunction hashFunction)

contains

public FuzzySet.ContainsResult contains(BytesRef value)
The main method required for a Bloom filter which, given a value, determines set membership. Unlike a conventional set, the fuzzy set returns NO or MAYBE rather than true or false.

Parameters:
value - the value to test for membership
Returns:
NO or MAYBE

serialize

public void serialize(DataOutput out)
               throws IOException
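Serializes the data set to file using the following format:

FuzzySet --> FuzzySetVersion, HashFunctionName, BloomSize, NumBitSetWords, BitSetWord^NumBitSetWords
HashFunctionName --> String. The name of a ServiceProvider registered HashFunction
FuzzySetVersion --> Uint32. The version number of the FuzzySet class
BloomSize --> Uint32. The modulo value used to project hashes into the field's Bitset
NumBitSetWords --> Uint32. The number of longs (as returned from FixedBitSet.getBits())
BitSetWord --> Long. A long from the array returned by FixedBitSet.getBits()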

Parameters:
out - Data output stream
Throws:
IOException
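
The field order of the serialized form (version, hash function name, bloom size, word count, then the words) can be illustrated with plain java.io streams. This is only a sketch of the layout: java.io.DataOutputStream stands in for Lucene's DataOutput, so the byte-level encoding (in particular of the string) is an assumption and will not match files written by this class. The class name and the "MurmurHash2" value are placeholders.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical layout demo using java.io; not byte-compatible with Lucene. */
final class FuzzySetFormatSketch {

    static byte[] write(int version, String hashFunctionName, int bloomSize, long[] words) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(version);            // FuzzySetVersion
            out.writeUTF(hashFunctionName);   // HashFunctionName
            out.writeInt(bloomSize);          // BloomSize
            out.writeInt(words.length);       // NumBitSetWords
            for (long word : words) {
                out.writeLong(word);          // BitSetWord, repeated NumBitSetWords times
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    static long[] readWords(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            in.readInt();                     // skip version
            in.readUTF();                     // skip hash function name
            in.readInt();                     // skip bloom size
            long[] words = new long[in.readInt()];
            for (int i = 0; i < words.length; i++) {
                words[i] = in.readLong();
            }
            return words;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = write(1, "MurmurHash2", 1023, new long[] {1L, -2L});
        System.out.println(readWords(data).length + " words in " + data.length + " bytes");
    }
}
```

Storing the hash function by name rather than by implementation suggests that a reader resolves the same function again before querying the restored set.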

deserialize

public static FuzzySet deserialize(DataInput in)
                            throws IOException
Throws:
IOException

addValue

public void addValue(BytesRef value)
              throws IOException
Records a value in the set. The referenced bytes are hashed and then modulo n'd where n is the chosen size of the internal bitset.

Parameters:
value - the key value to be hashed
Throws:
IOException

downsize

public FuzzySet downsize(float targetMaxSaturation)
Parameters:
targetMaxSaturation - A number between 0 and 1 describing the fraction of bits that would ideally be set in the result. Lower values have better accuracy but require more space.
Returns:
a smaller FuzzySet or null if the current set is already over-saturated

getEstimatedUniqueValues

public int getEstimatedUniqueValues()

getEstimatedNumberUniqueValuesAllowingForCollisions

public static int getEstimatedNumberUniqueValuesAllowingForCollisions(int setSize,
                                                                      int numRecordedBits)

getSaturation

public float getSaturation()


Copyright © 2000-2012 Apache Software Foundation. All Rights Reserved.