org.apache.hadoop.hbase.util
Class ByteBloomFilter

java.lang.Object
  extended by org.apache.hadoop.hbase.util.ByteBloomFilter
All Implemented Interfaces:
BloomFilter, BloomFilterBase, BloomFilterWriter

public class ByteBloomFilter
extends Object
implements BloomFilter, BloomFilterWriter

Implements a Bloom filter, as defined by Bloom in 1970.

The Bloom filter is a data structure that was introduced in 1970 and that has been adopted by the networking research community in the past decade thanks to the bandwidth efficiencies that it offers for the transmission of set membership information between networked hosts. A sender encodes the information into a bit vector, the Bloom filter, that is more compact than a conventional representation. Computation and space costs for construction are linear in the number of elements. The receiver uses the filter to test whether various elements are members of the set. Though the filter will occasionally return a false positive, it will never return a false negative. When creating the filter, the sender can choose its desired point in a trade-off between the false positive rate and the size.

Originally inspired by European Commission One-Lab Project 034819. Bloom filters are very sensitive to the number of elements inserted into them. For HBase, the number of entries depends on the size of the data stored in the column. Currently the default region size is 256MB, so entry count ~= 256MB / (average value size for column). Despite this rule of thumb, there is no efficient way to calculate the entry count after compactions. Therefore, it is often easier to use a dynamic bloom filter that will add extra space instead of allowing the error rate to grow. ( http://www.eecs.harvard.edu/~michaelm/NEWWORK/postscripts/BloomFilterSurvey.pdf )

m denotes the number of bits in the Bloom filter (bitSize)
n denotes the number of elements inserted into the Bloom filter (maxKeys)
k represents the number of hash functions used (nbHash)
e represents the desired false positive rate for the bloom (err)

If we fix the error rate (e) and know the number of entries, then the optimal bloom size m = -(n * ln(err)) / (ln(2)^2) ~= n * ln(err) / ln(0.6185)

The probability of false positives is minimized when k = (m/n) * ln(2).
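The sizing formulas above can be sketched in plain Java as follows. This is a minimal illustration, not the HBase implementation: the class BloomSizing and its method names are hypothetical, and rounding both results up to whole numbers is an assumption.

```java
// Sketch only: BloomSizing and its methods are illustrative names, not HBase API.
class BloomSizing {
    // ln(2)^2, approximately 0.480453
    static final double LOG2_SQUARED = Math.log(2) * Math.log(2);

    /** m = -(n * ln(e)) / (ln(2)^2), rounded up to a whole number of bits (assumption). */
    static long optimalBits(long n, double e) {
        return (long) Math.ceil(-(n * Math.log(e)) / LOG2_SQUARED);
    }

    /** k = (m / n) * ln(2), rounded up to a whole number of hash functions (assumption). */
    static int optimalHashCount(long m, long n) {
        return (int) Math.ceil(((double) m / n) * Math.log(2));
    }

    public static void main(String[] args) {
        long n = 1_000_000L;   // expected number of keys
        double e = 0.01;       // desired false positive rate
        long m = optimalBits(n, e);
        int k = optimalHashCount(m, n);
        System.out.println("bits=" + m + " bitsPerKey=" + ((double) m / n) + " hashes=" + k);
    }
}
```

For a 1% error rate this works out to a little under 10 bits per key, which is why halving the error rate costs only a modest amount of extra space.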

See Also:
The general behavior of a filter, Space/Time Trade-Offs in Hash Coding with Allowable Errors

Field Summary
protected  ByteBuffer bloom
          Bloom bits
protected  long byteSize
          Bytes (B) in the array.
protected  Hash hash
          Hash Function
protected  int hashCount
          Number of hash functions
protected  int hashType
          Hash type
protected  int keyCount
          Keys currently in the bloom
static double LOG2_SQUARED
          Used in computing the optimal Bloom filter size.
protected  int maxKeys
          Max Keys expected for the bloom
static String STATS_RECORD_SEP
          Record separator for the Bloom filter statistics human-readable string
static int VERSION
          Current file format version
 
Constructor Summary
ByteBloomFilter(DataInput meta)
          Loads bloom filter meta data from file input.
ByteBloomFilter(int maxKeys, double errorRate, int hashType, int foldFactor)
          Determines & initializes bloom filter meta data from user config.
 
Method Summary
 double actualErrorRate()
          Computes the error rate for this Bloom filter, taking into account the actual number of hash functions and keys inserted.
static double actualErrorRate(long maxKeys, long bitSize, int functionCount)
          Computes the actual error rate for the given number of elements, number of bits, and number of hash functions.
 void add(byte[] buf)
           
 void add(byte[] buf, int offset, int len)
          Add the specified binary to the bloom filter.
 void allocBloom()
          Allocate memory for the bloom filter data.
 void compactBloom()
          Compact the Bloom filter before writing metadata & data to disk.
static long computeBitSize(long maxKeys, double errorRate)
           
static int computeFoldableByteSize(long bitSize, int foldFactor)
          Increases the given byte size of a Bloom filter until it can be folded by the given factor.
static long computeMaxKeys(long bitSize, double errorRate, int hashCount)
          The maximum number of keys we can put into a Bloom filter of a certain size to get the given error rate, with the given number of hash functions.
static boolean contains(byte[] buf, int offset, int length, byte[] bloomArray, int bloomOffset, int bloomSize, Hash hash, int hashCount)
           
 boolean contains(byte[] buf, int offset, int length, ByteBuffer theBloom)
          Check if the specified key is contained in the bloom filter.
 ByteBloomFilter createAnother()
          Creates another similar Bloom filter.
 byte[] createBloomKey(byte[] rowBuf, int rowOffset, int rowLen, byte[] qualBuf, int qualOffset, int qualLen)
          Create a key for a row-column Bloom filter.
static ByteBloomFilter createBySize(int byteSizeHint, double errorRate, int hashType, int foldFactor)
          Creates a Bloom filter of the given size.
static String formatStats(BloomFilterBase bloomFilter)
          A human-readable string with statistics for the given Bloom filter.
 long getByteSize()
           
 org.apache.hadoop.io.RawComparator<byte[]> getComparator()
           
 org.apache.hadoop.io.Writable getDataWriter()
          Get a writable interface into bloom filter data (the actual Bloom bits).
 int getHashCount()
           
 int getHashType()
           
 long getKeyCount()
           
 long getMaxKeys()
           
 org.apache.hadoop.io.Writable getMetaWriter()
          Get a writable interface into bloom filter meta data.
static long idealMaxKeys(long bitSize, double errorRate)
          The maximum number of keys we can put into a Bloom filter of a certain size to maintain the given error rate, assuming the number of hash functions is chosen optimally and does not even have to be an integer (hence the "ideal" in the function name).
static void setFakeLookupMode(boolean enabled)
           
 boolean supportsAutoLoading()
           
 String toString()
           
 void writeBloom(DataOutput out)
          Writes just the Bloom filter bits to the given output.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

VERSION

public static final int VERSION
Current file format version

See Also:
Constant Field Values

byteSize

protected long byteSize
Bytes (B) in the array. This actually has to fit into an int.


hashCount

protected int hashCount
Number of hash functions


hashType

protected final int hashType
Hash type


hash

protected final Hash hash
Hash Function


keyCount

protected int keyCount
Keys currently in the bloom


maxKeys

protected int maxKeys
Max Keys expected for the bloom


bloom

protected ByteBuffer bloom
Bloom bits


STATS_RECORD_SEP

public static final String STATS_RECORD_SEP
Record separator for the Bloom filter statistics human-readable string

See Also:
Constant Field Values

LOG2_SQUARED

public static final double LOG2_SQUARED
Used in computing the optimal Bloom filter size. This approximately equals 0.480453.

Constructor Detail

ByteBloomFilter

public ByteBloomFilter(DataInput meta)
                throws IOException,
                       IllegalArgumentException
Loads bloom filter meta data from file input.

Parameters:
meta - stored bloom meta data
Throws:
IllegalArgumentException - meta data is invalid
IOException

ByteBloomFilter

public ByteBloomFilter(int maxKeys,
                       double errorRate,
                       int hashType,
                       int foldFactor)
                throws IllegalArgumentException
Determines & initializes bloom filter meta data from user config. Call allocBloom() to allocate bloom filter data.

Parameters:
maxKeys - Maximum expected number of keys that will be stored in this bloom
errorRate - Desired false positive error rate. Lower rate = more storage required
hashType - Type of hash function to use
foldFactor - When finished adding entries, you may be able to 'fold' this bloom to save space. Trades potentially excess bytes in the bloom for the ability to fold if keyCount turns out to be exponentially smaller than maxKeys.
Throws:
IllegalArgumentException
Method Detail

computeBitSize

public static long computeBitSize(long maxKeys,
                                  double errorRate)
Parameters:
maxKeys - number of keys the Bloom filter is expected to hold
errorRate - target false positive rate
Returns:
the number of bits for a Bloom filter that can hold the given number of keys and provide the given error rate, assuming that the optimal number of hash functions is used and it does not have to be an integer.

idealMaxKeys

public static long idealMaxKeys(long bitSize,
                                double errorRate)
The maximum number of keys we can put into a Bloom filter of a certain size to maintain the given error rate, assuming the number of hash functions is chosen optimally and does not even have to be an integer (hence the "ideal" in the function name).

Parameters:
bitSize - number of bits in the Bloom filter
errorRate - target false positive rate
Returns:
maximum number of keys that can be inserted into the Bloom filter
See Also:
computeMaxKeys(long, double, int) for a more precise estimate
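The "ideal" key count follows from inverting the optimal-size formula m = -(n * ln(e)) / (ln(2)^2) to solve for n. A minimal sketch, assuming that derivation; the class name IdealMaxKeysSketch is hypothetical and this is not presented as the HBase source:

```java
// Sketch only: inverts m = -(n * ln(e)) / ln(2)^2 to get n = m * ln(2)^2 / -ln(e).
class IdealMaxKeysSketch {
    static long idealMaxKeys(long bitSize, double errorRate) {
        double log2Squared = Math.log(2) * Math.log(2);
        // The cast truncates, so the result never exceeds what the
        // target error rate allows.
        return (long) (bitSize * (log2Squared / -Math.log(errorRate)));
    }

    public static void main(String[] args) {
        // Roughly 1 MB of bits at a 1% target error rate.
        System.out.println(idealMaxKeys(8L * 1024 * 1024, 0.01));
    }
}
```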

computeMaxKeys

public static long computeMaxKeys(long bitSize,
                                  double errorRate,
                                  int hashCount)
The maximum number of keys we can put into a Bloom filter of a certain size to get the given error rate, with the given number of hash functions.

Parameters:
bitSize - number of bits in the Bloom filter
errorRate - target false positive rate
hashCount - number of hash functions used
Returns:
the maximum number of keys that can be inserted in a Bloom filter to maintain the target error rate, if the number of hash functions is provided.
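When the number of hash functions k is fixed, the key limit can be derived from the standard false positive estimate p = (1 - e^(-k*n/m))^k by solving for n, which gives n = -(m/k) * ln(1 - p^(1/k)). The sketch below assumes that derivation; the class and method names are hypothetical.

```java
// Sketch only: solves p = (1 - e^(-k*n/m))^k for n with k fixed.
class MaxKeysForHashCountSketch {
    static long maxKeys(long bitSize, double errorRate, int hashCount) {
        // p^(1/k): the fraction of bits that may be set at the target error rate.
        double perHash = Math.pow(errorRate, 1.0 / hashCount);
        return (long) (-((double) bitSize / hashCount) * Math.log(1.0 - perHash));
    }

    public static void main(String[] args) {
        System.out.println(maxKeys(9_585_059L, 0.01, 7));
    }
}
```

Because k must be an integer here, the result is at most the "ideal" estimate for the same bit size and error rate.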

actualErrorRate

public double actualErrorRate()
Computes the error rate for this Bloom filter, taking into account the actual number of hash functions and keys inserted. The return value of this function changes as a Bloom filter is being populated. Used for reporting the actual error rate of compound Bloom filters when writing them out.

Returns:
error rate for this particular Bloom filter

actualErrorRate

public static double actualErrorRate(long maxKeys,
                                     long bitSize,
                                     int functionCount)
Computes the actual error rate for the given number of elements, number of bits, and number of hash functions. Taken directly from the Wikipedia Bloom filter article.

Parameters:
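maxKeys - number of elements (keys) in the Bloom filter
bitSize - number of bits in the Bloom filter
functionCount - number of hash functions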
Returns:
the actual error rate
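The commonly cited estimate for the false positive rate of a Bloom filter with n keys, m bits, and k hash functions is (1 - e^(-k*n/m))^k. A minimal sketch of that formula follows; ErrorRateSketch is a hypothetical name, and the code is an illustration of the formula rather than a copy of this method.

```java
// Sketch only: evaluates p = (1 - e^(-k*n/m))^k.
class ErrorRateSketch {
    static double falsePositiveRate(long keys, long bitSize, int hashCount) {
        double exponent = -((double) hashCount * keys) / bitSize;
        // Expected fraction of bits set after inserting all keys.
        double fractionSet = 1.0 - Math.exp(exponent);
        // A lookup is a false positive only if all k probed bits are set.
        return Math.pow(fractionSet, hashCount);
    }

    public static void main(String[] args) {
        System.out.println(falsePositiveRate(1_000_000L, 9_585_059L, 7));
    }
}
```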

computeFoldableByteSize

public static int computeFoldableByteSize(long bitSize,
                                          int foldFactor)
Increases the given byte size of a Bloom filter until it can be folded by the given factor.

Parameters:
bitSize -
foldFactor -
Returns:
Foldable byte size
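One way to read "can be folded by the given factor" is that the byte size must be divisible by 2^foldFactor, so that the array can be halved foldFactor times. The sketch below rounds up under that assumption; the class name and the overflow check are illustrative choices, not confirmed behavior of this method.

```java
// Sketch only: assumes "foldable by factor f" means byteSize % 2^f == 0.
class FoldableSizeSketch {
    static int foldableByteSize(long bitSize, int foldFactor) {
        long byteSize = (bitSize + 7) / 8;        // round bits up to whole bytes
        long mask = (1L << foldFactor) - 1;
        if ((byteSize & mask) != 0) {
            // Round up to the next multiple of 2^foldFactor.
            byteSize = ((byteSize >> foldFactor) + 1) << foldFactor;
        }
        if (byteSize > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("byteSize too large: " + byteSize);
        }
        return (int) byteSize;
    }

    public static void main(String[] args) {
        System.out.println(foldableByteSize(1000L, 3));
    }
}
```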

createBySize

public static ByteBloomFilter createBySize(int byteSizeHint,
                                           double errorRate,
                                           int hashType,
                                           int foldFactor)
Creates a Bloom filter of the given size.

Parameters:
byteSizeHint - the desired number of bytes for the Bloom filter bit array. Will be increased so that folding is possible.
errorRate - target false positive rate of the Bloom filter
hashType - Bloom filter hash function type
foldFactor -
Returns:
the new Bloom filter of the desired size

createAnother

public ByteBloomFilter createAnother()
Creates another similar Bloom filter. Does not copy the actual bits, and sets the new filter's key count to zero.

Returns:
a Bloom filter with the same configuration as this

allocBloom

public void allocBloom()
Description copied from interface: BloomFilterWriter
Allocate memory for the bloom filter data.

Specified by:
allocBloom in interface BloomFilterWriter

add

public void add(byte[] buf)

add

public void add(byte[] buf,
                int offset,
                int len)
Description copied from interface: BloomFilterWriter
Add the specified binary to the bloom filter.

Specified by:
add in interface BloomFilterWriter
Parameters:
buf - data to be added to the bloom
offset - offset into the data to be added
len - length of the data to be added

contains

public boolean contains(byte[] buf,
                        int offset,
                        int length,
                        ByteBuffer theBloom)
Description copied from interface: BloomFilter
Check if the specified key is contained in the bloom filter.

Specified by:
contains in interface BloomFilter
Parameters:
buf - data to check for existence of
offset - offset into the data
length - length of the data
theBloom - bloom filter data to search. This can be null if auto-loading is supported.
Returns:
true if matched by bloom, false if not
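To illustrate the add/contains contract (false positives are possible, false negatives are not), here is a toy, self-contained Bloom filter. It is not the ByteBloomFilter implementation: the class TinyBloomSketch, its FNV-1a style hash, and the way the k bit positions are derived from two hash values are all assumptions chosen for brevity.

```java
// Sketch only: a toy Bloom filter, unrelated to the HBase classes.
class TinyBloomSketch {
    private final long[] words;   // bit storage, 64 bits per word
    private final int bitSize;
    private final int hashCount;

    TinyBloomSketch(int bitSize, int hashCount) {
        this.bitSize = bitSize;
        this.hashCount = hashCount;
        this.words = new long[(bitSize + 63) / 64];
    }

    // Seeded FNV-1a style hash; a stand-in for a real hash function.
    private static int hash(byte[] buf, int offset, int len, int seed) {
        int h = 0x811C9DC5 ^ seed;
        for (int i = offset; i < offset + len; i++) {
            h ^= (buf[i] & 0xFF);
            h *= 0x01000193;
        }
        return h;
    }

    // i-th probe position; floorMod keeps it non-negative on int overflow.
    private int position(int h1, int h2, int i) {
        return Math.floorMod(h1 + i * h2, bitSize);
    }

    void add(byte[] buf, int offset, int len) {
        int h1 = hash(buf, offset, len, 0);
        int h2 = hash(buf, offset, len, h1);
        for (int i = 0; i < hashCount; i++) {
            int pos = position(h1, h2, i);
            words[pos >>> 6] |= 1L << (pos & 63);
        }
    }

    boolean contains(byte[] buf, int offset, int len) {
        int h1 = hash(buf, offset, len, 0);
        int h2 = hash(buf, offset, len, h1);
        for (int i = 0; i < hashCount; i++) {
            int pos = position(h1, h2, i);
            if ((words[pos >>> 6] & (1L << (pos & 63))) == 0) {
                return false;   // a clear bit proves the key was never added
            }
        }
        return true;            // all bits set: present, or a false positive
    }
}
```

A caller would treat a false result as definitive and a true result as "possibly present", then consult the underlying data to confirm.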

contains

public static boolean contains(byte[] buf,
                               int offset,
                               int length,
                               byte[] bloomArray,
                               int bloomOffset,
                               int bloomSize,
                               Hash hash,
                               int hashCount)

getKeyCount

public long getKeyCount()
Specified by:
getKeyCount in interface BloomFilterBase
Returns:
The number of keys added to the bloom

getMaxKeys

public long getMaxKeys()
Specified by:
getMaxKeys in interface BloomFilterBase
Returns:
The max number of keys that can be inserted to maintain the desired error rate

getByteSize

public long getByteSize()
Specified by:
getByteSize in interface BloomFilterBase
Returns:
Size of the bloom, in bytes

getHashType

public int getHashType()

compactBloom

public void compactBloom()
Description copied from interface: BloomFilterWriter
Compact the Bloom filter before writing metadata & data to disk.

Specified by:
compactBloom in interface BloomFilterWriter
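The constructor documentation describes "folding" a bloom that received far fewer keys than expected. One common way to fold a bit array is to OR its upper half onto its lower half, after which lookups take the bit position modulo the new, smaller size. The sketch below shows that idea only; whether compactBloom() works exactly this way is an assumption, and FoldSketch with its helpers is hypothetical.

```java
// Sketch only: folds a bit array in half by OR-ing the halves together.
class FoldSketch {
    static void setBit(byte[] bloom, long hash) {
        int pos = (int) Math.floorMod(hash, (long) bloom.length * 8);
        bloom[pos >>> 3] |= (byte) (1 << (pos & 7));
    }

    static boolean getBit(byte[] bloom, long hash) {
        int pos = (int) Math.floorMod(hash, (long) bloom.length * 8);
        return (bloom[pos >>> 3] & (1 << (pos & 7))) != 0;
    }

    /** Returns a half-size array; every bit set before folding stays reachable. */
    static byte[] foldInHalf(byte[] bloom) {
        if ((bloom.length & 1) != 0) {
            throw new IllegalArgumentException("length must be even to fold");
        }
        int half = bloom.length / 2;
        byte[] folded = new byte[half];
        for (int i = 0; i < half; i++) {
            folded[i] = (byte) (bloom[i] | bloom[i + half]);
        }
        return folded;
    }
}
```

Folding never introduces false negatives, because a bit at position p in the original array maps to position p modulo the folded bit size. It does raise the false positive rate, so it only pays off when the filter is sparsely populated.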

writeBloom

public void writeBloom(DataOutput out)
                throws IOException
Writes just the Bloom filter bits to the given output.

Parameters:
out - DataOutput to write the bloom to
Throws:
IOException - Error writing bloom array

getMetaWriter

public org.apache.hadoop.io.Writable getMetaWriter()
Description copied from interface: BloomFilterWriter
Get a writable interface into bloom filter meta data.

Specified by:
getMetaWriter in interface BloomFilterWriter
Returns:
a writable instance that can be later written to a stream

getDataWriter

public org.apache.hadoop.io.Writable getDataWriter()
Description copied from interface: BloomFilterWriter
Get a writable interface into bloom filter data (the actual Bloom bits). Not used for compound Bloom filters.

Specified by:
getDataWriter in interface BloomFilterWriter
Returns:
a writable instance that can be later written to a stream

getHashCount

public int getHashCount()

supportsAutoLoading

public boolean supportsAutoLoading()
Specified by:
supportsAutoLoading in interface BloomFilter
Returns:
true if this Bloom filter can automatically load its data and thus allows a null byte buffer to be passed to contains()

setFakeLookupMode

public static void setFakeLookupMode(boolean enabled)

createBloomKey

public byte[] createBloomKey(byte[] rowBuf,
                             int rowOffset,
                             int rowLen,
                             byte[] qualBuf,
                             int qualOffset,
                             int qualLen)
Create a key for a row-column Bloom filter. Just concatenate row and column by default. May return the original row buffer if the column qualifier is empty.

Specified by:
createBloomKey in interface BloomFilterBase
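The concatenation described above can be sketched as follows. This is an illustration under stated assumptions, not the method's source: the class BloomKeySketch is hypothetical, lengths are assumed non-negative, and the condition for returning the original row buffer (empty qualifier and a row that spans the whole buffer) is a guess at what "may return the original row buffer" means.

```java
// Sketch only: builds a row-column key by concatenating row and qualifier bytes.
class BloomKeySketch {
    static byte[] rowColKey(byte[] rowBuf, int rowOffset, int rowLen,
                            byte[] qualBuf, int qualOffset, int qualLen) {
        // Empty qualifier and the row fills its buffer: no copy is needed.
        if (qualLen == 0 && rowOffset == 0 && rowLen == rowBuf.length) {
            return rowBuf;
        }
        byte[] key = new byte[rowLen + qualLen];
        System.arraycopy(rowBuf, rowOffset, key, 0, rowLen);
        if (qualLen > 0) {
            System.arraycopy(qualBuf, qualOffset, key, rowLen, qualLen);
        }
        return key;
    }
}
```

Callers should therefore not mutate the returned array unless they own the row buffer as well.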

getComparator

public org.apache.hadoop.io.RawComparator<byte[]> getComparator()
Specified by:
getComparator in interface BloomFilterBase
Returns:
Bloom key comparator

formatStats

public static String formatStats(BloomFilterBase bloomFilter)
A human-readable string with statistics for the given Bloom filter.

Parameters:
bloomFilter - the Bloom filter to output statistics for
Returns:
a string consisting of "<key>: <value>" parts separated by STATS_RECORD_SEP.

toString

public String toString()
Overrides:
toString in class Object


Copyright © 2015 The Apache Software Foundation. All Rights Reserved.