Guides and Best Practices Tech Deep Dives

Probabilistic Data Structures for Big Data and Streaming Applications

This blog post explores Bloom filter and a few other probabilistic data structures based on similar concepts (Counting Bloom filter, Bloom filter with deletion, Count-Min Sketch, cuckoo filter, and HyperLogLog). Probabilistic data structures use hash functions to randomize and compactly represent a set of items, making them extremely useful for big data and streaming applications. For the sake of simplicity, I will only cover the theory and rationale behind these data structures and skip any mathematics, proofs, or formulas around them. The code examples in this blog are in Golang. They are for demonstration only (not the best possible implementations). Links to better implementations are listed at the end of this post.

Applications of probabilistic data structures

Bloom filter was invented in 1970 by Burton Howard Bloom. Some of its more common applications include:

  1. Determining whether a user ID or domain is already taken
  2. Reducing disk lookups for the non-existing keys in a database
  3. Filtering out previously shown posts on recommendation engines
  4. Checking words for misspellings and profanity with a spellchecker
  5. Identifying malicious URLs, blocked IPs, and fraudulent transactions
  6. Counting the number of active users or total unique clicks on a website (with HyperLogLog, which is built on top of Bloom Filter)
  7. Determining heavy hitters (with Count-Min Sketch, which is built on top of Bloom Filter)

Without Bloom filter, our hypothetical hosting company would have had to solve its set-membership problem in a less-sophisticated way. For example, it might use a method involving a regular set or hash map, which requires O(N) time and space, where N is the number of elements in the data set. Bloom filter reduces linear complexity to constant space and time operations. In other words, it trades accuracy for efficiency.

Bloom filter’s accuracy depends on:

  1. Number of hash functions
  2. Quality of hash functions (uniform random distribution)
  3. Size of the hash map (directly proportional to accuracy)
  4. Number of elements in the map (inversely proportional to accuracy)

How Bloom filter works:

Operation Steps
Initialize A bitmap or a Boolean array and some hash functions. All bits are initially set to zero/false.
Insert Hash the key to be inserted and set all the resulting index positions in the bitset/array. If an index is already set, nothing further is required. Insertion never fails, but at the cost of an ever-increasing false-positive rate as more elements are added. When the error rate becomes unacceptable, the Bloom filter must be recreated or extended to keep it in check.
Delete Bloom filter does not support deletion, because unsettling the bits due to a non-existent key could mark an existing key as non-existent. This would corrupt the Bloom filter.
Search/Query Hash the key to be searched and check all resulting index positions in the bitset/array. If any bit is zero/false, return “false” to indicate that the key is not present. If any bit is one/true, return “true” to indicate that the key may be present.

Hash Function

1 type HashFunc func(string, uint) uint
2  
3 // HashAsciiSumFunc adds the ASCII values of each character of the
4 // input key and mods it with the size of the bitset/array.
5 // NOTE: This is a very basic and a poor hash function.
6 func HashAsciiSumFunc(key string, mod uint) uint {
7     var hash uint
8     for _, ch := range key {
9         hash = (hash + uint(ch)) % mod
10     }
11     return hash
12 }

Bloom Filter

1 type BloomFilter struct {
2     size    uint
3     hashmap []bool // use bitset to save space
4     hashes  []HashFunc
5 }
6  
7 func NewBloomFilter(size uint, hashes []HashFunc) *BloomFilter {
8     return &BloomFilter{
9         size:    size,
10         hashmap: make([]bool, size),
11         hashes:  hashes,
12     }
13 }
14  
15 func (bf *BloomFilter) Insert(key string) {
16     for _, hash := range bf.hashes {
17         h := hash(key, bf.size)
18         bf.hashmap[h] = true
19     }
20 }
21  
22 func (bf *BloomFilter) Find(key string) bool {
23     for _, hash := range bf.hashes {
24         h := hash(key, bf.size)
25         if !bf.hashmap[h] {
26             return false
27         }
28     }
29  
30     return true // false positive
31     }

Run Bloom Filter

1 func main() {
2     hashes := []HashFunc{HashAsciiSumFunc}   
3     bf := NewBloomFilter(10, hashes)
4  
5     fmt.Println("bloom filter insert : abc")
6     bf.Insert("abc") // (97+98+99)%size
7  
8     // (120+121+122)%size // returns false
9     fmt.Printf("bloom filter find : xyz = %t\n", bf.Find("xyz"))
10  
11     // (99+98+97)%size // returns true // false positive
12     fmt.Printf("bloom filter find : cab = %t\n", bf.Find("cab"))
13 }

Advantages of Bloom filter:

  1. It uses constant space, regardless of the number of elements inserted.
  2. No false negatives, so you can trust the Bloom filter when it says the item does not exist.
  3. Adding an element never fails.
  4. It does not store the actual elements, ensuring privacy out of the box.

Disadvantages of Bloom filter:

  1. It can return false positives, so you can’t always trust the Bloom filter when it says the element exists.
  2. Adding elements never fails, but at the cost of an ever-increasing false positive rate.
  3. Reducing false-positive rates requires an additional bit array or recreation of the Bloom filter.
  4. Cannot retrieve the inserted elements.
  5. Cannot delete the inserted elements.

Counting Bloom Filter

Counting Bloom filters are an extension of the Bloom filter that allows deletion of elements by storing the frequency of occurrence, instead of a Boolean flag in the array. The index counter is incremented on insertion and decremented on deletion. It is important to check the key for existence before deletion, since decrementing the count for non-existing keys could invalidate the count for existing keys, corrupting the Bloom filter. It can also be used as a regular Bloom filter, where a zero entry corresponds to “false” and a non-zero entry indicates “true” (of course, this takes more space than the Bloom filter). The Counting Bloom filter can also be used to determine whether the count of an element is less than a specified threshold.

Extended Bloom Filter

1 type BloomFilterEx struct {
2     size   uint
3     sketch [][]uint // seperate hashmap for each hash function
4     hashes []HashFunc
5 }
6  
7 func NewBloomFilterEx(size uint, hashes []HashFunc) *BloomFilterEx {
8     sketch := make([][]uint, len(hashes))
9     for i := 0; i < len(hashes); i++ {
10         sketch[i] = make([]uint, size)
11     }
12     return &BloomFilterEx{
13         size:   size,
14         sketch: sketch,
15         hashes: hashes,
16     }
17 }
18  
19 func (bfe *BloomFilterEx) Insert(key string) {
20     for i, hash := range bfe.hashes {
21         // this can be done concurrently
22         h := hash(key, bfe.size)
23         bfe.sketch[i][h]++
24     }
25 }
26  
27 func (bfe *BloomFilterEx) Delete(key string) bool {
28     // delete only if the key exists
29     if !bfe.Find(key) {
30         return false
31     }
32  
33     for i, hash := range bfe.hashes {
34         // this can be done concurrently
35         h := hash(key, bfe.size)
36         bfe.sketch[i][h]--
37     }
38  
39     return true
40 }
41  
42 func (bfe *BloomFilterEx) Find(key string) bool {
43     for i, hash := range bfe.hashes {
44         h := hash(key, bfe.size)
45         if bfe.sketch[i][h] == 0 {
46             return false
47         }
48     }
49  
50     return true // false positive
51 }

Counting Bloom Filter

1 type CountingBloomFilter struct {
2     *BloomFilterEx
3 }
4  
5 func NewCountingBloomFilter(
6     size uint,
7     hashes []HashFunc,
8 ) *CountingBloomFilter {
9     return &CountingBloomFilter{
10         BloomFilterEx: NewBloomFilterEx(size, hashes),
11     }
12 }

Run Counting Bloom Filter

1 func main() {
2     hashes := []HashFunc{HashAsciiSumFunc}
3     cbf := NewCountingBloomFilter(10, hashes)
4  
5     cbf.Insert("abc")
6     fmt.Println(cbf.Find("xyz")) // returns false
7     fmt.Println(cbf.Find("abc")) // returns true
8     fmt.Println(cbf.Find("cab")) // returns true // false positive
9     cbf.Delete("cab")            // succeeds // poor hash distribution
10    fmt.Println(cbf.Find("abc")) // returns false
11    fmt.Println(cbf.Find("cab")) // returns false
12 }

Count-Min Sketch

Count-Min Sketch is a probabilistic data structure that serves as a frequency table of events in a stream of data. It is practically the same as the Counting Bloom filter. It uses hash functions to map events to frequencies, but unlike a hash table, uses only sub-linear space, at the expense of overcounting some events, due to collisions. It indicates that a key is present K times, at most (it may overestimate but it never underestimates).

Count-Min Sketch

1 // CountMinSketch indicates that a key is present atleast K times
2 type CountMinSketch struct {
3     *BloomFilterEx
4 }
5  
6 func NewCountMinSketch(size uint, hashes []HashFunc) *CountMinSketch {
7     return &CountMinSketch{BloomFilterEx: NewBloomFilterEx(size, hashes)}
8 }
9  
10 func (cms *CountMinSketch) Count(key string) uint {
11     minCount := ^uint(0)
12     for i, hash := range cms.hashes {
13         h := hash(key, cms.size)
14         if cms.sketch[i][h] < minCount {
15             minCount = cms.sketch[i][h]
16         }
17     }
18     return minCount // key not present more than minCount times
19 }

Run Count-Min Sketch

1 func main() {
2     hashes := []HashFunc{HashAsciiSumFunc}
3     cms := NewCountMinSketch(10, hashes)
4  
5     cms.Insert("abc")
6     cms.Insert("abc")
7     cms.Insert("abc")
8     cms.Insert("cab")
9     fmt.Println(cms.Count("xyz")) // does not exist
10    fmt.Println(cms.Count("abc")) // overestimates
11 }

Cuckoo filter

Cuckoo filter improves upon the design of the Bloom filter by offering deletion, limited counting, and bounded false-positive probability, while still maintaining a similar space complexity. This works on partial-key cuckoo hashing by using two (can be generalized for K) hashes and a map for each hash. Insertion involves trying with the first hash and, on collision, moving the existing element to the next hash and inserting in its place. This is why it’s called a “cuckoo filter,” because it is similar in action to the way a cuckoo lays its eggs in another bird’s nest and removes the existing eggs.

The cuckoo filter’s process of attempting insertion, looking for collision, moving the existing element to the next hash, and inserting in its place continues until it hits a cycle (where elements are moved from bucket to bucket on collisions during insertion). It may also fail after hitting a maximum number of retries, which can happen when the load factor becomes too high. In both cases, the filter must be recreated and the keys reinserted. Because of this, the insertion can take O(N) time (worst case), making insertion an amortized constant time operation. This is the primary difference between cuckoo and Bloom filters: with cuckoo filters. Searching and deletion take constant time, even in the worst case.

HyperLogLog

HyperLogLog is less like a data structure and more like an algorithm used to determine the number of distinct elements in a multiset or data stream. It indicates the number of unique keys present in a multiset or the cardinality of a multiset. The algorithm works on the following observation, which considers a binary number seen in a randomly distributed stream:

  1. The probability of a number like 1XXXXXXX is 0.5 (1 of 2)
  2. The probability of a number like 01XXXXXX is 0.25 (1 of 4)
  3. The probability of a number like 001XXXXX is 0.125 (1 of 8)
  4. The probability of a number like 0000001X is 0.008 (1 of 128)

Insertion involves hashing the key, partitioning it into sub-streams/buckets, and keeping track of the maximum number of leading zeros in each stream/bucket. The number of elements in a stream/bucket is proportional to the function pow (2, max number of leading zeros). Finding cardinality involves taking the harmonic mean of all the max values across streams and raising it to the power of two. LogLog and SuperLogLog are algorithms based on the same idea. They only vary in the way the final cardinality estimation is done. The difference is listed in the comparison section.

Comparisons: Bloom filter vs. the other probabilistic data structures

  1. Count-Min Sketches are essentially the same data structure as the counting Bloom filter. However, they are used differently and therefore sized differently: a Count-Min Sketch typically has a sublinear number of cells related to the desired approximation quality of the sketch, while a counting Bloom filter is more typically sized to match the number of elements in the set.
  2. Cuckoo filters are an enhancement over the Bloom filter since they support deletion. In addition, for applications that store many items and target moderately low false-positive rates, cuckoo filters can achieve lower space overhead than space-optimized Bloom filters. Also, the Bloom filter never fails insertion and eventually leads to a 100% false-positive rate, whereas the cuckoo filter fails insertion after a threshold load factor by getting into an infinite loop.
  3. The LogLog, SuperLogLog, and HyperLogLog all work on the same concept. They only differ in their ways to reduce the variance. HyperLogLog has the lowest error rate of the three:
    • LogLog takes the geometric mean of all the values across buckets.
    • SuperLogLog discards the top 30% of the buckets to remove outliers and takes the geometric mean of the values of the other 70% buckets.
    • HyperLogLog takes the harmonic mean of all the values across buckets.

Space and Time Complexity

Data Structure Solves Variables Used Time Complexity Space Complexity
Bloom filter Set membership problem M bit array and K hash functions (K is usually 1) O(K) O(MK)
Counting Bloom filter Set membership problem M bit array and K hash functions (K is usually 1) O(K) O(MK)
Count-Min Sketch Event occurrence problem M bit array and K hash functions O(K) O(KM)
Cuckoo filter Set membership problem M bit array and K hash function (K is usually 2) O(K) O(KM)
HyperLogLog Count distinct problem or
cardinality estimation problem
K streams/buckets and N elements O(K) O(KloglogN)

Probabilistic data structures are undergoing a lot of research and they are being widely used in applications with high scale. This blog covers just a few. You can read about others here. Let me know your thoughts in the comments and stay tuned for my subsequent posts.

Some available Bloom filter implementations:

  1. https://llimllib.github.io/bloomfilter-tutorial/
  2. https://bdupras.github.io/filter-tutorial/
  3. https://hur.st/bloomfilter/
  4. https://github.com/Callidon/bloom-filters

Related Articles

Comments

Leave a Reply

Your email address will not be published. Required fields are marked *