Bloom Filters – Explained

A Bloom filter is a space-efficient, probabilistic data structure used to test whether an item does not belong to a collection. Most people use the definition that it tests whether an item is in a collection, but I think the first framing is more practical. The catch with Bloom filters is that they can tell you with certainty that an item is not in a collection, but they sometimes lie about an item being in one – this is called a false positive. Bloom filters are mostly used as time savers. For example, say you have an online database with thousands of items: instead of wasting precious resources querying for items that might not even be in the database, you consult a Bloom filter first. Some websites even send the Bloom filter to the client-side browser, saving bandwidth as an added benefit.

Internally, a Bloom filter uses a bit array and multiple different hash functions. Let’s say we have a bit array of 100 bits and 3 hash functions. We want to insert the word “Singapore” into the filter, so we pass it through hash 1, which returns 33; hash 2, which returns 7; and hash 3, which returns 22. Next we set each of those positions in the array to 1, and that’s it – the word has now been included. To test whether the word might be in the collection, we pass it through the same three hash functions and check those positions in the bit array. If all 3 bits are set to 1, it’s a positive; if any one of them is 0, it’s a negative, meaning the word is definitely not in the collection. Below is an example…

[Image: Bloom filter diagram]
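
To make the insert and lookup steps concrete, here is a minimal sketch in Python (my own illustration – the three “hash functions” are just SHA-256 with different seed prefixes, reduced modulo the array size, an illustrative stand-in rather than a recommendation):

import hashlib

M = 100                       # bits in the array
K = 3                         # number of hash functions
bits = [0] * M

def positions(item):
    # yield K array indices for the given item
    for seed in range(K):
        digest = hashlib.sha256(f"{seed}:{item}".encode()).hexdigest()
        yield int(digest, 16) % M

def insert(item):
    for i in positions(item):
        bits[i] = 1           # set each hashed position to 1

def might_contain(item):
    # all K bits set -> "possibly in"; any bit 0 -> "definitely not in"
    return all(bits[i] == 1 for i in positions(item))

insert("Singapore")
print(might_contain("Singapore"))   # True  (possibly in the collection)
print(might_contain("Oslo"))        # almost certainly False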

The efficiency of the filter depends on the number of bits in the array, the number of hash functions, and, most importantly, the quality of the hash functions. A real-world filter might use something like 14 hash functions and a bit array of millions of bits. xxHash is one of the best hash functions available; other examples are MurmurHash, FNV, and the Jenkins hashes – even a seeded pseudo-random number generator can serve. Uniformity of the distribution of the hash function’s output is the key factor. Here are some simple formulas for calculating the parameters of a Bloom filter:

m – number of bits in the array
k – number of hash functions
n – number of items in collection
p – desired false positive probability  // 0.0 – 1.0
^ – power

m = -((n*ln(p))/(ln(2)^2));
k = (m/n) * ln(2);
n = (m * ln(2))/k;
p = e^(-(m/n) * (ln(2)^2));    // estimated false-positive probability, assuming the optimal k above

// let’s do a test: I know n = 1000,
// and I desire p = 0.01 :

m = 9585.058 ≈ 9586
k = 6.644 ≈ 7

// so for 1000 items in a collection and
// a desired false-positive probability of 1%,
// you need a bit array of 9586 bits and 7 hash functions
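
The same calculation as a small Python helper (the function name is my own, for illustration); running it with n = 1000 and p = 0.01 reproduces the numbers above:

import math

def bloom_parameters(n, p):
    # bit-array size m and hash count k for n items at
    # false-positive probability p, per the formulas above
    m = -(n * math.log(p)) / (math.log(2) ** 2)
    k = (m / n) * math.log(2)
    return math.ceil(m), round(k)

print(bloom_parameters(1000, 0.01))   # (9586, 7)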

Lastly, I would like to add that I’ve experimented with a number of hash functions of my own devising, using XOR and multiplication, with good results. I have also successfully used a single hash function with different seed values (a sketch of one such scheme follows below), so go ahead and experiment. I hope to include source code soon – give me feedback on which language you would like it written in. I’ve already built a small Bloom-filter-based spell checker in Delphi.
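
One well-known way to derive many indices from a single hash is the “double hashing” trick of Kirsch and Mitzenmacher: split one digest into two base values h1 and h2 and combine them as h1 + i*h2. A minimal sketch, assuming SHA-256 as the base hash (one possibility, not the only way to seed a single function):

import hashlib

def k_indices(item, k, m):
    # derive k indices from a single SHA-256 digest:
    # two 64-bit halves h1, h2 combined as h1 + i*h2 (mod m)
    digest = hashlib.sha256(item.encode()).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big") | 1   # force an odd stride
    return [(h1 + i * h2) % m for i in range(k)]

print(k_indices("Singapore", k=7, m=9586))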

Thank you for reading – I’ll keep improving this tutorial over time.

Links:
http://en.wikipedia.org/wiki/Bloom_filter
http://www.jasondavies.com/bloomfilter/

7 thoughts on “Bloom Filters – Explained”

  1. Nice and brief introduction to Bloom filters. I like it. However, it appears the image isn’t working.
    Regarding what language I would like to see it implemented in, how about Julia?


  2. So logically I can conceive of and understand scenarios where a Bloom filter is useful (e.g. a fast filter for Chrome’s malicious-website list). However, hypothetically, if you know you have a set that is relatively small, say the 1000 items you give as an example, would it not be more time/space efficient to have a direct one-to-one mapping of that set to bit positions in the array (conceptually, a hash which produces a unique output per item in the set)? Your bit array would only be 1000 bits long, and the hash/indexing algorithm would probably be much quicker, as it would identify the key bits that differentiate the items in the set.

    On sets that are many orders of magnitude larger and have no structure to the data (appear random), I can understand this sort of filter being useful.

    Some further insight on this would be interesting.

