The Bloom filter is a space efficient, probabilistic data structure – used to test whether an item does not belong to a collection. Most people use the definition that it tests whether an item is in a collection, but I think the latter explanation is more practical . The catch with Bloom filters are – they can tell you with good accuracy that an item is not within a collection, but it sometimes lies about an item being in a collection – called a false positive. Bloom filters are mostly used as time savers, for example…you have a database on the internet with thousands of items, instead of wasting precious resources checking for items that might not even be in the database, we use the Bloom Filter – Some websites actually send a bloom filter to the client side browser, saving bandwidth as an added benefit.
Internally Bloom filters use a bit array, and multiple different hash functions. Lets say for instance we have a bit array of a 100 elements, and 3 hash functions. We want to insert the word “Singapore” into the filter, so we pass it through hash 1 – returns 33, hash 2 – returns 7 and hash 3 – returns 22. Next we go to each of those elements in the array and set them to 1, and that’s it – the word has now been included. Now to test whether the word might be in the collection, we pass it through the three hash functions, and check those elements in the bit array, if all 3 elements are set to 1, then its a positive, and if any one of the elements are set to zero, it’s a negative, meaning the word is not in the collection. Below is an example…
The efficiency of the filter relies upon the number of bits in the array, number of individual hash functions, and most importantly the quality of the hash functions. In a real world example, we would use 14 hash functions, and a bit array of millions of elements. xxHash is one of the best hash functions available, other examples are MurmurHash, FNV Hash, Jenkins Hashes, pseudo-number generators etc.. Uniformity in the Distribution of the hash functions output being a key factor. Here are some simple formulas for calculating the parameters of a bloom filter:
m – number of elements in bit array
k – number of hash functions
n – number of items in collection
p – desired false positive probability // 0.0 – 1.0
^ – power
m = -((n*ln(p))/(ln(2)^2)); k = (m/n) * ln(2); n = (m * ln(2))/k; p = e^(-(m/n) * (ln(2)^2)); // estimated probability of false positives // lets do a test, i know n = 1000, // and i desire a p = 0.01 : m = 9585.058 = 9586 k = 6.644 = 7 // so for 100 items in a collection, // a desired probability of 1% false positives // you need a bit array of 9586 elements and 7 hash functions */
Lastly I would like to add, I’ve experimented with a number of my own devised hash functions, using xor and multiplication with good results. I have also used only one hash function with different seed values successfully, so go ahead and experiment. I hope to include source code soon in the future, give me feedback on which language you would like it written in. I’ve created a small Bloom filtered spell checker in Delphi.
Thank you for reading – I’ll keep on editing this tutorial in the future.