xkcd #2934: Bloom Filter

Alt text:

Sometimes, you can tell Bloom filters are the wrong tool for the job, but when they’re the right one you can never be sure.

Image

Image alternative text

Venat0r, 20 days ago (edited 20 days ago)

Not to be confused with Bloom shader effect.

reply

report

activity

copy /kbin url

copy original url

open original url

Loading...

LPThinker, 23 days ago

For anyone interested in learning more about bloom filters, this is a technical but extremely accessible and easy to follow introduction to them, including some excellent interactive visualizations: https://samwho.dev/bloom-filters/

reply

report

activity

copy /kbin url

copy original url

open original url

Loading...

gbuttersnaps, 22 days ago

This was a great read, thanks for sharing!

reply

report

activity

copy /kbin url

copy original url

open original url

Loading...

Audalin, 23 days ago

There’s a recent algorithm using somewhat similar ideas for approximate counting of unique objects in a stream with constant memory:

quantamagazine.org/computer-scientists-invent-an-…

reply

report

activity

copy /kbin url

copy original url

open original url

Loading...

Mubelotix, 23 days ago

I think I like hash-based probabilistic counting better, but this is interesting

reply

report

activity

copy /kbin url

copy original url

open original url

Loading...

Xeroxchasechase, 23 days ago

Is’nt bloom filter a shader that makes the picture look hazy and bright?

reply

report

activity

copy /kbin url

copy original url

open original url

Loading...

rockSlayer, 23 days ago

A bloom filter is a data structure that is most useful in creating a spell check algorithm

reply

report

activity

copy /kbin url

copy original url

open original url

Loading...

hydroptic, 22 days ago

That’s definitely not what they’re most useful for. I mean, you probably can use a bloom filter for implementing spell check, but saying that’s where they’re most useful severely misses the point of probabilistic set membership queries.

Bloom filters and their relatives are great when you have a huge set of values – eg. 100s of millions of user IDs in some database – and you want to have a very fast way of checking whether some value might be in that set, without having to query the database. Naturally this assumes that you’ve prepopulated a bloom filter with whatever values you need to be checking.

If the result of the bloom filter query is “nope”, you know that the value’s definitely not in the set, but if the result is “maybe” then you can go ahead and double-check by querying the database. This means that the vast majority of checks don’t have to hit that slow DB at all, and even though you’ll get some false positives this’ll still be much much much faster than having to go through that DB every time.

reply

report

activity

copy /kbin url

copy original url

open original url

Loading...

jaybone, 22 days ago

In this example, what would you use to prepopulate the filter?

reply

report

activity

copy /kbin url

copy original url

open original url

Loading...

hydroptic, 22 days ago

Which example do you mean?

If you meant my user ID example, you’d prepopulate the bloom filter with existing user IDs on eg. service startup or whatever, and then update the filter every time a new user ID is added – keeping in mind that the false positive rate will grow as more are added, and that at some point you may need to create a new filter with a bigger backing bit array

reply

report

activity

copy /kbin url

copy original url

open original url

Loading...

jaybone, 22 days ago

So you’re just putting a bunch of values in memory that you can access quickly, like similar to a hash set contains(), maybe hoping for O(1) time. But other than that there’s no trick to it?

reply

report

activity

copy /kbin url

copy original url

open original url

Loading...

hydroptic, 22 days ago (edited 22 days ago)

Well, yes and no. With a straight-up hash set, you’re keeping set_size * bits_per_element bits plus whatever the overhead of the hash table is in memory, which might not be tenable for very large sets, but with a Bloom filter that has eg. ~1% false positive rate and an ideal k parameter (number of hash functions, see eg. the Bloom filter wiki article) you’re only keeping ~10 bits per element completely regardless of element size because they don’t store the elements themselves or even their full hashes – they only tell you whether some element is probably in the set or not, but you can’t eg. enumerate the elements in the set. As an example of memory usage, a Bloom filter that has a false positive rate of ~1% for 500 million elements would need 571 MiB (noting that although the size of the filter doesn’t grow when you insert elements, the false positive rate goes up once you go past that 500 million element count.)

Lookup and insertion time complexity for a Bloom filter is O(k) where k is the parameter I mentioned and a constant – ie. effectively O(1).

Probabilistic set membership queries are mainly useful when you’re dealing with ginormous sets of elements that you can’t shove into a regular in-memory hash set. A good example in the wiki article is CDN cache filtering:

Nearly three-quarters of the URLs accessed from a typical web cache are “one-hit-wonders” that are accessed by users only once and never again. It is clearly wasteful of disk resources to store one-hit-wonders in a web cache, since they will never be accessed again. To prevent caching one-hit-wonders, a Bloom filter is used to keep track of all URLs that are accessed by users. A web object is cached only when it has been accessed at least once before, i.e., the object is cached on its second request.

reply

report

activity

copy /kbin url

copy original url

open original url

Loading...

mindbleach, 22 days ago

It’s named after a guy - Dr. Burton Howard Filter.

Seriously though, a bunch of STEM concepts are coincidentally discovered by people with weirdly appropriate names. Airy disks cause photographic blur. Dove prisms resemble a dovetail joint but are pronounced doh-vay. Radon transforms in nuclear medicine involve no radiation.

reply

report

activity

copy /kbin url

copy original url

open original url

Loading...

Die4Ever, 23 days ago

en.m.wikipedia.org/wiki/Bloom_filter

reply

report

activity

copy /kbin url

copy original url

open original url

Loading...

Ephera, 23 days ago

That’s the bloom shader effect: en.wikipedia.org/wiki/Bloom_(shader_effect)

But yeah, some people might refer to that as “bloom filter”, although it’s not what’s meant here.

reply

report

activity

copy /kbin url

copy original url

open original url

Loading...

Xeroxchasechase, 23 days ago

Thanks

reply

report

activity

copy /kbin url

copy original url

open original url

Loading...

hydroptic, 23 days ago

Do you want to learn about probabilistic data structures?

maybe

no

reply

report

activity

copy /kbin url

copy original url

open original url

Loading...

randomaccount43543, 23 days ago

explainxkcd.com/2934

reply

report

activity

copy /kbin url

copy original url

open original url

Loading...

Add comment