Remix.run Logo
hijinks 2 days ago

is there a better way then bloom filters to handle needle in the haystack type searches where the haystack might be terabytes of data and you only want a few lines?

philipkglass 2 days ago | parent [-]

There are a lot of "better than Bloom" filters that work similarly in some aspects. I have used Cuckoo [1] and Ribbon [2] filters for Bloom-type applications. If you have an application where you do a lot of one kind of searching, it may also be worth implementing a specialized variant of a data structure. I needed a Cuckoo-type filter on the JVM but only for 64 bit integers and I was able to make a smaller, faster code base that was specialized to this data type instead of handling generic objects.

You need to know up front whether you need to be able to dynamically add entries to the filter or if your application can tolerate rebuilding the filter entirely whenever the underlying data changes. In the latter case you have more freedom to choose data structures; many of the modern "better than Bloom" filters are more compact but don't support dynamic updates.

[1] https://en.wikipedia.org/wiki/Cuckoo_filter

[2] https://engineering.fb.com/2021/07/09/core-infra/ribbon-filt...

hinkley 2 days ago | parent | next [-]

I wonder how often in the wild people are tuning for a 1% false positive rate versus a much lower one, like .1%. You do quickly reach data set sizes where even 1% introduces some strain on resources or responsiveness.

Cuckoo claims 70% of the size of bloom for the same error rate, and the space is logarithmic to the error rate. Looks like about 6.6 bits per record versus 9.56 bits for bloom at 1%. But at .5% error rate a cuckoo is 7.6 bpr. In fact you can get to about a .13% error rate for a cuckoo only a hair larger than the equivalent bloom filter (n^9.567 = 758.5)

FreakLegion a day ago | parent [-]

Cuckoo filters can do even better with the small adjustment of using windows instead of buckets. See "3.5-Way Cuckoo Hashing for the Price of 2-and-a-Bit": https://scispace.com/pdf/3-5-way-cuckoo-hashing-for-the-pric.... (This significantly improves load factors rather than changing anything else about the filter, and ends up smaller than the semi-sorted variant for typical configurations, without the rigmarole.)

My fairly niche use case for these kinds of data structures was hardware firewalls running mostly on SRAM, which needed a sub one-in-a-billion false positive rate.

hijinks 2 days ago | parent | prev [-]

thanks.. i'll read up into these.. always amazes me that companies like datadog somehow made log search quick