| ▲ | artur44 4 days ago |
| I always find it interesting how often the simplest hash table layouts end up performing best in real workloads. Once you avoid pointer chasing and keep everything in a compact array, CPU caches do most of the heavy lifting. It’s also a good reminder that clarity of layout often beats more “clever” designs, especially when the dataset fits comfortably in memory. |
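For concreteness, a minimal sketch of the kind of layout being described (not anything from the thread): keys and values sit in one flat, power-of-two-sized array with linear probing, so a lookup walks adjacent slots instead of chasing per-entry heap pointers. The table size, hash multiplier, and the use of key 0 as an empty marker are illustrative assumptions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 1024u          /* power of two so the index can be masked */
#define EMPTY_KEY  0u             /* key 0 reserved as the "empty slot" marker */

/* One flat array of slots: probing touches consecutive cache lines
   instead of dereferencing per-entry heap pointers. */
struct slot { uint64_t key; uint64_t val; };
static struct slot table[TABLE_SIZE];

static bool table_lookup(uint64_t key, uint64_t *out)
{
    size_t i = (size_t)(key * 0x9E3779B97F4A7C15ull) & (TABLE_SIZE - 1);
    while (table[i].key != EMPTY_KEY) {      /* linear probing */
        if (table[i].key == key) {
            *out = table[i].val;
            return true;
        }
        i = (i + 1) & (TABLE_SIZE - 1);
    }
    return false;                            /* hit an empty slot: not present */
}
```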
|
| ▲ | hinkley 3 days ago | parent | next [-] |
| Until you get high memory contention from the rest of the code. Once eviction gets high, you get some pretty counterintuitive improvements by fixing things that seem like they shouldn’t need to be fixed. My best documented case was a 10x speedup from removing a double lookup that was killing caches. |
| ▲ | crest 3 days ago | parent [-] |
| My best improvement was just bit-interleaving both axes of a 2x32-bit integer coordinate (aka a z-curve). I got a factor of ~100x (yes, factor, not percent) throughput improvement over locality in only one dimension. All it took was ~10 lines of bit twiddling. The runtime went from a bit above 300ms to slightly less than 3ms. |
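A rough illustration of the bit twiddling described above (a standard Morton-code encoder, not crest's actual code): each 32-bit coordinate has its bits spread apart and the two results are interleaved, so points close together in 2D tend to land close together in the 1D key.

```c
#include <stdint.h>

/* Spread the bits of a 32-bit value so they occupy the even bit
   positions of a 64-bit value. */
static uint64_t spread_bits(uint32_t v)
{
    uint64_t x = v;
    x = (x | (x << 16)) & 0x0000FFFF0000FFFFull;
    x = (x | (x << 8))  & 0x00FF00FF00FF00FFull;
    x = (x | (x << 4))  & 0x0F0F0F0F0F0F0F0Full;
    x = (x | (x << 2))  & 0x3333333333333333ull;
    x = (x | (x << 1))  & 0x5555555555555555ull;
    return x;
}

/* Interleave x and y into a single Morton (z-curve) index. */
static uint64_t morton_encode(uint32_t x, uint32_t y)
{
    return spread_bits(x) | (spread_bits(y) << 1);
}
```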
| ▲ | hinkley 3 days ago | parent | next [-] |
| End to end gets weird. I was asked to look at an admin page; nobody could figure out why it took 30s. Literally the first thing I tried got it under 4s, and the second down to 3s. It was pulling the same list of rows twice, applying two filters, and then taking the intersection. I changed the signature to send the list as input instead of the query constraints. Then I changed them to avoid the intersect. If you had asked me to bet on which one would have the bigger impact, I would have split the difference. |
| My second favorite was similar. Two functions making a call instead of sharing the answer. The profiler said 10% cumulative. I removed half of that. Instead of 5% I got 20%, which just demonstrates how much data a profiler cannot show you. |
| ▲ | throw-the-towel 3 days ago | parent | prev [-] |
| I'm wondering how you folks even come up with this kind of optimisation. |
| ▲ | hinkley 3 days ago | parent [-] |
| Sheer stubbornness. Profilers lie, and some more than most. I’ve gotten 3x from code with a perfectly rectangular profile output. |
| Part of it comes down to a trick game devs steal from finance: give each task a budget and hold it to that budget, even if it’s not the tallest tent pole. You should not spend 10% of your response time on telemetry and logging combined, yet I pulled 10% of TTFB out of just the logging and telemetry code on one project. It was a frog-boiling situation: every new epic used the new code, and determining the cumulative cost wasn’t easy. |
|
| ▲ | saltcured 3 days ago | parent | prev | next [-] |
| To me, these sorts of examples always seem contrived. To a first order, I've never had a real hash table problem that was on machine-word keys. I've nearly always had variable-length strings or other complex structures being hashed, not their handles. Back in my early career in C, this would be a generic API to hash and store void pointers, but the pointers were not what was being hashed. The domain-specific hash function needed to downcast and perform the appropriate remote memory access to fetch the variable-length material that was actually being hashed. |
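A sketch of the kind of C API being described, with hypothetical names: the table stores void* handles, and a caller-supplied hash callback downcasts the handle and hashes the variable-length material behind it (here with FNV-1a), rather than hashing the pointer value itself.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record type: the table stores handles (pointers),
   but the key material lives behind the pointer. */
struct record {
    char  *name;                             /* variable-length key */
    size_t name_len;
    /* ... payload ... */
};

/* Shape of the hash callback a generic void*-keyed table would take. */
typedef uint64_t (*hash_fn)(const void *handle);

/* FNV-1a over arbitrary bytes. */
static uint64_t fnv1a(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint64_t h = 0xcbf29ce484222325ull;      /* FNV-1a 64-bit offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 0x100000001b3ull;               /* FNV-1a 64-bit prime */
    }
    return h;
}

/* Domain-specific callback: downcast the void* handle and hash the
   variable-length material it points to, not the pointer itself. */
static uint64_t hash_record(const void *handle)
{
    const struct record *r = handle;
    return fnv1a(r->name, r->name_len);
}
```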
|
| ▲ | jasonwatkinspdx 3 days ago | parent | prev [-] |
I'm a big fan of the basic power-of-two-choices hash table design. It's simple to understand and implement, has reasonable constant factors, and hits high load factors on real-world datasets. You can use more elaborate probe and relocation schemes, but just choosing the less full bucket and resizing if both choices are full gets you surprisingly far. |
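A minimal sketch of the two-choice insert being described (illustrative, not a production implementation): two independent hashes pick two fixed-capacity buckets, the key goes into the emptier one, and a full pair of buckets signals the caller to grow the table to the next power of two and rehash. The bucket size and hash mixers are arbitrary choices here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BUCKET_SLOTS 8                  /* small fixed-capacity buckets */

struct bucket {
    uint8_t  count;
    uint64_t keys[BUCKET_SLOTS];
};

struct table {
    size_t         nbuckets;            /* always a power of two */
    struct bucket *buckets;
};

/* Two cheap, independent-ish hashes of the same key (arbitrary odd
   64-bit multipliers; any decent mixers would do). */
static size_t hash1(uint64_t k, size_t mask) { return (size_t)((k * 0x9E3779B97F4A7C15ull) >> 32) & mask; }
static size_t hash2(uint64_t k, size_t mask) { return (size_t)((k * 0xC2B2AE3D27D4EB4Full) >> 32) & mask; }

/* Insert into whichever of the two candidate buckets is less full.
   Returns false when both buckets are full, telling the caller to
   grow to the next power of two, rehash, and retry. */
static bool table_insert(struct table *t, uint64_t key)
{
    size_t mask = t->nbuckets - 1;
    struct bucket *a = &t->buckets[hash1(key, mask)];
    struct bucket *b = &t->buckets[hash2(key, mask)];
    struct bucket *target = (a->count <= b->count) ? a : b;

    if (target->count == BUCKET_SLOTS)
        return false;                   /* both choices full: caller resizes */

    target->keys[target->count++] = key;
    return true;
}
```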
| ▲ | kccqzy 3 days ago | parent [-] |
| A power-of-two length is also the natural choice for a growable linear array when the designer has no idea how many elements there will be. |
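A minimal sketch of that pattern (illustrative names, not from the thread): the backing capacity starts at a small power of two and doubles whenever the array fills, so it stays a power of two no matter how many elements eventually arrive.

```c
#include <stdint.h>
#include <stdlib.h>

/* Growable array whose capacity is always a power of two and doubles
   whenever it runs out of room. */
struct vec {
    uint64_t *data;
    size_t    len;
    size_t    cap;                      /* 0 before the first push, then a power of two */
};

static int vec_push(struct vec *v, uint64_t x)
{
    if (v->len == v->cap) {
        size_t new_cap = v->cap ? v->cap * 2 : 8;
        uint64_t *p = realloc(v->data, new_cap * sizeof *p);
        if (!p)
            return -1;                  /* allocation failed; vec left unchanged */
        v->data = p;
        v->cap  = new_cap;
    }
    v->data[v->len++] = x;
    return 0;
}
```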
|