Remix.run Logo
xnorswap 9 hours ago

> slow, particularly for short keys such as integers

An interesting thing about the CLR is that the default hash for integers is:

    h(x) = x.
Which as well as being collision-free, also avoids the trap of a slow default hash.
tialaramex 4 hours ago | parent | next [-]

The identity function. Several C++ implementations choose this too. It is very cheap but has obvious problems which may make you wish you'd paid more up front.

If you want this in Rust you can use: https://crates.io/crates/integer-hasher -- being able to swap out the hasher is something C++ folks have kinda wanted to do for like 15-20 years but they have never got it over the line.

I have some benchmarks I've been noodling with for a while, measuring different ways to do a hashtable. I call the one where we just do this operation but otherwise use the ordinary Rust Swiss tables - IntHashMap in this code. For some operations, and especially at small sizes, IntHashMap is significantly better

But, for other operations, and especially at large sizes, it's worse.

For example suppose we have 10 K->V pairs in our hash table, when we're looking for one of the ten K values, we're much faster in IntHashMap. However, if it's not there, IntHashMap is slightly slower. Further, if we have the first ten thousands numbers instead of ten thousand random numbers, like if we'd made a hash table of serial numbers for something - we're ten times worse in IntHashMap and that's because our hashing function though fast, is very bad at its job.

mrec 6 hours ago | parent | prev | next [-]

Do you know how it then maps hashes to buckets? I'd expect the natural occurrence of integer values to be very non-uniform and heavily biased toward the small end.

Tuna-Fish 7 hours ago | parent | prev [-]

But if you know the hash table implementation, it makes forcing collisions trivial for user-generated input, leading to easy denial of service attacks.

The first requirement for safe hashtable implementations is a secret key, which makes it impossible to know the hash value for an external observer. (Or even, to know the relative hash value between any two inputs.)

josefx 5 hours ago | parent | next [-]

> The first requirement for safe hashtable implementations is a secret key,

Some languages use different approaches. The buckets in a Java HashMap turn into a sorted tree if they grow too large. Then there are trivial solutions like adding an input limit for untrusted users. Either way works, is secure and doesn't depend on a secret key.

xnorswap 7 hours ago | parent | prev [-]

Did you reply to the wrong comment, I'm not sure it follows from what I posted?

You can't force a collision for the function f(x) = x, by definition it has no collisions. It's not just collision-resistant, it's actually collision proof!

hash(x) = x is of course not a cryptographic hash, but it also has the advantage that it's unlikely to be confused for one if anyone looks at the output.

kstrauser 6 hours ago | parent | next [-]

Nitpick: it’s still going to collide when taking the modulo to decide what bucket to put it in. If you know that the hashmap has 1000 buckets, and you spam it with multiples of 1000, you could make it put every key into the same bucket and have O(1) lookups turn into O(n). That doesn’t happen with real hash functions with random seeds.

xnorswap 6 hours ago | parent [-]

Oh right, got you.

afishhh 6 hours ago | parent | prev [-]

It's not about collisions of the hash function itself.

Every hashtable implementation will put the hash value through some sort of modulo because you generally don't want to waste memory storing the key 5726591 at index 5726591 in an array.

So if you know how the implementation works and the hash function is predictable you can keep providing the program with values that will consistently go into the same bucket resulting in linear lookups and insertions.