Remix.run Logo
bryanrasmussen 2 days ago

a hash function that produce hashes that are already in the hash table should, IMO, not be called a hash function.

I get why technically it is a hash function, but still, no.

baruch 2 days ago | parent | next [-]

It is mathematically impossible for a proper hash function (one with an output range smaller than its input range) to not have collisions. The proof uses the Pigeon Hole Principle https://en.wikipedia.org/wiki/Pigeonhole_principle

bryanrasmussen 2 days ago | parent [-]

ok damn, I did not know this, obviously. Thanks.

I guess I've never actually had this problem because was always hashing things that were static, or specialty cases like password hashes where the salt obviously guarantees uniqueness.

2 days ago | parent | next [-]
[deleted]
kro 2 days ago | parent | prev | next [-]

It's very very unlikely to get collisions there, but still not impossible. Whenever you map data of arbitrary length (infinite possibilities) to a limited length collisions are possible.

eru 2 days ago | parent [-]

Doesn't even have to be arbitrary length.

Whenever you map into a smaller space, you get collisions. The bigger space doesn't have to be infinite.

bryanrasmussen 2 days ago | parent [-]

with a password you may be mapping into a smaller space or a bigger space, because what you want is to get them all same length, but yeah you may in some cases be mapping into a smaller space, hadn't thought of that, although I sort of also think it is unlikely.

But there it doesn't matter anyway because the password is put together with the email to identify the user so in practicality passwords will never collide even if they could in theory.

eru 2 days ago | parent [-]

For passwords: the input _space_ is bigger. That doesn't say anything about the length of any particular password.

> But there it doesn't matter anyway because the password is put together with the email to identify the user so in practicality passwords will never collide even if they could in theory.

For passowrds, you are not worried so much about two users accidentally getting the same hash, you are worried about people finding a pre-image that hashes to the same output as your user's password.

bryanrasmussen 2 days ago | parent [-]

>For passowrds, you are not worried so much about two users accidentally getting the same hash

right I was thinking I've probably personally never had a situation where a collision would have affected anything, but then I thought of one, when I had to do image hashing it was a potential problem.

ogogmad 2 days ago | parent | prev [-]

Let's consider a hash table with an allocation of 1MB, which is about 2^20 bytes. Assume also that each entry occupies a byte. Assuming that the hash function's values are distributed randomly, the probability of there being a collision with only 1000 entries is approximately 38% = 1-(2^20)!/(2^20 - 1000)!/(2^20)^1000. See the "Birthday Problem".

ahazred8ta 2 days ago | parent | prev | next [-]

A perfect hash function https://en.wikipedia.org/wiki/Perfect_hash_function has to be specially constructed for each desired set of inputs. Generic hash functions cannot be 'perfect'.

phinnaeus 2 days ago | parent | prev [-]

Here is a hash function that does not have hash collisions:

  fn hash(data):
    return data
Charon77 2 days ago | parent | next [-]

Well it no longer constrains the data in a fixed output length.

dbdr 2 days ago | parent [-]

Sure, but if you constrain to fixed output length, you will definitely have collisions (Pigeon Hole Principle). There's no way around that.

hsbauauvhabzb 2 days ago | parent [-]

padding with zeroes to a fixed length and prepending the original length would suffice, but you’d have to have a fixed length of double infinity to account for both the length information and the hash information, and the hash is less efficient than the original information.

tux3 2 days ago | parent | prev | next [-]

That is a function, but not a hash function!

oliver236 2 days ago | parent | prev [-]

what programming language is this?