Remix.run Logo
Simplest Hash Functions(purplesyringa.moe)
29 points by ibobev 5 days ago | 19 comments
jandrewrogers an hour ago | parent | next [-]

A “simplest” hash function is completely dependent on what you are using the hash function for and the guarantees you want a hash function to make. An optimal permutation of an integer is different from a small string hash is different from a block checksum. Literally, you are optimizing the algorithm for entirely different properties. No algorithm can satisfy all of them even approximately.

The full scope of things hash functions are commonly used for requires at least four algorithms if you care about performance and optimality. It is disconcertingly common to see developers using hash algorithms in contexts where they are not fit for purpose. Gotta pick the right tool for the job.

benob 19 minutes ago | parent | prev | next [-]

I just realized that a hash function is nothing less than the output of a deterministic random number generator xored with some data

Sharlin 5 minutes ago | parent | prev | next [-]

  def hash(str):
    len(str)
O(1), baby!
teo_zero 28 minutes ago | parent | prev | next [-]

While I generally like to reinvent the wheel, for hash functions I strongly recommend to use a proved good one. Djb2 by the venerable Daniel Bernstein satisfies all the requirements of TFA.

  h = 5381
  while still has data:
    h = h * 33 + next_byte()
  return h
PS of course if you think the multiplication is overkill, consider that it is nothing more than a shift and an addition.
kuzivaai 25 minutes ago | parent | prev | next [-]

The point about hash tables using top bits instead of bottom bits is the kind of thing that feels obvious once someone says it and yet here we are. Genuine question: have you seen any real-world hash table implementations that actually do this, or is it purely "this is what we should have done 40 years ago"?

Charon77 2 hours ago | parent | prev | next [-]

> Like addition

I'm perplexed to the claim that addition is cheaper than XOR, especially since addition is built upon XOR, am I missing anything? Is it javascript specific?

pjscott 42 minutes ago | parent | next [-]

The wording was a bit unclear. The previous paragraph mentions wanting something cheaper than "those pesky XORs and multiplications". The multiplication is the expensive part; the (very cheap) XORs are just mildly annoying because you have to think about what they're doing.

4k0hz 42 minutes ago | parent | prev [-]

At least on x86, multiple additions and multiplications can be done with a single `lea` instruction so it's preferable to XOR. Though I have no idea about other architectures, compiler implementations, any interpreters...

ygra 33 minutes ago | parent [-]

That only helps with multiplications by statically known word sizes (4x, 8x, etc.) and not arbitrary x·y. It can help with many smaller constant multipliers if the complete is clever, but it has to be known at compile time.

johanvts an hour ago | parent | prev | next [-]

This is a excellent article for anyone looking for some more in-depth analysis of tabulation based hashing methods: https://arxiv.org/abs/1505.01523

bryanrasmussen 2 hours ago | parent | prev [-]

a hash function that produce hashes that are already in the hash table should, IMO, not be called a hash function.

I get why technically it is a hash function, but still, no.

baruch an hour ago | parent | next [-]

It is mathematically impossible for a proper hash function (one with an output range smaller than its input range) to not have collisions. The proof uses the Pigeon Hole Principle https://en.wikipedia.org/wiki/Pigeonhole_principle

bryanrasmussen 35 minutes ago | parent [-]

ok damn, I did not know this, obviously. Thanks.

I guess I've never actually had this problem because was always hashing things that were static, or specialty cases like password hashes where the salt obviously guarantees uniqueness.

ahazred8ta 2 hours ago | parent | prev | next [-]

A perfect hash function https://en.wikipedia.org/wiki/Perfect_hash_function has to be specially constructed for each desired set of inputs. Generic hash functions cannot be 'perfect'.

phinnaeus 2 hours ago | parent | prev [-]

Here is a hash function that does not have hash collisions:

  fn hash(data):
    return data
Charon77 2 hours ago | parent | next [-]

Well it no longer constrains the data in a fixed output length.

dbdr an hour ago | parent [-]

Sure, but if you constrain to fixed output length, you will definitely have collisions (Pigeon Hole Principle). There's no way around that.

hsbauauvhabzb 28 minutes ago | parent [-]

padding with zeroes to a fixed length and prepending the original length would suffice, but you’d have to have a fixed length of double infinity to account for both the length information and the hash information, and the hash is less efficient than the original information.

oliver236 40 minutes ago | parent | prev [-]

what programming language is this?