ltbarcly3 | 3 days ago
That is not how hash tables work. Hash tables effectively randomize keys into buckets. Whether there is a hash collision is independent of the workload; it only has to do with the number of entries in the hash table vs. the threshold to resize it. This is what I meant by pathological data. If you had weird data that somehow, over and over, got hash tables right up to their fill limit and kept them there, then you would have slightly more collisions than in an average use case. However, you would have to construct weird data to do this. It wouldn't be "more ingestion". The thing PGO optimizes would be individual branches in the hash table code, for example the branch that checks whether the key in a bucket is equal to the key you are looking for.
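A minimal sketch of the kind of branch being described, assuming a simple chained hash table (all names and the layout here are illustrative, not any particular implementation mentioned in the thread):

    #include <stddef.h>
    #include <string.h>

    /* Illustrative chained hash table entry; collisions are handled
       by linking entries that land in the same bucket. */
    struct entry {
        const char   *key;
        void         *value;
        struct entry *next;   /* collision chain */
    };

    struct table {
        struct entry **buckets;
        size_t         nbuckets;  /* kept below a fill threshold by resizing */
    };

    /* FNV-1a style hash: effectively randomizes keys into buckets. */
    static size_t hash_str(const char *s) {
        size_t h = (size_t)1469598103934665603ULL;
        while (*s) { h ^= (unsigned char)*s++; h *= (size_t)1099511628211ULL; }
        return h;
    }

    void *lookup(const struct table *t, const char *key) {
        size_t idx = hash_str(key) % t->nbuckets;
        for (struct entry *e = t->buckets[idx]; e != NULL; e = e->next) {
            /* This equality check is the sort of branch PGO can specialize:
               if profile data shows collisions are rare, the compiler can
               lay out the "match on first probe" path as the fall-through. */
            if (strcmp(e->key, key) == 0)
                return e->value;
        }
        return NULL;  /* key not present */
    }

Whether the loop body runs once or several times depends on the load factor (entries vs. bucket count), not on whether the workload is read-heavy or write-heavy.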
menaerus | 2 days ago | parent
> Whether there is a hash collision is independent of the workload

In a read-only workload, searching through the key-value space, there can't be a collision, no?