The power of two random choices (2012)(brooker.co.za)
69 points by signa11 8 days ago | 11 comments
jauntywundrkind 4 days ago | parent | next [-]

(2012)

Amazing technique. Previous submissions, and another good thread on load balancing via the power of two random choices:

https://news.ycombinator.com/item?id=39283595
https://news.ycombinator.com/item?id=24877341
https://news.ycombinator.com/item?id=37143376

wonger_ 4 days ago | parent [-]

A helpful visualization, from a past discussion: https://xcancel.com/grantslatton/status/1754912113246798036

miggy 3 days ago | parent | prev | next [-]

Willy from HAProxy has a good write-up on this. In their benchmarks, least-connections usually beat P2C, but P2C was never the worst and is arguably a saner default when least-connections isn’t available. The article link: https://www.haproxy.com/blog/power-of-two-load-balancing
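The selection step itself is tiny, which is part of the appeal. A minimal sketch of power-of-two-choices by least connections, with a hypothetical `Backend` class and `active_connections` counter standing in for whatever state a real proxy tracks:

```python
import random

class Backend:
    """Hypothetical backend handle tracking in-flight requests."""
    def __init__(self, name):
        self.name = name
        self.active_connections = 0

def pick_p2c(backends, rng=random):
    """Power of two choices: sample two distinct backends uniformly at
    random and return the one with fewer active connections."""
    a, b = rng.sample(backends, 2)
    return a if a.active_connections <= b.active_connections else b

# A heavily loaded node loses every comparison against an idle peer,
# so it stops receiving new traffic until it drains.
pool = [Backend(f"srv{i}") for i in range(5)]
pool[0].active_connections = 10
chosen = pick_p2c(pool)
```

The caller is expected to increment `active_connections` on dispatch and decrement on completion; the comparison then steers new requests away from busy nodes without ever scanning the whole pool.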

phyzome 4 days ago | parent | prev | next [-]

I found less-than-great results in a simulation where there's a slight persistent difference between two of the options: https://www.brainonfire.net/blog/2019/07/21/load-balancing-b... (as part of a larger study on healthchecks that Don't Suck).

yuliyp 3 days ago | parent | next [-]

I think the simulation's claim that pick-2 can send 2.5x as much traffic to the most loaded server as to the least loaded is a bit misleading: if the load metric is completely random, that might happen, but the more the metric correlates with actual load, the better the result. Also, rather than looking at the ratio of most loaded to least loaded, it might be better to look at the ratio of most loaded to average: that is, how much extra work we sent to a poor server. By that measure, pick-2 has an absolute cap of 2x the average load on any one server.
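The 2x cap follows from a counting argument: with two distinct uniform picks out of n servers, any given server is a candidate for only about 2/n of requests, so even if it won every comparison it could receive at most roughly twice the average 1/n share. A quick sanity check of that bound (server indices and counts here are made up for illustration):

```python
import random

def candidate_fraction(n, trials, rng):
    """Fraction of requests for which server 0 is one of the two
    distinct random picks -- an upper bound on its traffic share."""
    hits = sum(1 for _ in range(trials) if 0 in rng.sample(range(n), 2))
    return hits / trials

rng = random.Random(1)
# With n=10 servers, server 0 is a candidate ~2/10 = 20% of the time,
# so even winning every comparison it gets at most ~2x the 10% average.
frac = candidate_fraction(10, 100_000, rng)
```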

phyzome 2 days ago | parent [-]

Real-world case where I've observed these load characteristics: a cluster of three Redis nodes, one of which is the primary and therefore has slightly (but persistently) worse latency. Pick-2 would send significantly less read traffic to that node. Like you say, it's no worse than a 2x difference, but I'd prefer better balancing than that.

(Conversely, pick-2 can reduce traffic to a node with terrible performance by at most 2x, which is not awesome.)

miggy 3 days ago | parent | prev [-]

Excellent read. It highlights key aspects like health checks, server restarts, warm-up, and load shedding, all of which make load balancing, an already hard problem, even harder.

reilly3000 4 days ago | parent | prev | next [-]

Incidentally this makes me think about how little I’ve needed to think about load balancing for a long time. It’s one of those cloud primitives that make sense as a default for most use cases and just works.

atorodius 4 days ago | parent | prev | next [-]

neat trick indeed. would be cool to do the math and get an analytical formula of mean queue time given cache refresh for a given k, under some mild assumptions.

akoboldfrying 4 days ago | parent [-]

That may have been done in the underlying paper by Mitzenmacher et al., but I haven't checked.

I'm more confident that that paper established that firing n requests at n servers results in a max server load proportional to log(log(n)) with high probability, vs. proportional to log(n)/log(log(n)) for purely random assignment -- IOW an exponential improvement in max server load over random.
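A quick balls-into-bins simulation illustrates the gap (a sketch, not the paper's analysis; picks here are with replacement for simplicity):

```python
import random

def max_load(n, choices, rng):
    """Throw n balls into n bins; each ball samples `choices` bins
    uniformly at random and lands in the least loaded of them.
    Returns the maximum bin load."""
    loads = [0] * n
    for _ in range(n):
        picks = [rng.randrange(n) for _ in range(choices)]
        loads[min(picks, key=loads.__getitem__)] += 1
    return max(loads)

rng = random.Random(42)
n = 100_000
one = max_load(n, 1, rng)  # purely random placement
two = max_load(n, 2, rng)  # power of two choices
# `two` comes out much smaller than `one`, matching the
# log(log(n)) vs log(n)/log(log(n)) behaviour.
```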
