| ▲ | hazz99 7 hours ago |
| I’m sure this work is very impressive, but these QPS numbers don’t seem particularly high to me, at least compared to existing horizontally scalable service patterns. Why is it hard for the kube control plane to hit these numbers? For instance, postgres can hit this sort of QPS easily, afaik. It’s not distributed, but I’m sure Vitess could do something similar. The query patterns don’t seem particularly complex either. Not trying to be reductive - I’m sure there’s some complexity here I’m missing! |
|
| ▲ | phrotoma 6 hours ago | parent | next [-] |
| I am extremely Not A Database Person, but I understand that the rationale for Kubernetes adopting etcd as its preferred data store was more about its distributed consistency features and less about query throughput. etcd is slower because it's doing Raft consensus and flushing its WAL to disk on every write. Projects like kine let K8s users swap SQLite or Postgres in place of etcd, which (I assume, please correct me otherwise) would deliver better throughput since those backends don't need to perform consensus operations. https://github.com/k3s-io/kine |
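(To make the latency point concrete, here is a minimal sketch, not from the article, using the go.etcd.io/etcd/client/v3 package; the endpoint and key are illustrative. Every Put is proposed through Raft, replicated to a quorum, and fsynced to each member's WAL before it is acknowledged, and that round trip, rather than query complexity, is what caps etcd's per-write latency. A single-node SQL backend behind kine skips the consensus hop, though it still flushes its own WAL.)

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Connect to a local etcd member (endpoint is illustrative).
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// A single write, similar in shape to the API server renewing a Lease
	// object. It only returns after the entry is Raft-committed on a quorum
	// and flushed to the WAL, which is where the per-write latency lives.
	start := time.Now()
	if _, err := cli.Put(ctx, "/registry/leases/kube-node-lease/node-1", "renewed"); err != nil {
		log.Fatal(err)
	}
	log.Printf("quorum-committed write took %s", time.Since(start))
}
```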
| |
| ▲ | dijit 5 hours ago | parent | next [-] | | You might not be a database person, but you’re spot on. A well-managed HA PostgreSQL setup (active/passive) is going to run circles around etcd for kube control plane operations. The caveats are an increased risk of downtime and a much higher management overhead, which is why it's not the default. | |
| ▲ | Sayrus 5 hours ago | parent | prev [-] | | GKE uses Spanner as an etcd replacement. | | |
| ▲ | ZeroCool2u 5 hours ago | parent [-] | | But, and I'm honestly asking, as a GKE user you don't have to manage that Spanner instance, right? So in theory you should be able to just throw higher loads at it and Spanner should autoscale? | | |
| ▲ | DougBTX 4 hours ago | parent [-] | | Yes, from the article: > To support the cluster’s massive scale, we relied on a proprietary key-value store based on Google’s Spanner distributed database... We didn’t witness any bottlenecks with respect to the new storage system and it showed no signs of it not being able to support higher scales. | | |
| ▲ | ZeroCool2u 3 hours ago | parent [-] | | Yeah, I guess my question was a bit more nuanced. What I was curious about was whether they were fully relying on the normal autoscaling any customer would get, or manually scaling the Spanner instance in anticipation of the load. I guess it's unlikely we're going to get that level of detail from this article, though. |
|
| ▲ | PunchyHamster 4 hours ago | parent | prev | next [-] |
| It's not really bottlenecked by the store but by the calculations performed on each pod schedule/creation. It's basically "take the global state of node load and capacity, pick where to schedule the pod", and I'd imagine it's probably not running in parallel because that would be far harder to manage. |
| |
| ▲ | senorrib 3 hours ago | parent | next [-] | | Not a k8s dev, but I feel like this is the answer. K8s isn't usually just scheduling pods round-robin or at random. There's a lot of state to evaluate, and scheduling pods becomes an NP-hard problem similar to the bin-packing problem. I doubt the implementation tries to be optimal here, but it feels like a computationally heavy problem. | |
| ▲ | OvervCW 2 hours ago | parent [-] | | In what way is it NP-hard? From what I can gather, it just eliminates nodes where the pod wouldn't be allowed to run, calculates a score for each remaining node, and then randomly selects one of the nodes with the highest score, so it's trivially parallelizable. |
| |
| ▲ | __turbobrew__ 2 hours ago | parent | prev [-] | | The k8s scheduler lets you tweak how many nodes it looks at when scheduling a pod (percentageOfNodesToScore), so you can change how big the “global state” is as far as the scheduling algorithm is concerned. |
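(For anyone who wants to see the shape being described, here is a toy Go sketch of the filter → score → pick-the-highest flow, including sampling only a percentage of nodes in the spirit of percentageOfNodesToScore. The Node and Pod types and the scoring rule are invented for illustration; the real kube-scheduler runs many filter predicates and weighted scoring plugins rather than this single heuristic.)

```go
package main

import (
	"fmt"
	"math/rand"
)

// Node is a toy stand-in for a cluster node's free capacity.
type Node struct {
	Name    string
	FreeCPU int // millicores
	FreeMem int // MiB
}

// Pod is a toy stand-in for a pod's resource request.
type Pod struct {
	CPU int
	Mem int
}

// schedule mimics the filter -> score -> pick-highest flow: sample a fraction
// of nodes (like percentageOfNodesToScore), drop nodes that can't fit the pod,
// score the rest, and pick one of the top-scoring nodes at random.
func schedule(pod Pod, nodes []Node, percentToScore int) (string, bool) {
	// Sample a subset so per-pod scoring work doesn't grow with cluster size.
	sampleSize := len(nodes) * percentToScore / 100
	if sampleSize < 1 {
		sampleSize = 1
	}
	rand.Shuffle(len(nodes), func(i, j int) { nodes[i], nodes[j] = nodes[j], nodes[i] })
	sample := nodes[:sampleSize]

	bestScore := -1
	var best []string
	for _, n := range sample {
		// Filter: skip nodes where the pod can't run at all.
		if n.FreeCPU < pod.CPU || n.FreeMem < pod.Mem {
			continue
		}
		// Score: here, simply the leftover CPU after placement
		// (an invented heuristic standing in for the real plugins).
		score := n.FreeCPU - pod.CPU
		switch {
		case score > bestScore:
			bestScore, best = score, []string{n.Name}
		case score == bestScore:
			best = append(best, n.Name)
		}
	}
	if len(best) == 0 {
		return "", false // unschedulable within this sample
	}
	// Break ties among the highest-scoring nodes at random.
	return best[rand.Intn(len(best))], true
}

func main() {
	nodes := []Node{
		{"node-a", 4000, 16384},
		{"node-b", 500, 2048},
		{"node-c", 8000, 32768},
	}
	name, ok := schedule(Pod{CPU: 1000, Mem: 4096}, nodes, 100)
	fmt.Println(name, ok)
}
```

Sampling is what keeps the scoring cost roughly constant as the cluster grows, which is exactly the knob the parent comment is referring to.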
|
|
| ▲ | nonameiguess 2 hours ago | parent | prev [-] |
| It says in the blog that they require 13,000 queries per second to update lease objects, not that 13,000 is the total for all queries. I don't know why they cite that instead of the total, but etcd's normal performance testing indicates it can handle at least 50,000 writes per second and 180,000 reads: https://etcd.io/docs/v3.6/op-guide/performance/. So, without them saying what the real number is, I'm going to guess their overall reads and writes, beyond just lease updates, are much larger than those figures. |
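(A back-of-the-envelope way to read that lease figure, assuming the kubelet's default Lease renewal interval of 10 seconds: each node writes its Lease once per interval, so lease write QPS ≈ node count / renewal interval. Working backwards, 13,000 lease updates per second × 10 s per renewal ≈ 130,000 nodes, and that is before counting leader-election leases or any of the rest of the API traffic.)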