lenkite 7 hours ago
Maybe I am missing something, but why didn't they just leverage dynamic libraries?
btilly 5 hours ago
When I was at Google, on an SRE team, here is the explanation that I was given.

Early on, Google used dynamic libraries. But weird things happen at Google scale. For example, Google has a dataset known, for fairly obvious reasons, as "the web". Basically any interesting computation with it takes years, enough to be a multiple of the expected lifespan of a random computer. Therefore, during that computation, you have to expect every random thing that tends to go wrong to go wrong, up to and including machines dying.

One of the weird things that becomes common at Google scale is cosmic bit flips. With static binaries, you can figure out that something went wrong, kill the instance, launch a new one, and you're fine. That machine will later launch something else and also be fine.

But what happens if there was a cosmic bit flip in a dynamic library? Everything launched on that machine will be wrong. This has to get detected, and then the processes have to be killed and relaunched. Since this keeps happening, that machine is always sitting there lightly loaded, ready for new stuff to launch. New stuff that... winds up broken for the same reason! Often the killed process will relaunch on the bad machine and fail again! This will continue until someone reboots the machine.

Static binaries are wasteful. But that waste is less of a problem for the infrastructure than detecting and fixing this particular condition. And, according to SRE lore circa 2010, this was the actual reason for the switch to static binaries.

And then they realized all sorts of other benefits, like having a good upgrade path for what would normally be shared libraries.
tmoertel 7 hours ago
One reason is that using static binaries greatly simplifies the problem of establishing Binary Provenance, upon which security claims and many other important things rely. In environments like Google’s, it's important to know that what you have deployed to production is exactly what you think it is. For more, see: https://google.github.io/building-secure-and-reliable-system...
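
As a rough illustration of why a single static binary makes this easier (a sketch only, not Google's actual provenance tooling): if everything the program runs is in one file, one digest recorded at build time covers all of it, and a deploy-time check reduces to comparing that digest. The function names and the idea of a "recorded build digest" here are assumptions made for the example.

```python
import hashlib
import hmac


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """Hypothetical deploy-time check: does the binary on disk match the
    digest that was recorded when it was built?"""
    return hmac.compare_digest(sha256_of_file(path), expected_hex.lower())
```

With dynamically linked binaries, the same check would have to cover the executable plus every shared library it might load on the target machine, which is where the bookkeeping gets harder.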