btilly | 5 hours ago
When I was at Google, on an SRE team, here is the explanation I was given.

Early on, Google used dynamic libraries. But weird things happen at Google scale. For example, Google has a dataset known, for fairly obvious reasons, as "the web". Basically any interesting computation over it takes years, long enough to be a multiple of the expected lifespan of a random computer. So during that computation you have to expect every random thing that tends to go wrong to actually go wrong, up to and including machines dying.

One of the weird things that becomes common at Google scale is cosmic-ray bit flips. With static binaries, you can figure out that something went wrong, kill the instance, launch a new one, and you're fine. That machine will later launch something else and also be fine. But what happens if there was a cosmic bit flip in a dynamic library? Everything launched on that machine will be wrong. This has to get detected, then the processes killed and relaunched. Since this keeps happening, that machine is always there, lightly loaded, ready for new stuff to launch. New stuff that... winds up broken for the same reason! Often the killed process will relaunch on the bad machine and fail again. This continues until someone reboots the machine.

Static binaries are wasteful. But they aren't as problematic for the infrastructure as detecting and fixing this particular condition. According to SRE lore circa 2010, this was the actual reason for the switch to static binaries. The other benefits, like having a good upgrade path for what would normally be shared libraries, were only realized afterwards.
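To make the difference in blast radius concrete, here is a minimal sketch (not Google's tooling; the use of `ldd`, the chunk size, and the CLI shape are all illustrative assumptions) of the set of on-disk files whose corruption can break a process. For a fully static binary that set is a single file you can checksum before relaunching; for a dynamically linked one it includes every shared object the loader maps, and those files are shared with everything else launched on the machine.

```python
# Hypothetical sketch: enumerate the on-disk files whose corruption would
# break a process, for a static vs. dynamically linked binary.
# Uses `ldd` output, so it only works on ELF/glibc systems.
import hashlib
import subprocess
import sys


def file_sha256(path):
    """Checksum one file; a single flipped bit changes the digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def integrity_set(binary):
    """Return {path: digest} for the binary plus every shared object ldd
    says it maps. For a fully static binary this is just the binary itself;
    for a dynamic one, a corrupted entry here breaks every process on this
    machine that maps the same library."""
    paths = {binary}
    out = subprocess.run(["ldd", binary], capture_output=True, text=True)
    if "not a dynamic executable" not in out.stdout + out.stderr:
        for line in out.stdout.splitlines():
            if "=>" in line:
                target = line.split("=>", 1)[1].split("(")[0].strip()
                if target:
                    paths.add(target)
    return {p: file_sha256(p) for p in paths}


if __name__ == "__main__":
    for path, digest in integrity_set(sys.argv[1]).items():
        print(digest, path)
```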
ambrosio | 3 hours ago | parent
> But what happens if there was a cosmic bit flip in a dynamic library?

I think there were more basic reasons we didn't ship shared libraries to production.

1. They wouldn't have been "shared", because every program was built from its own snapshot of the monorepo and would naturally have slightly different library versions. Nobody worried about ABI compatibility when evolving C++ interfaces, so (in general) it wasn't possible to reuse a .so built at another time. Thus, dynamic linking wouldn't actually have saved any disk space or memory.

2. When I arrived in 2005, the build system was embedding absolute paths to shared libraries into the final executable. So it wasn't possible to take a dynamically linked program, copy it to a different machine, and execute it there, unless you used a chroot or container. (And at that time we didn't even use mount namespaces on prod machines.) This was one of the things we had to fix to make it possible to run tests on Forge. (The first sketch after this comment shows what such machine-specific paths look like in an ELF binary.)

3. We did use shared libraries for tests, and this revealed that ld.so's algorithm for symbol resolution was quadratic in the number of shared objects. Andrew Chatham fixed some of this (https://sourceware.org/legacy-ml/libc-alpha/2006-01/msg00018...), and I got the rest of it eventually; but there was a time before GRTE when we didn't have a straightforward way to patch the glibc in prod. (The second sketch after this comment models the shape of that cost.)

That said, I did hear a similar story from an SRE about fear of bit flips being the reason they wouldn't put the gws command line into a flagfile. So I can imagine it being a rationale for not even trying to fix the above problems in order to enable dynamic linking.

> Since this keeps happening, that machine is always there, lightly loaded, ready for new stuff to launch. New stuff that... winds up broken for the same reason!

I did see this failure mode occur for similar reasons, such as corruption of the symlinks in /lib. (google3 executables were typically not totally static, but still linked libc itself dynamically.) But it always seemed to me that we had way more problems attributable to kernel, firmware, and CPU bugs than to SEUs.
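On point 2, a hedged sketch (the parsing and heuristics are my own assumptions, not how the google3 build or Forge actually checked this) of what "absolute paths baked into the executable" looks like in the ELF dynamic section, and why such a binary can't simply be copied to another machine:

```python
# Rough sketch of a "can I copy this binary to another box?" check.
# Relies on binutils' readelf; the heuristics are illustrative only.
import re
import subprocess
import sys


def dynamic_entries(binary):
    """Yield (tag, value) pairs from the ELF dynamic section."""
    out = subprocess.run(["readelf", "-d", binary],
                         capture_output=True, text=True, check=True)
    for line in out.stdout.splitlines():
        m = re.search(r"\((\w+)\)\s+(.*)", line)
        if m:
            yield m.group(1), m.group(2)


def relocatability_report(binary):
    problems = []
    for tag, value in dynamic_entries(binary):
        # A NEEDED entry containing '/' is a library path baked into the
        # executable; the loader will look for exactly that path on the
        # target machine, which may not have it.
        if tag == "NEEDED" and "/" in value:
            problems.append(f"hard-coded library path: {value}")
        # RPATH/RUNPATH pin the search path to directories that may not
        # exist outside the build host (unless they use $ORIGIN).
        if tag in ("RPATH", "RUNPATH") and "$ORIGIN" not in value:
            problems.append(f"{tag} pinned to {value}")
    return problems


if __name__ == "__main__":
    issues = relocatability_report(sys.argv[1])
    print("\n".join(issues) if issues else "no obvious machine-specific paths")
```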
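And on point 3, a toy cost model (the numbers and structure are invented purely for illustration; this is not glibc's actual data structure) of why the old ld.so behavior hurt: each object's undefined symbols were each resolved by probing every object in the global search scope, so startup cost grows with the square of the number of loaded objects.

```python
# Toy cost model of the pre-fix ld.so behavior: each undefined symbol in
# each loaded object is resolved by probing every object in the global
# search scope in turn. All numbers here are invented for illustration.
def relocation_cost(num_objects, undef_syms_per_object):
    probes = 0
    for _obj in range(num_objects):                # every loaded .so + the executable
        for _sym in range(undef_syms_per_object):  # every symbol it must resolve
            probes += num_objects                  # worst case: probe the whole scope
    return probes


for n in (10, 100, 1000):
    print(f"{n:5d} objects -> {relocation_cost(n, 50):>12,} lookup probes")
```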
dh2022 | 5 hours ago | parent
In Azure, which I think is at Google scale, everything is dynamically linked. In fact, a lot of Azure is built on C#, which doesn't even support static linking. The claim that static linking is necessary for scaling doesn't pass the smell test for me.