simoncion 2 hours ago

You will love SystemD [0] timers until they fuck you over in an entirely inscrutable way and the SystemD maintainers don't care to either fix the problem or update the docs to warn of the shortcoming.

One of our customers called in with a production-down incident caused by a full disk. We got a copy of the VM and took a look. Investigation revealed that / was full because /var/log was full and that our 'logrotate' timer unit that was scheduled to run once a day had run either exactly never or exactly once... I can't remember which. Further investigation revealed no difference in software load or configuration between this VM and a VM that had a functional logrotate timer unit. Exactly one VM out of hundreds of identical VMs at this site (and many multiples of that at other customers' sites) was affected by this. Advising the customer to clear out /var/log and reboot did not unstick 'logrotate', and none of the diagnostics or fixes we could find anywhere unstuck it. Once "systemd-crond" decided to never schedule this job ever again, it stuck to that decision.

After a lot of searching, we found an open bug report from a year or three prior where someone reported exactly the same symptoms and was scheduling a unit with pretty much the same set of unit configuration flags that we were using. The conversation from the core devs ran through the pattern that one gets used to seeing when one runs into SystemD bugs that are caused by extremely complex unanticipated interactions between parts of the project: "That's not a bug, only an idiot would want that to work.", "Oh, we don't document that that's not supposed to work?", "Wow, okay, yeah, I can see how that maybe should work. That it doesn't sure does seem weird.", "Having said that, I don't know if it's supposed to work, or if it's unsupported. Someone should really either document that or fix it."... and then the behavior is neither fixed nor documented. [1] Absent any actual explanation for the failure, we ended up swizzling the options in our 'logrotate' unit and praying that satisfied whatever gremlin arose from the depths to trouble our customer.

SystemD contains an enormous (and ever-growing) amount of accidental complexity, and has a set of core maintainers who are generally uninterested in either documenting the places where one or more complex systems bind together to cause stop-the-world problems or fixing the systems involved so that they don't bind up. It's a fine project until it's very, very suddenly not, and then you're absolutely SOL. If you're lucky, you can shuffle around what you're doing [2] and hope that avoids the problem. [3]

[0] Some folks use the spelling "SystemD" to mock the project. I use the spelling "SystemD" to distinguish between "the entire systemd project" and systemd(1). I do this because some folks will make a claim like "systemd is very, very small and self-contained. I don't understand why anyone would say otherwise.", but what they are actually saying is that systemd(1) is a fairly small program that doesn't do all that much when run as PID 1. It sucks minor amounts of ass that the project and the program it runs as PID 1 share the same name, but what can you do?

[1] No, I don't have a link to the open bug report. This was more than a year ago, so the bug ID has been long forgotten.

[2] The term of art for this practice is "wave a dead chicken at it".

[3] Plus, like, even disregarding most of the rest of my report... how in the hell do you design a cron that knows a job is scheduled to be run periodically, can tell you how long it has been since it last ran, but never manages to run it? To me, that's unforgivable. It's a "You had one job!"-tier cockup.
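
For illustration only, here is a minimal sketch of the kind of external staleness check that would flag a periodic job that has stopped running. It is not anything from the incident above: the function name `timer_is_stale`, its parameters, and the one-hour grace period are all hypothetical, and how you obtain the last-run time (for example, from the LAST column of `systemctl list-timers`) is left out on purpose.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def timer_is_stale(
    last_trigger: Optional[datetime],
    now: datetime,
    interval: timedelta,
    grace: timedelta = timedelta(hours=1),  # hypothetical default slack
) -> bool:
    """Return True if a periodic job appears to have missed its schedule.

    last_trigger is None when the job has never run at all.
    """
    if last_trigger is None:
        # "Exactly never" counts as stale.
        return True
    # Stale if more than one interval (plus slack) has passed since the last run.
    return now - last_trigger > interval + grace


if __name__ == "__main__":
    now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
    daily = timedelta(days=1)
    print(timer_is_stale(None, now, daily))
    print(timer_is_stale(now - timedelta(days=3), now, daily))
    print(timer_is_stale(now - timedelta(hours=12), now, daily))
```

A check like this only tells you that the job is stuck, not why, so it would have turned a full-disk outage into an alert rather than explaining the underlying bug.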