eli | 2 days ago
An upgrade process involves heavy CPU use, disk reads/writes, and at least a few power cycles in a short time period. Depending on what OP was doing on it otherwise, it could've been the highest temperature the device had ever seen. It's not so crazy. My guess would've been SSD failure, which would fit with a problem that appears after lots of writes. In the olden days I used to cross my fingers when rebooting spinning-disk servers with very long uptimes, because it was known there was a chance they wouldn't come back up even though they had been running fine.

jonathanlydall | 2 days ago
Not for a server, but many years ago my brother had his work desktop fail after it was cold-booted for the first time in a very long time. Normally he would leave his work machine turned on but locked when leaving the office. The office was having electrical work done and asked that all employees unplug their machines over the weekend, just in case of a surge or something. On the Monday my brother plugged his machine back in and it wouldn't turn on. Initially the IT guy remarked that my brother hadn't followed the instructions to unplug it. He later retracted the comment after it was determined that the power supply capacitors had gone bad a while back, but the issue with them was not apparent until they had a chance to cool down.

GCUMstlyHarmls | 2 days ago
> In the olden days I used to cross my fingers when rebooting spinning disk servers with very long uptimes because it was known there was a chance they wouldn't come back up even though they were running fine.

HA! Not just me then! I still have an uneasy feeling in my gut doing reboots, especially on AM5 where the initial memory training can take 30s or so.

I think most of my "huh, it's broken now?" experiences as a youth were probably the actual install getting wonky, rather than the few rare "it exploded" hardware failures after a reboot, though that definitely happened too.

zelon88 | 2 days ago
This, 100%. I'd like to add my reasoning for a similar failure of an HP ProLiant server I encountered.

Sometimes hardware can fail during a long uptime and not become a problem until the next reboot. Consider a piece of hardware with 100 features. During typical use, the hardware may only use 50 of those features. Imagine one of the unused features has failed. This would not cause a catastrophic failure during typical use, but on startup (which rarely occurs) that feature is necessary and the system will not boot without it. If it could get past the boot phase, it could still perform its task, because the damaged feature is not needed afterwards. But it can't get past the boot phase, where the feature is required.

Tl;dr: the system actually failed months ago and the user didn't notice, because the missing feature was not needed again until the next reboot.

startupsfail | 2 days ago

Is there a good reason why upgrades need to stress-test the whole system? Can't they go slowly, throttling resource usage to background levels?

They involve heavy CPU use and stress the whole system completely unnecessarily; the system can easily hit the highest temperature the device has ever seen during these stress tests. If something fails or gets corrupted under that strain, it's a system-level corruption...

Incidentally, Linux kernel upgrades are no better. During DKMS updates the CPU load skyrockets, and then a reboot is always sketchy. There's no guarantee that something won't go wrong; a Secure Boot issue after a kernel upgrade in particular can be a nightmare.
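As a rough illustration of what "throttling to background levels" could look like, here is a minimal sketch, assuming a Linux host; the structure is hypothetical and not how any particular updater is actually written. The idea is simply that the updater lowers its own CPU scheduling priority before doing heavy work:

```c
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    /* Ask the scheduler to treat this process as lowest priority
       (nice value 19) so interactive work keeps winning the CPU. */
    if (setpriority(PRIO_PROCESS, 0, 19) != 0)
        perror("setpriority");

    /* ... heavy upgrade work (unpacking, compiling, copying files)
       would run here at background priority ... */

    return 0;
}
```

This only addresses CPU contention; disk I/O and thermals would still need their own handling, which is part of why "just throttle it" is less trivial than it sounds.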
zelon88 | 2 days ago

To answer your question, it helps to explain what the upgrade process entails. In the case of Linux DKMS updates: DKMS is recompiling your installed kernel modules to match the new kernel. Sometimes a kernel update will also update the system compiler, and in that instance it can be beneficial for performance or stability to have all your existing modules recompiled with the new version of the compiler. The new kernel comes with a new build environment, which DKMS uses to recompile existing kernel modules to ensure stability and consistency with that new kernel and build system.

Also, kernel modules and drivers may have many code paths that should only be run on specific kernel versions. This is called "conditional compilation", and it is a technique programmers use to develop cross-platform software. Think of it as one set of source files that generates wildly different binaries depending on the machine that compiled it. By recompiling the source code after the new kernel is installed, the resulting binary may be drastically different from the one compiled for the previous kernel. Source code compiled against a 10-year-old kernel might contain different code paths and routines than the same source code compiled against the latest kernel. Compiling source code is incredibly taxing on the CPU and takes significantly longer when CPU usage is throttled; compiling large modules on extremely slow systems could take hours.

Managing hardware health and temperatures is mostly a hardware-level decision controlled by firmware on the hardware itself. That is usually abstracted away from software developers, who need to be able to be certain that the machine running their code is functional and stable enough to run it. This is why we have "minimum hardware requirements." Imagine if every piece of software contained code to monitor and manage CPU cooling: you would have programs fighting each other over hardware priorities, and different systems for control, some more effective and secure than others. Instead the hardware is designed to do this job intrinsically, and developers are free to focus on the output of their code on a healthy, stable system. If a particular system is not stable, that falls on the administrator of that system. By separating the responsibility between software, hardware, and implementation, we have clear boundaries between who cares about what, and a cohesive operating environment.
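To make the conditional-compilation point concrete, here is a minimal sketch of a kernel module whose source selects different code paths depending on the kernel headers it is built against; the 6.2 version cutoff and the "demo" name are made up purely for illustration:

```c
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/version.h>

static int __init demo_init(void)
{
#if LINUX_VERSION_CODE >= KERNEL_VERSION(6, 2, 0)
    /* This branch only exists in binaries built against 6.2+ headers. */
    pr_info("demo: built against a 6.2+ kernel\n");
#else
    /* Older headers produce a different binary from the same source file. */
    pr_info("demo: built against a pre-6.2 kernel\n");
#endif
    return 0;
}

static void __exit demo_exit(void)
{
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");
```

When DKMS rebuilds a module like this against the new kernel's headers after an upgrade, the preprocessor can pick a different branch, which is why the rebuilt binary can differ substantially from the one built for the old kernel.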
startupsfail | a day ago

The default could be that a background upgrade should not be a foreground stress test.

Imagine you are driving a car and from time to time, without any warning, it suddenly starts accelerating and decelerating aggressively. Your powertrain, engine, and brakes are getting wear and tear, oh, and at random the car also spins out and rolls, killing everyone inside (data loss). This is roughly how current unattended upgrades work.

SecretDreams | 2 days ago
> Depending what OP was doing on it otherwise, it could've been the highest temperature the device had ever seen. It's not so crazy.

Kind of big doubt. This was probably not slamming the hardware.

refulgentis | 2 days ago

That was absolutely slamming the hardware.

(Source: worked on Android, and GP's comments re: this are 100% correct. I'd need a bit more, well, anything, to even come around to the idea that the opposite is even plausible. Best steelman is naïveté, like "aren't updates just a few mvs and a reboot?")