IgorPartola 8 hours ago

See, I would not reboot the server before figuring out what is happening. You lose a lot of info by doing that, and the worst thing that can happen is that the problem goes away for a little bit.

gerdesj 7 hours ago | parent | next [-]

To be fair, turning it off and on again is unreasonably effective.

I recently diagnosed and fixed an issue with Veeam backups that suddenly stopped working partway through the usual window and stayed broken from that point on. This particular setup has three sites (prod, my home and DR) and five backup proxies. Anyway, I read logs and Googled somewhat. I rebooted the backup server - no joy, even though it looked like the issue was there. I restarted the proxies and things started working again.

The error was basically: there are no available proxies, even though they were all available (not actually working, but not giving off "not working" vibes either).

I could have bothered looking into what went wrong, but life is too short. This is the first time that pattern has happened to me (I'll note it down mentally, and it was logged in our incident log).

So, OK, I'll agree that a reboot should not generally be the first option. Whilst sciencing it or nerding harder is the purist approach, often a cheeky reboot gets the job done. However, do be aware that a Windows box will often decide to install updates if you are not careful 8)

rurban 39 minutes ago | parent | next [-]

Turning it off and on again is risky. I recently upgraded a robot in Australia, had problems with systemd, so I turned it off. I then had to wait a few weeks until it could be turned on again, because tailscaled was not set up persistently, the routing was not set up properly (over a phone), the machine had some problems, ...

High risk, low reward. But of course it's the ultimate test of whether it's set up properly.

But on the other hand, with my tiny hard real-time embedded controllers, a power cycle is the best option. No persistent state, fast power up, reboot in milliseconds. Every little SW error causes a reboot, no problem at all.

akerl_ 6 hours ago | parent | prev [-]

No, you didn’t diagnose and fix an issue.

You just temporarily mitigated it.

abrookewood 5 hours ago | parent [-]

Sometimes that is enough - especially for home machines etc.

akerl_ 5 hours ago | parent [-]

I’ve got no problem with somebody choosing to mitigate something instead of fixing it. But it’s just incorrect to apply a blind mitigation and declare that you’ve diagnosed the problem.

butvacuum 2 hours ago | parent [-]

What's the ROI on that?

-- leadership

galleywest200 7 hours ago | parent | prev | next [-]

My job as a DevOps engineer is to ensure customer uptime. If rebooting is the fastest fix, we do that. Figuring out the why is the primary developers' job.

This is also a good reason to log everything all the time in a human readable way. You can get services up and then triage at your own pace after.
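
To sketch what I mean by human readable (this is just Python's stdlib logging as an illustration; the logger name and fields are made up):

    import logging

    # Plain, greppable lines: timestamp, level, component, message.
    # Anything you'll want during triage (IDs, peer names) goes into the
    # message itself, so a human can read it without special tooling.
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )

    log = logging.getLogger("trunks")  # hypothetical component name
    log.info("re-registering trunk customer_id=%s reason=%s", 1234, "no audio")

In practice this would go to a file or the journal rather than stderr, but the point is the lines stay readable after the fact.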

My job may be different from others' as I work at an ITSP and we serve business phone lines. When business phones do not work, it is immediately clear to our customers. We have to get them back up not just for their business, but so they can dial 911.

tolciho 3 hours ago | parent [-]

> This is also a good reason to log everything all the time in a human readable way. You can get services up and then triage at your own pace after.

Unless, hypothetically, the logging velocity tickles kernel bugs and crashes the system, but only when the daemon is started from cron and not elsewhere. Hypothetically, of course.

Or when the system stops working two weeks after launch because "logging everything" has filled up the disk, and took two weeks to do so. This also means important log messages (perhaps that the other end is down) might be buried in 200 lines of log noise and backtrace spam per transaction, which in turn might delay debugging and fixing, or even isolating which end of the tube the problem resides at.
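
One rough way to cap the disk damage, again assuming Python's stdlib logging (the file name and sizes are arbitrary):

    import logging
    from logging.handlers import RotatingFileHandler

    # Total footprint is roughly maxBytes * (backupCount + 1),
    # so "logging everything" can't quietly fill the disk over two weeks.
    handler = RotatingFileHandler(
        "app.log",            # hypothetical file name
        maxBytes=50_000_000,  # rotate at ~50 MB
        backupCount=5,        # keep 5 rotated files, drop anything older
    )
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    )

    log = logging.getLogger("app")
    log.addHandler(handler)
    log.setLevel(logging.INFO)
    log.warning("peer unreachable, retrying")  # the line you don't want buried

It doesn't fix the noise problem, but at least the box stays up.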

butvacuum 7 hours ago | parent | prev | next [-]

Most failstates aren't worth preserving in an SMB environment. In larger environments, or ones equipped for it, a snapshot can be taken before rebooting, should the issue repeat.

Once is chance, twice is coincidence, three times makes a pattern.

Ferret7446 6 hours ago | parent [-]

Alternatively: if it doesn't happen again, it's not worth fixing; if it does happen again, you can investigate it then.

ValdikSS 7 hours ago | parent | prev [-]

I've debugged so many issues in my life that sometimes I'd prefer things to just work, and if a reboot helps to at least postpone the problem, I'd choose that :D

butvacuum 2 hours ago | parent [-]

Seriously, and sometimes it's just not worth investigating. Which means it's never going to get fixed, and I'd rather go home than create another ticket that'll just get stale and age out.