Surely a 50% warning alarm on disk usage covers this without manual intervention?

theshrike79 a day ago | parent | next [-]

Depends. A Kubernetes container might have only a few megabytes of disk space, because it shouldn't need it.

Except that one time when .NET decides that the incoming POST is over some magic limit and it doesn't do the processing in-memory like before, but instead has to write it to disk, crashing the whole pod. Fun times.

Also my Unraid NAS has two drives in "WARNING! 98% USED" alert state. One has 200GB of free space, the other 330GB. Percentages in integers don't work when the starting number is too big :)
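
To make the percentage problem concrete, here is a minimal sketch comparing an integer "percent used" figure with a check on absolute free space. The 10 TB drive size and the 50 GB floor are assumptions chosen for illustration; the comment above only gives the free space and the 98% figure.

```python
import shutil

GB = 10**9

def used_percent(total_bytes: int, free_bytes: int) -> int:
    """Integer percent used, roughly how a dashboard might display it."""
    return (total_bytes - free_bytes) * 100 // total_bytes

def low_on_space(free_bytes: int, min_free_bytes: int = 50 * GB) -> bool:
    """Alert on absolute free space instead of a percentage (50 GB floor is an assumption)."""
    return free_bytes < min_free_bytes

def free_bytes_on(path: str) -> int:
    """Free bytes on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free

# Hypothetical 10 TB drive with 200 GB free: the percentage looks alarming,
# while the absolute check says there is plenty of room.
total, free = 10_000 * GB, 200 * GB
print(used_percent(total, free))  # 98
print(low_on_space(free))         # False
```

On a real host, `low_on_space(free_bytes_on("/"))` would run the same check against the live filesystem.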

	majormajor 6 hours ago | parent [-]
		The "ballast file" idea doesn't really change that spill-to-disk crash, as far as I can tell. You have to delete it manually, and the pod has already crashed by the time you realize it. It seems like the sort of thing that only makes sense in an extra-paranoid scenario: "I know my cheapskate boss won't have larger drives ready to go (or be willing to pay to expand it in a cloud scenario), and he insists that the alarm not go off until 95%, but it'll be my fault if we have a bad incident we can't recover from quickly, so I'm gonna give myself some headroom by padding things a bit."

coredog64 19 hours ago | parent | prev | next [-]

You don't want an alarm on a usage threshold, you want a linear regression that predicts when utilization will cross a threshold. Then you set your alarms for "How long does it take me to remediate this condition?"
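
A rough sketch of that idea, assuming you already have a list of `(timestamp, used_bytes)` samples from somewhere. The function names and the sample format are made up for illustration; this is a plain least-squares line fit, not any particular monitoring tool's implementation.

```python
def seconds_until_full(samples, capacity_bytes):
    """Fit used_bytes = slope * t + intercept by least squares and return the
    predicted seconds from the last sample until capacity is reached.
    Returns None if there is too little data or usage is not growing."""
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None  # flat or shrinking usage: no predicted fill time
    intercept = mean_u - slope * mean_t
    t_full = (capacity_bytes - intercept) / slope
    return max(0.0, t_full - samples[-1][0])

def should_alert(samples, capacity_bytes, remediation_seconds):
    """Alert when the predicted time to full is shorter than the time needed to fix it."""
    remaining = seconds_until_full(samples, capacity_bytes)
    return remaining is not None and remaining < remediation_seconds

# Usage grows by 1 byte per second from 100 bytes; capacity is 1000 bytes.
samples = [(0, 100), (60, 160), (120, 220)]
print(seconds_until_full(samples, 1000))  # 780.0
```

The alert threshold is then expressed in time (`remediation_seconds`) rather than in bytes or percent.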

	dotancohen 4 hours ago | parent [-]
		That's far more complicated and fragile. Where are you storing this log of disk usage? If you already have some external time series database then this is already a solved problem. But for a single server, desktop, or embedded device you'll need a database or text log, a cron job to measure it, and another script to parse, make predictions, and then raise alerts. And a single large dump to disk, like some daemon suddenly bugging out and writing incessantly to logs, will render all that moot anyway.

jcims a day ago | parent | prev | next [-]

If the alarms are reliably configured, confirmed to be working, low noise enough to be actioned, etc etc.

And of course there's nothing to say that both of these things can't be done simultaneously.

evil-olive 17 hours ago | parent | prev | next [-]

> Surely a 50% warning alarm on disk usage covers this without manual intervention?

surely you don't need a fire extinguisher in your kitchen, if you have a smoke detector?

a "warning alarm" is a terrible concept, in general. it's a perfect way to lead to alert fatigue.

over time, you're likely to have someone silence the alarm because there's some host sitting at 57% disk usage for totally normal reasons and they're tired of getting spammed about it.

even well-tuned alert rules (ones that predict growth over time rather than only looking at the current value) tend to be targeted towards catching relatively "slow" leaks of disk usage.

there is always the possibility for a "fast" disk space consumer to fill up the disk more quickly than your alerting system can bring it to your attention and you can fix it. at the extreme end, for example, a standard EBS volume has a throughput of 125 MB/s. something that saturates that limit will fill up 10 GB of free space in 80 seconds.
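
The arithmetic behind that figure, as a small sketch. The `alert_latency_seconds` parameter is a hypothetical stand-in for however long it takes an alert to fire and a human to respond; it is not a number from the comment above.

```python
MB = 10**6
GB = 10**9

def seconds_to_fill(free_bytes: int, write_bytes_per_sec: int) -> float:
    """Time for a writer at a constant rate to consume the remaining free space."""
    return free_bytes / write_bytes_per_sec

def outruns_alerting(free_bytes: int, write_bytes_per_sec: int,
                     alert_latency_seconds: float) -> bool:
    """True if the disk would fill before anyone could plausibly react."""
    return seconds_to_fill(free_bytes, write_bytes_per_sec) < alert_latency_seconds

print(seconds_to_fill(10 * GB, 125 * MB))        # 80.0
print(outruns_alerting(10 * GB, 125 * MB, 300))  # True, assuming a 5 minute response time
```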

ssl-3 8 hours ago | parent | next [-]

50% is probably unrealistic. Nobody really wants to diminish their storage by 50%.

Let's set a fixed threshold -- 100GB, say -- and play out both methods.

Method A: One or more ballast files are created, totalling 100GB. The machine runs out of storage and grinds to a halt. Hopefully someone notices soon or gets a generic alert that it has ceased, remembers that there are ballast files, and deletes one or more of them. They then poke it with a stick and get it going again, and set forth to resolve whatever was causing the no-storage condition (adding disk, cleaning trash, or whatever).

Method B: A specific alert that triggers with <100GB of free space. Someone sees this alert, understands what it means (because it is descriptive instead of generic), and logs in to resolve the low-storage condition (however that is done -- same as Method A). There is no stick-poking.

Method C: The control. We do nothing, and run out of space. Panic ensues. Articles are written.

---

Both A and B methods have an equal number of alerts for each low-disk condition (<100GB). Both methods work, in that they can form the impetus to free up some space.

But Method A relies on a system to crash, while Method B does not rely upon a crash at all.

I think that the lack of crash makes Method B rather superior all on its own.

(Method C sucks.)

tempestn 3 hours ago | parent [-]

A + B would be best. Warn at 200, file to reserve the last 100 (or 50 or whatever). That way if the fill is too fast to react to in time, you still have a quick way to temporarily gain disk space, if needed to solve the problem.
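
One way to read "warn at 200, reserve the last 100" is sketched below. The detail worth noticing is that the OS-reported free space already excludes the ballast, so the warning has to add it back in. The function name, the returned strings, and the 1 GB "effectively full" cutoff are all assumptions for illustration.

```python
GB = 10**9

def next_action(os_free_bytes: int,
                ballast_bytes: int = 100 * GB,
                warn_below: int = 200 * GB,
                full_below: int = 1 * GB) -> str:
    """Decide what to do given the free space the OS reports.

    os_free_bytes excludes the ballast, so the space that could be
    reclaimed in an emergency is os_free_bytes + ballast_bytes.
    """
    if os_free_bytes < full_below and ballast_bytes > 0:
        return "release_ballast"  # disk is effectively full: use the reserve
    if os_free_bytes + ballast_bytes < warn_below:
        return "warn"             # Method B: descriptive alert, no crash needed
    return "ok"

print(next_action(150 * GB))  # ok
print(next_action(50 * GB))   # warn
print(next_action(0))         # release_ballast
```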

	ssl-3 3 hours ago | parent [-]
		I like that idea. Belt and suspenders. Alerting on an unexpectedly high rate-of-change, as some others have suggested, also seems good for some workloads.

majormajor 6 hours ago | parent | prev [-]

How does the ballast file prevent extreme runaway? You ain't gonna notice and delete it that quickly.

	tempestn 3 hours ago | parent [-]
		It doesn't prevent it. It gives you a way to potentially recover after the disk fills. Many operations become impossible once the disk is full, so this buys you some temporary breathing room to solve the problem.
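
For reference, a minimal sketch of creating and releasing a ballast file. The function names and the path are hypothetical. It writes real zero bytes rather than just setting a file length, because a sparse file would not actually reserve any blocks; on filesystems with compression or deduplication, a file of zeros may still occupy far less than its nominal size, so treat this as an illustration only.

```python
import os

def create_ballast(path: str, size_bytes: int, chunk_bytes: int = 1024 * 1024) -> None:
    """Write size_bytes of zeros to path so the space is really allocated."""
    chunk = b"\0" * chunk_bytes
    remaining = size_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            n = min(chunk_bytes, remaining)
            f.write(chunk[:n])
            remaining -= n
        f.flush()
        os.fsync(f.fileno())  # make sure the data has reached the disk

def release_ballast(path: str) -> int:
    """Delete the ballast file and return the bytes freed (0 if it was already gone)."""
    try:
        size = os.path.getsize(path)
        os.remove(path)
        return size
    except FileNotFoundError:
        return 0
```

Deleting a file needs no extra disk space, which is why this still works when the disk is completely full. As noted elsewhere in the thread, someone has to remember to recreate the ballast afterwards.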

dspillett a day ago | parent | prev [-]

If the alarm works. And if it is actioned, not just snoozed too often or dismissed entirely.

Defence in depth is a good idea: proper alarms, and a secondary measure in case they don't have the intended effect.

pixl97 21 hours ago | parent | next [-]

Alarms are great, but when something goes wrong SSDs can fill up amazingly fast!

n4r9 21 hours ago | parent | prev | next [-]

Surely there are pitfalls either way. A ballast file can be deleted too readily, or someone could forget to re-add it.

	dspillett 4 minutes ago | parent [-]
		Yep. That is why doing both can be beneficial. Alerts are more proactive if acted upon, but they are often too easy to ignore, so ballast is more fail-safe in that respect.
