electroly 2 days ago

I think OP is making two separate but related points, a general point and a specific point. Both involve guessing something that the error-handling code, on the spot, might not know.

1. When I personally see database timeouts at work, it's rarely the database's fault; 99 times out of 100 it's the caller's fault for their crappy query, and they should have looked at the query plan before deploying it. How is the error-handling code supposed to know? I log timeouts that still fail after retry as errors so someone looks at them, and we get a stack trace leading me to the bad query (see the sketch below). The database itself tracks timeout metrics, but the log is much more immediately useful: it takes me straight to the scene of the crime. I think this is OP's primary point: in some cases, investigation is required to determine whether it's your service's fault or not, and the error-handling code doesn't have the information to know that.

2. As with exceptions vs. return values in code, the low-level code often doesn't know how the higher-level caller will classify a particular error. A low-level error may or may not be a high-level error; the low-level code can't know that, but the low-level code is the one doing the logging. The low-level logging might even be in a third-party library. This is particularly tricky when code reuse enters the picture: the same error might be "page the on-call immediately" level for one consumer, but "ignore, this is expected" for another consumer.

I think the more general point (that you should avoid logging errors for things that aren't your service's fault) stands. It's just tricky in some cases.
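A minimal sketch of the retry-then-log pattern from point 1, assuming a hypothetical driver-level `execute` call that raises `TimeoutError`; the helper name, retry count, and logger are made up for illustration:

    import logging
    import time

    log = logging.getLogger("orders-service")

    def query_with_retry(conn, sql, params=(), attempts=3, backoff=0.5):
        """Retry transient timeouts; log at ERROR only once retries are exhausted."""
        for attempt in range(1, attempts + 1):
            try:
                return conn.execute(sql, params)  # stand-in for whatever the driver exposes
            except TimeoutError:
                if attempt == attempts:
                    # Out of retries: worth a human's attention. logging's exception()
                    # records the stack trace that leads back to the calling code.
                    log.exception("database timeout after %d attempts: %s", attempts, sql)
                    raise
                time.sleep(backoff * attempt)  # simple linear backoff between attempts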
lanstin a day ago

Also, everywhere I have worked there are transient network glitches from time to time. Timeouts can often be caused by these.
makeitdouble 2 days ago

> the database is owned by a separate oncall rotation

Not OP, but this part hits the same for me.

If your client app is killing the DB with too many calls (e.g. your cache is not working), you should be able to detect that and react, without waiting for the DB team to come to you after they've investigated the whole thing.

But you can't know in advance whether the DB connection errors are your fault or not, so logging them to cover the worst-case scenario (you're the cause) is sensible.
raldi 2 days ago

I agree that you should detect this, just through a metric rather than by putting DB timeouts at the ERROR log level.
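For concreteness, one way the metric-instead-of-ERROR approach could look, assuming the service already exposes Prometheus metrics via `prometheus_client`; the metric name, label, and query are invented for the example:

    from prometheus_client import Counter

    # One counter, labelled per call site, so a dashboard or alert rule can catch
    # "this query path is timing out" without anything being logged at ERROR.
    db_timeouts = Counter("db_query_timeouts_total",
                          "Database queries that timed out",
                          ["query_name"])

    def fetch_user(conn, user_id):
        try:
            return conn.execute("SELECT * FROM users WHERE id = %s", (user_id,))
        except TimeoutError:
            db_timeouts.labels(query_name="fetch_user").inc()
            raise  # still propagate the failure; just don't treat it as an ERROR log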
makeitdouble a day ago

But what's the basis of your metric? I feel you're thinking about system-wide downtime with everything timing out consistently, which would be detected by the generic database server vitals and basic logs. But what if the timeouts are sparse and only 10 or 20% above usual from the DB's point of view, yet they affect half of your registration service's requests? You need it logged application-side so the aggregation layer has any chance of catching it.

On writing to ERROR or not: the thresholds should be whatever your dev and oncall teams decide. Nobody outside of them will care; I feel it's like arguing about which drawer the socks should go in.

I was in an org where any single error below CRITICAL was ignored by the oncall team, and everything below that only triggered alerts on aggregation or under special conditions. Pragmatically, we ended up slicing it so that ERROR goes to the APM, and anything below gets no aggregation and is just available when a human wants to look at it for whatever reason (rough sketch below). I'd expect most orgs to end up with that kind of split, where the levels are hooked to processes rather than to some inherent meaning.
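A rough illustration of that "levels hooked to processes" split using the standard-library `logging` module; the handler targets are placeholders, since the real APM and aggregation hookup depends entirely on the org:

    import logging

    root = logging.getLogger()
    root.setLevel(logging.DEBUG)

    # ERROR and above: shipped to whatever the APM / alerting pipeline ingests.
    apm_handler = logging.StreamHandler()        # placeholder for the real APM handler
    apm_handler.setLevel(logging.ERROR)
    root.addHandler(apm_handler)

    # The full stream, DEBUG and up: kept locally with no aggregation, read only
    # when someone goes looking for it.
    local_handler = logging.FileHandler("service.log")
    local_handler.setLevel(logging.DEBUG)
    root.addHandler(local_handler)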
Hizonner a day ago

> No; I’m not understanding your point about guessing. Could you restate?

In the general case, the person writing the software has no way of knowing that "the database is owned by a separate oncall rotation". That's about your organization chart. Admittedly, they'd be justified in assuming that somebody is paying attention to the database. On the other hand, they really can't be sure that the database is going to report anything useful to anybody at all, or whether it's going to report the salient details.

The database may not even know that the request was ever made. Maybe the requests are timing out because they never got there. And definitely maybe the requests are timing out because you're sending too many of them.

I mean, no, it doesn't make sense to log a million identical messages, but that's rate limiting. It's still an error if you can't access your database, and for all you know it's an error that your admin will have to fix.

As for metrics, I tend to see those as downstream of logs. You compute the metric by counting the log messages. And a metric can't say "this particular query failed". The ideal "database timeout" message should give the exact operation that timed out.
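A minimal sketch combining those last two points, rate-limiting identical messages while still logging at ERROR and naming the exact operation that timed out; the dedup window, logger name, and helper are all invented for illustration:

    import logging
    import time

    log = logging.getLogger("payments-service")
    _last_logged = {}  # statement -> monotonic time of the last emitted message

    def log_db_timeout(sql, err, min_interval=60.0):
        """Log a database timeout at ERROR, at most once per min_interval per statement."""
        now = time.monotonic()
        if now - _last_logged.get(sql, 0.0) >= min_interval:
            _last_logged[sql] = now
            # Include the exact statement so the message points at the query that timed out.
            log.error("database timeout on %r: %s", sql, err)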