eterm 2 days ago

How I'd personally like to treat them:

  - Critical / Fatal: Unrecoverable without human intervention; someone needs to get out of bed, now.
  - Error: Recoverable without human intervention, but not without data / state loss. Must be fixed asap. An assumption didn't hold.
  - Warning: Recoverable without intervention. Must have an issue created and prioritised. (If business as usual, this could be downgraded to INFO.)

The main difference therefore between error and warning is, "We didn't think this could happen" vs "We thought this might happen".

So for example, a failure to parse JSON might be an error if you're responsible for generating that serialisation, but might be a warning if you're not.
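
Roughly, in Python-ish terms (a sketch only; the function and logger names are made up):

  import json
  import logging

  log = logging.getLogger("orders")

  def load_order(payload, we_own_the_serializer):
      try:
          return json.loads(payload)
      except json.JSONDecodeError as exc:
          if we_own_the_serializer:
              # We produced this JSON ourselves; an assumption didn't hold.
              log.error("unparseable order payload we generated: %s", exc)
          else:
              # Third-party input; we always knew this could happen.
              log.warning("unparseable order payload from partner: %s", exc)
          return None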

arwhatever a day ago | parent | next [-]

I like to think of “warning” as something to alert on statistically, e.g. incorrect password attempt rate jumps from 0.4% of login attempts to 99%.
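
Something like this, say (numbers and thresholds invented purely for illustration):

  # Sketch: flag an abnormal failed-login ratio rather than logging each failure.
  def should_alert(failed, total, baseline_ratio=0.004):
      if total == 0:
          return False
      ratio = failed / total
      # Alert only when the ratio is wildly above the historical baseline.
      return ratio > max(10 * baseline_ratio, 0.05)

  print(should_alert(failed=99, total=100))   # True: ~99% vs ~0.4% baseline
  print(should_alert(failed=4, total=1000))   # False: business as usual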

mzi 15 hours ago | parent | next [-]

This sounds more like metrics than a log statement.

For me logs should complement metrics, and can in many instances be replaced by tracing if the spans are annotated sufficiently. But making metrics out of logs is both costly and a bit brittle.

lanstin a day ago | parent | prev [-]

This point is important - the value of a log is inextricably tied to its unlikelihood. Which depends on so many things in the context.

bluGill 17 hours ago | parent [-]

The value of any log is tied only to whether it will help you find and debug a problem. If you never do statistics on it, that password log is useless. If you never encounter a problem where the log helps you debug, it was useless.

God doesn't tell you the future, so good luck figuring out which logs you really need.

masswerk a day ago | parent | prev | next [-]

Also, warnings for ambiguous results.

For example, when a process implies a conversion according to the contract/convention, but we know that this conversion may not be the expected result and the input may be based on semantic misconceptions. E.g., assemblers and contextually truncated values for operands: while there's no issue with the grammar, syntax, or intrinsic semantics, a higher-level misconception may be involved (e.g., regarding address modes), resulting in output that is correct but still non-functional. So: "In this individual case, there may or may not be an issue. Please check. (Not resolvable on our end.)"
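
A toy sketch of the idea (not any particular assembler; names invented):

  import logging

  log = logging.getLogger("asm")

  def encode_operand(value, bits):
      """Truncate an operand to the width the instruction allows for."""
      mask = (1 << bits) - 1
      truncated = value & mask
      if truncated != value:
          # Grammatically and syntactically fine, possibly semantically wrong:
          # the programmer may have intended a wider address mode.
          # Not resolvable on our end, so: warning, please check.
          log.warning("operand %#x truncated to %#x (%d bits); check address mode",
                      value, truncated, bits)
      return truncated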

(Disclaimer: I know that this is very much classic computing and that this has now mostly moved to the global TOS, but still, it's the classic example for a warning.)

RaftPeople a day ago | parent | prev | next [-]

> The main difference therefore between error and warning is, "We didn't think this could happen" vs "We thought this might happen".

What about conditions like "we absolutely knew this would happen regularly, but it's something that prevents the completion of the entire process, which is absolutely critical to the organization"?

The notion of an "error" is very context dependent. We usually use it to mean "cannot proceed with an action that is required for the successful completion of this task".

wizzwizz4 a day ago | parent [-]

Those conditions would be "Critical", no? The error vs warning distinction doesn't apply.

fhcuvyxu a day ago | parent [-]

No, many applications need to be fault tolerant.

Crashing your web stack because one route hit an error is a dumb idea.

And no, calling it a warning is also a dumb idea. It is an error.

This article is a navel-gazing expedition.

They're kind of right, but you can turn any warning into an error and vice versa, depending on business needs that outweigh the technical categorisation.

wizzwizz4 13 hours ago | parent [-]

A log entry marked "CRITICAL" does not imply crashing the web stack.

fhcuvyxu 7 hours ago | parent [-]

Right. Was thinking of fatal.

p2detar 19 hours ago | parent | prev | next [-]

Yeah, but instead of logging Critical/Fatal and carrying on, I would just panic() the program. With the other definitions I agree - everything else is recoverable, because the program still runs.

A Warning to me is an error that has very little business-logic side effect/impact, as opposed to an Error, but still requires attention.

IgorPartola 17 hours ago | parent [-]

I write a lot of backend web code that often talks to external services. So, for example, the user wants to add a shipping address to their profile but the address verification API responds with a 500. That is an expected error: sometimes it just happens. I want to log it, but I do not want a traceback or anything like that.

On the other hand, it could be that the API has changed slightly. Say they for some reason decided to rename the input parameter postcode to postal_code and I didn’t change my code to match. This is 100% a programming error that would be classified as critical, but I would not want to panic() the entire server process over it. I just want an alert that hey, there is a programming error, go fix it.

But what could also happen is that when I try to construct a request for the external API, the OS is out of memory. Then I want to just crash the process and rely on automatic process restarts to bring it back up. BTW, logging an error after malloc() returns NULL needs to be done carefully, since you cannot allocate more memory for things like a new log string.
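
In rough Python terms (the verifier endpoint and field names are invented; assuming an HTTP client like requests):

  import logging
  import requests

  log = logging.getLogger("shipping")
  VERIFY_URL = "https://verifier.example/v1/verify"  # made-up endpoint

  def verify_address(postcode):
      # If the OS is out of memory, the MemoryError just propagates, the
      # process dies, and automatic restarts bring it back up.
      resp = requests.post(VERIFY_URL, json={"postcode": postcode}, timeout=5)
      if resp.status_code >= 500:
          # Expected, transient failure: note it, no traceback needed.
          log.warning("address verifier returned %s", resp.status_code)
          return None
      if 400 <= resp.status_code < 500:
          # Smells like a contract change (postcode vs postal_code): a
          # programming error, so alert someone, but don't kill the server.
          log.error("address verifier rejected our request: %s %s",
                    resp.status_code, resp.text[:200])
          return None
      return resp.json()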

mewpmewp2 2 days ago | parent | prev | next [-]

What if you are integrated with a third-party app and it gives you a 5xx once? What do you log it as, if, let's say, after a retry it is fine?

kiicia 2 days ago | parent | next [-]

As always „it depends”

  - info - when this was expected and the system/process is prepared for it (like automatic retry, fallback to a local copy, offline mode, event-driven with a persistent queue, etc.)
  - warning - when the system/process was able to continue but in a degraded manner, maybe leaving the decision to retry to the user or another part of the system, or maybe just relying on someone checking logs for unexpected events; this of course depends on whether that external system is required for some action or is in some way optional
  - error - when the system/process is not able to continue and the particular action has been stopped immediately; this includes the situation where a retry mechanism is not implemented for a step required for completion of the particular action
  - fatal - you need to restart something, either manually or by an external watchdog; you don’t expect this kind of log for a simple 5xx
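
For the automatic-retry / fallback-to-local-copy cases, very roughly (a sketch; the cache path and names are made up):

  import json
  import logging

  log = logging.getLogger("rates")
  CACHE_PATH = "/tmp/rates-cache.json"  # made-up local fallback

  def get_exchange_rates(fetch_remote):
      try:
          rates = fetch_remote()
          log.info("exchange rates refreshed from provider")  # expected path
          return rates
      except IOError as exc:
          try:
              with open(CACHE_PATH) as fh:
                  cached = json.load(fh)
              # Continued, but degraded: stale data until the provider recovers.
              log.warning("provider unavailable (%s); using cached rates", exc)
              return cached
          except OSError:
              # No fallback left; this particular action stops here.
              log.error("provider unavailable and no cached rates; aborting refresh")
              raise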

bqmjjx0kac 2 days ago | parent | prev | next [-]

I would log a warning when an attempt fails, and an error when the final attempt fails.
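
Something along these lines (a sketch; the retry count, delay, and exception type are placeholders):

  import logging
  import time

  log = logging.getLogger("client")

  def call_with_retry(call, attempts=3, delay=0.5):
      for attempt in range(1, attempts + 1):
          try:
              return call()
          except IOError as exc:
              if attempt == attempts:
                  # Final attempt failed: the action is lost, so error.
                  log.error("giving up after %d attempts: %s", attempts, exc)
                  raise
              # Intermediate failure: warn and try again.
              log.warning("attempt %d/%d failed, retrying: %s", attempt, attempts, exc)
              time.sleep(delay)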

mewpmewp2 2 days ago | parent [-]

You are not the OP, but I think I was trying to point out this example case in relation to their descriptions of Error/Warnings.

This scenario may or may not result in data/state loss, and it may also be something that you yourself can't immediately fix. And if it's temporary, what is the point of creating an issue and prioritizing it?

I guess my point is that for any such categorization of errors or warnings there are way too many counterexamples to be able to describe them like that.

So I'd usually think that Errors are something I would heuristically want to quickly react to and investigate (e.g. being paged), while Warnings are something I would periodically check in on (e.g. weekly).

wredcoll 2 days ago | parent [-]

Like so many things in this industry the point is establishing a shared meaning for all the humans involved, regardless of how uninvolved people think.

That being said, I find tying the level to expected action a more useful way to classify them.

mewpmewp2 2 days ago | parent [-]

But what I also see frequently is people trying to do impossible, idealistic things because they read somewhere that something should mean X, when things are never so clear-cut. So either it is not such a simplistic issue and should be understood as such, or there might be a better, more practical definition for it. We should first start from what we are using logs for. Are we using them for debugging, or so we get alerted, or both?

If for debugging, the levels seem relevant in the sense of how quickly we are able to use that information to understand what is going wrong. Out of a potential sea of logs, we want to see first what the most likely culprits were for whatever went wrong. So the higher the log level, the higher the likelihood that this event caused something to go wrong.

If for alerting, they should reflect how bad this particular thing is for the business, and help us set a threshold for when we page or have to react to something.

marcosdumay 2 days ago | parent | prev | next [-]

Well, the GP's criteria are quite good. But what you should actually do depends on a lot more things than the ones you wrote in your comment. It could be so irrelevant that it only deserves a trace log, or so important that it warrants a warning.

Also, you should have event logs you can look at to make administrative decisions. That information surely fits into those; you will want to know about it when deciding to switch to another provider or renegotiate something.

cpburns2009 2 days ago | parent | prev | next [-]

It really depends on the third party service.

For service A, a 500 error may be common and you just need to try again, and a descriptive 400 error indicates the original request was actually handled. In these cases I'd log as a warning.

For service B, a 500 error may indicate the whole API is down, in which case I'd log a warning and not try any more requests for 5 minutes.

For service C, a 500 error may be an anomaly, so I'd treat it as a hard error and log it as an error.
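
As a per-service policy table, roughly (a toy sketch; the cooldown for B is basically a crude circuit breaker):

  import logging
  import time

  log = logging.getLogger("integrations")

  # Invented policy table for illustration.
  POLICY = {
      "A": {"level": logging.WARNING, "cooldown": 0},    # 500s are common; just retry
      "B": {"level": logging.WARNING, "cooldown": 300},  # 500 means the API is down; back off
      "C": {"level": logging.ERROR,   "cooldown": 0},    # 500 is an anomaly; hard error
  }
  _blocked_until = {}

  def record_500(service):
      policy = POLICY[service]
      log.log(policy["level"], "service %s returned 500", service)
      if policy["cooldown"]:
          _blocked_until[service] = time.time() + policy["cooldown"]

  def may_call(service):
      return time.time() >= _blocked_until.get(service, 0.0)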

srdjanr a day ago | parent [-]

What's the difference between B and C? API being down seems like an anomaly.

Also, you can't know how frequently you'll get 500s at the time you're doing integration, so you'll have to go back after some time to revisit log severities. Which doesn't sound optimal.

IgorPartola 17 hours ago | parent [-]

Exactly. What’s worse is that if you have something like a web service that calls an external API, when that API goes down your log is going to be littered with errors and possibly even tracebacks, which is just noise. If you set up a simple “email me on error” kind of service, you will get as many emails as there were user requests.

In theory, some sort of internal API status tracker would be better: something with a heuristic for whether the API is up or down and what the error rate is. It should warn you when the API goes down and when it comes back up. Logging could still show an error or a warning for each request, but you don’t need to get an email about each one.
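
Something like this is what I have in mind (a sketch; the threshold and notify hook are placeholders):

  import logging

  log = logging.getLogger("upstream")

  class UpstreamStatus:
      """Track a rough up/down state and notify only on transitions,
      not on every failed request."""

      def __init__(self, notify, failure_threshold=5):
          self.notify = notify                      # e.g. send one email or page
          self.failure_threshold = failure_threshold
          self.consecutive_failures = 0
          self.down = False

      def record_success(self):
          self.consecutive_failures = 0
          if self.down:
              self.down = False
              self.notify("upstream API is back up")

      def record_failure(self):
          self.consecutive_failures += 1
          log.warning("upstream API call failed (%d in a row)",
                      self.consecutive_failures)
          if not self.down and self.consecutive_failures >= self.failure_threshold:
              self.down = True
              self.notify("upstream API looks down")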

eterm 2 days ago | parent | prev [-]

This might be controversial, but I'd say if it's fine after a retry, then it doesn't need a warning.

Because what I'd want to know is how often does it fail, which is a metric not a log.

So expose <third party api failure rate> as a metric not a log.

If feeding logs into Datadog or similar is the only way you're collecting metrics, then you aren't treating your observability with the respect it deserves. Put in real counters so you're not just reacting to what catches your eye in the logs.
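
For instance, with a Prometheus-style counter (a sketch; assumes the prometheus_client package, but any metrics library with counters works the same way):

  from prometheus_client import Counter

  THIRD_PARTY_FAILURES = Counter(
      "third_party_api_failures_total",
      "Failed calls to third-party APIs, by provider and outcome",
      ["provider", "outcome"],
  )

  def record_failure(provider, recovered_by_retry):
      outcome = "recovered" if recovered_by_retry else "gave_up"
      # A real counter: cheap, aggregable, alertable on its rate; no log line needed.
      THIRD_PARTY_FAILURES.labels(provider=provider, outcome=outcome).inc()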

If the third party being down has a knock-on effect to your own system functionality / uptime, then it needs to be a warning or error, but you should also put in the backlog a ticket to de-couple your uptime from that third-party, be it retries, queues, or other mitigations ( alternate providers? ).

By implementing a retry you planned for that third party to be down, so it's just business as usual if it succeeds on retry.

mewpmewp2 2 days ago | parent | next [-]

> If the third party being down has a knock-on effect to your own system functionality / uptime, then it needs to be a warning or error, but you should also put in the backlog a ticket to de-couple your uptime from that third-party, be it retries, queues, or other mitigations ( alternate providers? ).

How do you define uptime? What if e.g. it's a social login / data linking and that provider is down? You could have multiple logins and your own e-mail and password, but you still might lose users because the provider is down. How do you log that? Or do you only put it as a metric?

You can't always easily replace providers.

ivan_gammel a day ago | parent [-]

You may log that or count failures in some metric, but the correct answer is to have a health check on the third-party service and an alert when that service is down. Logs may help you understand the nature of the incident, but they are not the channel through which you are informed about such problems.

A different issue is when the third party breaks the contract, so suddenly you get a lot of 4xx or 5xx responses, likely unrecoverable. Then you get ERROR-level messages in the log (because it’s an unexpected problem) and an alert when there’s a spike.

hk__2 2 days ago | parent | prev | next [-]

> This might be controversial, but I'd say if it's fine after a retry, then it doesn't need a warning.
>
> Because what I'd want to know is how often does it fail, which is a metric not a log.

It’s not controversial; you just want something different. I want the opposite: I want to know why/how it fails; counting how often it does is secondary. I want a log that says "I sent this payload to this API and I got this error in return", so that later I can debug if my payload was problematic, and/or show it to the third party if they need it.
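
i.e. roughly (a sketch; the truncation limits are arbitrary):

  import json
  import logging

  log = logging.getLogger("outbound")

  def log_api_failure(url, payload, status, body):
      # Keep enough context to debug later, or to hand to the third party:
      # what we sent, and what we got back.
      log.warning("call to %s failed with %s; payload=%s; response=%s",
                  url, status, json.dumps(payload)[:2000], body[:2000])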

hamandcheese a day ago | parent | prev [-]

My main gripe with metrics is that they are not easily discoverable like logs are. Even if you capture a list of all the metrics emitted from an application, they often have zero context and so the semantics are a bit hard to decipher.

sysguest 2 days ago | parent | prev [-]

hmm maybe we need extra representation?

e.g. 2.0 for "trace" / 1.0 for "debug" / 0.0 for "info" / -1.0 for "warn" / -2.0 for "error that can be handled"
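
i.e. something like (toy mapping only):

  # Map the proposed numeric scale to names; anything in between snaps to the
  # nearest defined level.
  LEVELS = {
      2.0: "trace",
      1.0: "debug",
      0.0: "info",
      -1.0: "warn",
      -2.0: "error that can be handled",
  }

  def level_name(severity):
      nearest = min(LEVELS, key=lambda lvl: abs(lvl - severity))
      return LEVELS[nearest]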

wredcoll 2 days ago | parent [-]

I said this elsewhere, but the point here is what the humans involved are supposed to do with this info. Do I literally get out of bed on an error log or do I grep for them once or twice a month?

ivan_gammel a day ago | parent [-]

You should never get out of bed on an error in the log. Logs are for retrospective analysis, health checks and metrics are for situational awareness, alerts are for waking people up.