| I agree with this statement: "Instead of logging what your code is doing, log what happened to this request." But the impression I can't shake is that this person lacks experience, or more likely has a lot of experience doing the same thing over and over. "Bug parts" (as in "acceptable number of bug parts per candy bar") logging should include the precursors of processing metrics. I think what he calls "wide events" I call bug-parts logging, to emphasize that it may also include signals about which code paths were taken, how many times, and how long they took. Logging is not metrics is not auditing. In particular, processing can continue if logging (temporarily) fails, but not if auditing has failed. I prefer the terminology "observables" to "logging" and "evaluatives" to "metrics". In mature SCADA systems there is the well-worn notion of a "historian". Read up on it. A fluid-level sensor on CANbus sending events 10x a second isn't telling me whether or not I have enough fuel to get to my destination (a significant question); however, that granularity might be helpful for diagnosing a stuck sensor (or a bad connection). It would be impossibly fatiguing and hopelessly distracting to try to answer the significant question from this firehose of low-information events. Even a de-noised fuel gauge doesn't directly answer my desired evaluative (will I get there or not?). Does my fuel gauge need to also serve as the debugging interface for the sensor? No, it does not. Likewise, send metrics / evaluatives to the cloud, not logging / observables; when something goes sideways, the real work is getting off your ass and taking a look. Take the time to think about what that looks like: maybe that's the best takeaway. |
| > Logging is not metrics is not auditing. I espouse a "grand theory of observability" that, like matter and energy, treats logs, metrics, and audits alike. At the end of the day, they're streams of bits, and so long as no fidelity is lost, they can be converted between each other. Audit trails are certainly carried over logs. Metrics are streams of time-series numeric data; they can be carried over log channels or embedded inside logs (as they often are). How these signals are stored, transformed, queried, and presented may differ, but ultimately the consumption endpoint and mechanism can be the same regardless of origin. Doing so simplifies both the conceptual framework and the design of the processing system, and makes it flexible enough to suit any conceivable set of use cases. Plus, storing the ingested logs as-is in inexpensive long-term archival storage allows you to reprocess them later however you like. |
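As a sketch of that unification: one ingested stream carrying all three signal classes, demultiplexed after the fact. (The `metric` and `audit` field names here are illustrative, not any particular system's schema.)

```python
import json

# Hypothetical structured events on a single log channel; metrics and
# audit records are embedded alongside ordinary log messages.
raw_stream = [
    json.dumps({"ts": 1700000000, "msg": "request handled",
                "metric": {"name": "latency_ms", "value": 42}}),
    json.dumps({"ts": 1700000001, "msg": "user role changed", "audit": True}),
    json.dumps({"ts": 1700000002, "msg": "request handled",
                "metric": {"name": "latency_ms", "value": 57}}),
]

def split_streams(lines):
    """Demultiplex one log channel into metrics, audit records, and plain logs."""
    metrics, audits, logs = [], [], []
    for line in lines:
        event = json.loads(line)
        if "metric" in event:
            metrics.append((event["ts"], event["metric"]["name"],
                            event["metric"]["value"]))
        elif event.get("audit"):
            audits.append(event)
        else:
            logs.append(event)
    return metrics, audits, logs

metrics, audits, logs = split_streams(raw_stream)
print(metrics)  # [(1700000000, 'latency_ms', 42), (1700000002, 'latency_ms', 57)]
```

Because the archived stream keeps full fidelity, the same pass can be re-run later with different routing rules.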
| |
| ▲ | lll-o-lll 7 hours ago | parent | next [-] | | Auditing is fundamentally different because it has different durability and consistency requirements. I can buffer my logs, but I might need to transact my audit. | | |
| ▲ | otterley 7 hours ago | parent | next [-] | | For most cases, buffering audit logs on local storage is fine. What matters is that the data is available and durable somewhere in the path, not that it be transactionally durable at the final endpoint. | | |
▲ | lll-o-lll 3 hours ago | parent [-] | | What are we defining as “audit” here? My experience is with regulatory requirements, and “durable” on local storage isn’t enough. In practice, the audit isn’t really a log; it’s something more akin to a database record. The point is that you can’t filter your log stream for audit requirements. | | |
▲ | otterley 3 hours ago | parent [-] | | Take Linux kernel audit logs as an example. So long as they can be persisted to local storage successfully, they are considered durable. That’s been the case since the audit subsystem was first created. In fact, you can configure the kernel to panic as soon as records can no longer be recorded. Regulators have never dictated where auditable logs must live. Their requirement is that the records in scope are accurate (which implies tamper-proof) and accessible. Provided those requirements are met, where the records are found is irrelevant. It thus follows that if all records, across the union of centralized storage and endpoint storage, meet the above requirements, the regulator will be satisfied. | | |
▲ | lll-o-lll 2 hours ago | parent [-] | | > Regulators have never dictated where auditable logs must live. That’s true. They specify that logs cannot be lost, must be available for x years, must not be modifiable, must be accessible only in y ways, and cannot cross various boundaries/borders (depending on the government in question). Or bad things will happen to you (your company). In practice, this means that the durability of the audit record “a thing happened” cannot simply be “I persisted it to disk on one machine”. You need to know that the record has been made durable (across whatever your durability mechanism is, for example a DB with HA + DR) before progressing to the next step. Depending on the stringency, RPO needs to be zero for audit, which is why I say it is a special case. I don’t know anything about Linux audit; I doubt it has any relevance to regulatory compliance. | |
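The zero-RPO ordering described here — make the record durable first, proceed only after the durability ack — can be sketched as below. `AuditStore` and `record_then_act` are hypothetical names, and a real system would replicate across machines rather than only fsync one disk:

```python
import os
import tempfile

class AuditStore:
    """Toy durable audit sink: append-only file, fsync'd before acking.
    (A real deployment would also replicate; this shows only the ordering.)"""
    def __init__(self, path):
        self._f = open(path, "a")

    def record(self, line):
        self._f.write(line + "\n")
        self._f.flush()
        os.fsync(self._f.fileno())  # block until the OS reports the write durable

def record_then_act(store, record, action):
    # Write-ahead auditing: the record is made durable *before* the audited
    # step runs; if the audit write raises, the step never happens (RPO = 0).
    store.record(record)
    return action()

path = os.path.join(tempfile.mkdtemp(), "audit.log")
store = AuditStore(path)
result = record_then_act(store, "actor=alice op=transfer amount=100",
                         lambda: "transferred")
print(result)  # transferred
```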
▲ | otterley an hour ago | parent [-] | | > In practice, this means that durability of that audit record “a thing happened” cannot be simply “I persisted it to disk on one machine” As long as the record can be located when it is sought, it does not matter how many copies there are. The regulator will not ask so long as your system is a reasonable one. Consider that technologies like RAID did not exist once upon a time, and backup copies were latent and expensive. Yet we still considered the storage (which was often just a hard copy on paper) to be sufficient to meet the applicable regulations. If a fire then happened and burned the place down, and all the records were lost, the business would not be sanctioned so long as they took reasonable precautions. Here, I’m not suggesting that “the record is on a single disk, that ought to be enough.” I am assuming that in the ordinary course of business, there is a working path to getting additional redundant copies made, but those additional copies are temporarily delayed due to overload. No reasonable regulator is going to tell you this is unacceptable. > Depending on the stringency, RPO needs to be zero for audit And it is! The record is either in local storage or in central storage. |
| |
| ▲ | chickensong 6 hours ago | parent | prev [-] | | You could have the log shipper filter events and create a separate audit stream with different behavior and destination. | | |
▲ | cluckindan 5 hours ago | parent [-] | | Really, have sane log message types and include “audit” as one of them. Log levels could be considered an anti-pattern. | |
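Combining the two suggestions above — a shipper that routes on a message-type field, with "audit" as one type among the rest — might look like this sketch (the `type` field and the sink names are assumptions, not any particular shipper's config):

```python
def ship(events, audit_sink, log_sink):
    """Route on message type at the shipper: 'audit' events go to a durable
    destination; everything else goes to the ordinary log pipeline, which is
    allowed to buffer and, under pressure, drop."""
    for event in events:
        if event.get("type") == "audit":
            audit_sink.append(event)  # stand-in for a transactional write
        else:
            log_sink.append(event)    # stand-in for a buffered batch write

audit_sink, log_sink = [], []
ship(
    [
        {"type": "info", "msg": "user registered"},
        {"type": "audit", "msg": "role changed", "actor": "admin"},
        {"type": "error", "msg": "exception thrown"},
    ],
    audit_sink,
    log_sink,
)
print(len(audit_sink), len(log_sink))  # 1 2
```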
▲ | hu3 3 hours ago | parent [-] | | I like this. But doesn't it make sense to somehow categorize an Exception thrown as an error? And a new user registration as an info? Perhaps use tags then? |
| |
| ▲ | Veserv 7 hours ago | parent | prev [-] | | Saying they are all the same when no fidelity is lost is missing the point. The only distinction between logs, traces, and metrics is literally what to do when fidelity is lost. If you have insufficient ingestion rate: Logs are for events that can be independently sampled and be coherent. You can drop arbitrary logs to stay within ingestion rate. Traces are for correlated sequences of events where the entire sequence needs to be retained to be useful/coherent. You can drop arbitrary whole sequences to stay within ingestion rate. Metrics are pre-aggregated collections of events. You pre-limited your emission rate to fit your ingestion rate at the cost of upfront loss of fidelity. If you have adequate ingestion rate, then you just emit your events bare and post-process/visualize your events however you want. | | |
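A toy admission-control sketch of those three policies, assuming a simple token budget standing in for remaining ingestion capacity (trace handling here is naive head sampling: the whole correlated sequence is kept or dropped based on its first span):

```python
class Budget:
    """Token budget standing in for remaining ingestion capacity."""
    def __init__(self, tokens):
        self.tokens = tokens

    def try_take(self, n=1):
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False

def admit(events, budget):
    """Per-signal drop policy: logs are dropped individually, traces as whole
    sequences (one token charged per trace, decided at its first span), and
    metrics pass through because fidelity was already traded at the source."""
    admitted, dropped = set(), set()
    kept = []
    for kind, payload in events:
        if kind == "metric":
            kept.append((kind, payload))      # pre-aggregated upstream
        elif kind == "log":
            if budget.try_take():
                kept.append((kind, payload))  # coherent on its own; droppable
        elif kind == "trace":
            tid = payload["trace_id"]
            if tid in dropped:
                continue                      # rest of this sequence dropped too
            if tid in admitted or budget.try_take():
                admitted.add(tid)
                kept.append((kind, payload))
            else:
                dropped.add(tid)
    return kept

kept = admit(
    [("log", "a"),
     ("trace", {"trace_id": "t1", "span": 1}),
     ("trace", {"trace_id": "t1", "span": 2}),
     ("metric", {"name": "qps", "value": 120}),
     ("log", "b"),
     ("trace", {"trace_id": "t2", "span": 1})],
    Budget(2),
)
print([k for k, _ in kept])  # ['log', 'trace', 'trace', 'metric']
```

With a budget of 2, one log and the whole of trace `t1` are admitted, the metric passes regardless, and log `b` and trace `t2` are shed coherently.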
▲ | m3047 4 hours ago | parent | next [-] | | Good summary IMO. > You can drop arbitrary logs to stay within ingestion rate. Another way I've heard this framed in production environments ingesting a firehose is: you can drop individual logging events because there will always be more. | |
▲ | otterley 3 hours ago | parent [-] | | It depends. Some cases like auditing require full fidelity. Others don’t. Plus, if you’re offering a logging service to a customer, the customer’s expectation is that once successfully ingested, your service doesn’t drop logs. If you’re violating that expectation, it needs to be clearly communicated to and assented to by the customer. The right way to think about logs, IMO, is less like diagnostic information and more like business records. If you change the framing of the problem, you might solve it in a different way. |
| |
| ▲ | otterley 7 hours ago | parent | prev [-] | | > If you have insufficient ingestion rate I would rather fix this problem than every other problem. If I'm seeing backpressure, I'd prefer to buffer locally on disk until the ingestion system can get caught up. If I need to prioritize signal delivery once the backpressure has resolved itself, I can do that locally as well by separating streams (i.e. priority queueing). It doesn't change the fundamental nature of the system, though. |
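A minimal sketch of that local buffering idea — an in-memory stand-in for an on-disk spool that drains higher-priority streams (e.g. audit) first once backpressure clears:

```python
import heapq

class BackpressureBuffer:
    """Spill signals locally while the ingestion endpoint pushes back, then
    drain the highest-priority streams first once it recovers. (In-memory toy;
    a real shipper would spool to disk.)"""
    def __init__(self):
        self._heap = []
        self._seq = 0  # tie-breaker keeps arrival order within one priority

    def put(self, priority, event):
        heapq.heappush(self._heap, (priority, self._seq, event))
        self._seq += 1

    def drain(self, n):
        out = []
        while self._heap and len(out) < n:
            _, _, event = heapq.heappop(self._heap)
            out.append(event)
        return out

buf = BackpressureBuffer()
buf.put(0, "audit: payment recorded")  # priority 0 drains first
buf.put(2, "debug: cache miss")
buf.put(1, "metric: latency_ms=42")
first_batch = buf.drain(2)
print(first_batch)  # ['audit: payment recorded', 'metric: latency_ms=42']
```

As the comment notes, this changes delivery order under recovery but not the fundamental nature of the system.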