| ▲ | cowsandmilk 8 hours ago |
| Horrid advice at the end about logging every error, exception, slow request, etc., if you are sampling healthy requests. Taking slow requests as an example: a dependency gets slower and now your log volume suddenly goes up 100x. Can your service handle that? Are you causing a cascading outage due to increased log volumes? Recovery is easier if your service is doing the same or less work in a degraded state. Increasing logging by 20-100x when degraded is not that. |
|
| ▲ | 46Bit 3 hours ago | parent | next [-] |
| What we're doing at Cloudflare (including some of what the author works on) samples adaptively. Each log batch is bucketed based on a few fields, and if a bucket contains a lot of logs we only keep the sqrt or log of the number of input logs. It works really well... but part of why it works well is that we always have blistering rates of logs, so we can cope with spikes in event rates without the sampling system itself getting overwhelmed. |
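
A minimal sketch of that per-bucket sqrt sampling, assuming a toy Log type and bucket fields picked purely for illustration (not Cloudflare's actual pipeline):

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

type Log struct {
	Service string
	Level   string
	Message string
}

// bucketKey groups logs by a few fields; which fields to use is a policy choice.
func bucketKey(l Log) string {
	return l.Service + "|" + l.Level
}

// sampleBatch keeps roughly sqrt(n) logs from each bucket of size n, so huge
// buckets are aggressively thinned while small buckets pass through untouched.
func sampleBatch(batch []Log) []Log {
	buckets := make(map[string][]Log)
	for _, l := range batch {
		k := bucketKey(l)
		buckets[k] = append(buckets[k], l)
	}

	var kept []Log
	for _, logs := range buckets {
		n := len(logs)
		keep := int(math.Ceil(math.Sqrt(float64(n))))
		// Shuffle so the kept subset is a uniform random sample of the bucket.
		rand.Shuffle(n, func(i, j int) { logs[i], logs[j] = logs[j], logs[i] })
		kept = append(kept, logs[:keep]...)
	}
	return kept
}

func main() {
	batch := make([]Log, 0, 10001)
	for i := 0; i < 10000; i++ {
		batch = append(batch, Log{Service: "api", Level: "info", Message: fmt.Sprintf("req %d", i)})
	}
	batch = append(batch, Log{Service: "api", Level: "error", Message: "upstream timeout"})
	fmt.Println("kept:", len(sampleBatch(batch))) // ~100 info logs plus the single error
}
```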
|
| ▲ | trevor-e 8 hours ago | parent | prev | next [-] |
| Yea that was my thought too. I like the idea in principle, but these magic thresholds can really bite you. It claims to be a p99, probably based on some historical measurement, but that's only accurate if it's updated dynamically. Maybe this could periodically query the OTel provider for the real number, to at least limit the window in which something bad can happen. |
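
A rough sketch of that periodic-refresh idea: keep the slow-request cutoff in an atomic value and update it on a timer. fetchP99 here is a hypothetical stand-in for a query against your metrics backend, not a real OTel API:

```go
package main

import (
	"sync/atomic"
	"time"
)

// slowThresholdMillis holds the current slow-request cutoff; the logging path
// only ever reads it, so refreshes never block request handling.
var slowThresholdMillis atomic.Int64

// fetchP99 is a hypothetical placeholder for a query against your metrics
// backend; swap in whatever your provider actually exposes.
func fetchP99() (time.Duration, error) {
	return 350 * time.Millisecond, nil
}

// refreshThreshold re-queries the real p99 on an interval, keeping the last
// known value if the query fails.
func refreshThreshold(interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		if p99, err := fetchP99(); err == nil {
			slowThresholdMillis.Store(p99.Milliseconds())
		}
	}
}

// isSlow is what request handlers consult when deciding whether to log.
func isSlow(d time.Duration) bool {
	return d.Milliseconds() > slowThresholdMillis.Load()
}

func main() {
	slowThresholdMillis.Store(500) // static fallback until the first refresh
	go refreshThreshold(5 * time.Minute)
	_ = isSlow(400 * time.Millisecond)
}
```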
|
| ▲ | otterley 8 hours ago | parent | prev | next [-] |
| It’s an important architectural requirement for a production service to be able to scale out its log ingestion capabilities to meet demand. Besides, a little local on-disk buffering goes a long way, and is cheap to boot. It’s an antipattern to flush logs directly over the network. |
| |
| ▲ | lanstin 4 hours ago | parent [-] |
| And everything in the logging path, from the API to the network to the ingestion pipeline, needs to be best effort: configure a capacity and ruthlessly drop messages as needed, at all stages. Actually a nice case for UDP :) |
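
One way that best-effort stage could look, sketched as a bounded in-process queue that drops on overflow instead of back-pressuring the request path (names are illustrative):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// logQueue is a fixed-capacity buffer between the request path and the log
// shipper. Enqueue never blocks; overflow is counted and dropped.
type logQueue struct {
	ch      chan string
	dropped atomic.Int64
}

func newLogQueue(capacity int) *logQueue {
	return &logQueue{ch: make(chan string, capacity)}
}

// Enqueue is best effort: if the buffer is full, the message is dropped.
func (q *logQueue) Enqueue(msg string) {
	select {
	case q.ch <- msg:
	default:
		q.dropped.Add(1)
	}
}

func main() {
	q := newLogQueue(1024)
	for i := 0; i < 10000; i++ {
		q.Enqueue(fmt.Sprintf("log line %d", i))
	}
	fmt.Println("dropped:", q.dropped.Load()) // everything past the buffer's capacity
}
```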
| ▲ | otterley 3 hours ago | parent [-] |
| It depends. Some cases, like auditing, require full fidelity; others don’t. Plus, if you’re offering a logging service to a customer, the customer’s expectation is that once logs are successfully ingested, your service doesn’t drop them. If you’re violating that expectation, it needs to be clearly communicated to, and assented to by, the customer. |
|
|
|
| ▲ | golem14 5 hours ago | parent | prev | next [-] |
| For high-volume services, you can still log a sample of healthy requests, e.g., trace_id mod 100 == 0. That keeps log growth under control. The higher the volume, the smaller the percentage you can use. |
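
A tiny sketch of that deterministic 1-in-N sampling, hashing the trace ID so the decision is stable for every log line in a trace (assumes string trace IDs; N is tuned to volume):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// sampleHealthy keeps roughly 1-in-n healthy requests, keyed on the trace ID
// so the decision is consistent across all log lines of the same trace.
func sampleHealthy(traceID string, n uint32) bool {
	h := fnv.New32a()
	h.Write([]byte(traceID))
	return h.Sum32()%n == 0
}

func main() {
	fmt.Println(sampleHealthy("4bf92f3577b34da6a3ce929d0e0e4736", 100))
}
```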
|
| ▲ | Veserv 7 hours ago | parent | prev | next [-] |
| I do not see how logging could bottleneck you in a degraded state unless your logging is terribly inefficient. A properly designed logging system can record on the order of 100 million logs per second per core. Are you actually contemplating handling 10 million requests per second per core that are failing? |
| |
| ▲ | otterley 7 hours ago | parent [-] |
| Generation and publication are just the beginning (never mind the fact that resources consumed by an application to log something are no longer available to do real work). You have to consider the scalability of each component in the logging architecture from end to end. There's ingestion, parsing, transformation, aggregation, derivation, indexing, and storage. Each one of those needs to scale to meet demand. |
| ▲ | Veserv 7 hours ago | parent [-] |
| I already accounted for consumed resources when I said 10 million instead of 100 million; I allocated 10% to logging overhead. If your service is within 10% of overload, you are already in for a bad time. And frankly, what systems are you using that are handling 10 million requests per second per core (100 nanoseconds per request)? Hell, what services are you deploying where you even have 10 million requests per second per core to handle? All of those other costs are, again, trivial with proper design. You can easily handle billions of events per second on the backend with even a modest server; this is done regularly by time-traveling debuggers, which actually need to handle these data rates. So again, what are we even deploying that has billions of events per second? |
| ▲ | otterley 7 hours ago | parent [-] |
| In my experience working at AWS and with customers, you don't need billions of TPS to make an end-to-end logging infrastructure keel over. It takes much less than that. As a working example, you can host your own end-to-end infra (the LGTM stack is pretty easy to deploy in a Kubernetes cluster) and see what it takes to grind yours to a halt with a given set of resources and TPS/volume. |
| ▲ | Veserv 6 hours ago | parent [-] |
| I prefaced all my statements with the assumption that the chosen logging system is not poorly designed and terribly inefficient. Sounds like their logging solutions are poorly designed and terribly inefficient then. It is, in fact, a self-fulfilling prophecy to complain that logging can be a bottleneck if you then choose logging that is 100-1000x slower than it should be. What a concept. |
| ▲ | otterley 6 hours ago | parent [-] |
| At the end of the day, it comes down to what sort of functionality you want out of your observability. Modest needs usually require modest resources: sure, you could just append to log files on your application hosts and ship them to a central aggregator where they're stored as-is. That's cheap and fast, but you won't get a lot of functionality out of it. If you want more, like real-time indexing, transformation, analytics, alerting, etc., it requires more resources. Ain't no such thing as a free lunch. |
| ▲ | dpark 5 hours ago | parent [-] |
| Surely you aren’t doing real-time indexing, transformation, analytics, etc. in the same service that is producing the logs. A catastrophic increase in logging could certainly take down your log processing pipeline, but it should not create cascading failures that compromise your service. |
| ▲ | otterley 5 hours ago | parent [-] |
| Of course not. Worst case should be backpressure, which means processing, indexing, and storage delays. Your service might be fine but your visibility will be reduced. |
| ▲ | dpark 5 hours ago | parent [-] |
| For sure. You can definitely tip over your logging pipeline and impact visibility. I just wanted to make sure we weren’t still talking about “causing a cascading outage due to increased log volumes” as was mentioned above, which would indicate a significant architectural issue. |
|
| ▲ | XCSme 5 hours ago | parent | prev | next [-] |
| Good point. It also reminded me of when I was trying to optimize my app for some specific scenarios, and then I realized it's better to optimize it for ALL scenarios, so it stays fast and the servers can handle the load no matter what. To be more specific, I decided NOT to cache any common queries, but instead to make sure that all queries are as fast as possible. |
|
| ▲ | debazel 7 hours ago | parent | prev | next [-] |
| My impression was that you would apply this filter after the logs have reached your log destination, so there should be no difference for your services unless you host your own log infra, in which case there might be issues on that side. At least that's how we do it with Datadog, because ingestion is cheap but indexing and storing logs long term is the expensive part. |
|
| ▲ | Cort3z 6 hours ago | parent | prev [-] |
| Just implement exponential backoff for slow-request logging, or some other heuristic, to keep it under control. I definitely agree it is a concern, though. |
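
One possible heuristic along those lines, sketched here: within a time window, log only the 1st, 2nd, 4th, 8th, ... slow request, so a surge produces O(log n) log lines (an illustration, not a standard library feature):

```go
package main

import (
	"fmt"
	"math/bits"
	"sync"
	"time"
)

// backoffLogger admits the 1st, 2nd, 4th, 8th, ... event in each window and
// drops the rest, so a surge of slow requests yields O(log n) log lines.
type backoffLogger struct {
	mu     sync.Mutex
	window time.Duration
	start  time.Time
	count  uint64
}

func (b *backoffLogger) ShouldLog(now time.Time) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if now.Sub(b.start) > b.window {
		b.start, b.count = now, 0 // new window: start counting again
	}
	b.count++
	// Admit only counts that are exact powers of two.
	return bits.OnesCount64(b.count) == 1
}

func main() {
	bl := &backoffLogger{window: time.Minute, start: time.Now()}
	logged := 0
	for i := 0; i < 1000; i++ {
		if bl.ShouldLog(time.Now()) {
			logged++
		}
	}
	fmt.Println("logged", logged, "of 1000 slow requests") // ~10
}
```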