Veserv 6 hours ago

I already accounted for consumed resources when I said 10 million instead of 100 million. I allocated 10% to logging overhead. If your service is within 10% of overload you are already in for a bad time. And frankly, what systems are you using that are handling 10 million requests per second per core (100 nanoseconds per request)? Hell, what services are you deploying that you even have 10 million requests per second per core to handle?
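
To put numbers on it: at 10 million requests per second per core, each request has a 100 ns budget, so a 10% logging allocation means roughly 10 ns per log event. That rules out format-string-plus-syscall logging, but it is a comfortable budget for a fixed-size binary record appended to a preallocated ring. A minimal sketch of what I mean (hypothetical, Go; run with go test -bench .):

    package logbench

    import (
        "encoding/binary"
        "sync/atomic"
        "testing"
    )

    // Hypothetical "properly designed" hot path: a log event is a
    // fixed-size binary record appended to a preallocated ring.
    // No formatting, no allocation, no syscall per event.
    const (
        recordSize = 32
        ringSlots  = 1 << 20
    )

    var (
        buf  = make([]byte, recordSize*ringSlots)
        head uint64
    )

    func logEvent(ts, id, a, b uint64) {
        slot := (atomic.AddUint64(&head, 1) - 1) % ringSlots
        p := buf[slot*recordSize:]
        binary.LittleEndian.PutUint64(p[0:], ts)
        binary.LittleEndian.PutUint64(p[8:], id)
        binary.LittleEndian.PutUint64(p[16:], a)
        binary.LittleEndian.PutUint64(p[24:], b)
    }

    func BenchmarkLogEvent(b *testing.B) {
        for i := 0; i < b.N; i++ {
            logEvent(uint64(i), 42, 1, 2)
        }
    }

An atomic increment plus four stores lands on the order of ten nanoseconds on commodity hardware; loggers that allocate, format strings, and make a syscall per event burn 100-1000x that.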

All of those other costs are, again, trivial with proper design. You can easily handle billions of events per second on the backend with even a modest server. This is done regularly by time-traveling debuggers, which actually need to handle these data rates. So again, what are we even deploying that has billions of events per second?
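
The backend half is the same idea in reverse: drain the ring in large contiguous batches so per-event cost amortizes down to memory bandwidth. Continuing the sketch above (and ignoring the producer lapping the consumer, which a real ring has to handle):

    // drain consumes records in contiguous batches; per-event backend
    // cost amortizes to a bounds check plus whatever process() does.
    func drain(process func(rec []byte)) {
        var tail uint64
        for {
            h := atomic.LoadUint64(&head)
            if tail == h {
                continue // caught up; a real consumer would park or poll
            }
            for ; tail < h; tail++ {
                slot := tail % ringSlots
                process(buf[slot*recordSize : (slot+1)*recordSize])
            }
        }
    }

At 32 bytes per event, a billion events per second is 32 GB/s, squarely within the memory bandwidth of a modest modern server.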

otterley 6 hours ago | parent

In my experience working at AWS and with customers, you don't need billions of TPS to make an end-to-end logging infrastructure keel over. It takes much less than that. As a working example, you can host your own end-to-end infra (the LGTM stack is pretty easy to deploy in a Kubernetes cluster) and see what it takes to bring yours to a grinding halt with a given set of resources and TPS/volume.

Veserv 6 hours ago | parent

I prefaced all my statements with the assumption that the chosen logging system is neither poorly designed nor terribly inefficient. Sounds like their logging solutions are poorly designed and terribly inefficient, then.

It is, in fact, a self-fulfilling prophecy to complain that logging can be a bottleneck if you then choose logging that is 100-1000x slower than it should be. What a concept.

otterley 6 hours ago | parent

At the end of the day, it comes down to what sort of functionality you want out of your observability. Modest needs usually require modest resources: sure, you could just append to log files on your application hosts and ship them to a central aggregator where they're stored as-is. That's cheap and fast, but you won't get a lot of functionality out of it. If you want more, like real-time indexing, transformation, analytics, alerting, etc., it requires more resources. Ain't no such thing as a free lunch.
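
To illustrate the cheap end of that spectrum, something like this (a sketch; the field names are made up) costs almost nothing at the producer:

    package main

    import (
        "bufio"
        "encoding/json"
        "os"
        "time"
    )

    func main() {
        // Append-only local file; a separate agent tails it and ships
        // lines to the aggregator as-is. No indexing, no transformation,
        // no real-time queries at this layer.
        f, err := os.OpenFile("app.log",
            os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
        if err != nil {
            panic(err)
        }
        defer f.Close()

        w := bufio.NewWriterSize(f, 1<<20) // large buffer amortizes syscalls
        defer w.Flush()

        enc := json.NewEncoder(w) // one self-describing JSON line per entry
        for i := 0; i < 1000; i++ {
            _ = enc.Encode(map[string]any{
                "ts":  time.Now().UnixNano(),
                "msg": "request handled",
                "req": i,
            })
        }
    }

Everything beyond that (indexing on arbitrary fields, stream transforms, alert evaluation) multiplies the per-byte cost, and that's where the resource bill actually comes from.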

dpark 5 hours ago | parent

Surely you aren’t doing real-time indexing, transformation, analytics, etc. in the same service that is producing the logs.

A catastrophic increase in logging could certainly take down your log processing pipeline but it should not create cascading failures that compromise your service.

otterley 5 hours ago | parent

Of course not. Worst case should be backpressure, which means processing, indexing, and storage delays. Your service might be fine but your visibility will be reduced.
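
Concretely, that isolation usually looks like a bounded queue with a non-blocking producer; a sketch (names are hypothetical):

    package logship

    import "sync/atomic"

    var (
        logQ    = make(chan []byte, 65536) // bounded: the backpressure point
        dropped atomic.Uint64
    )

    // emit runs on the request hot path and never blocks: if the
    // pipeline is behind, the entry is dropped and counted instead.
    func emit(entry []byte) {
        select {
        case logQ <- entry:
        default:
            dropped.Add(1) // visibility reduced, service unaffected
        }
    }

    // shipper runs in its own goroutine; it alone absorbs downstream
    // delays (batching, retries, a slow aggregator).
    func shipper(send func([]byte) error) {
        for entry := range logQ {
            _ = send(entry) // hypothetical transport to the aggregator
        }
    }

The queue depth is the whole policy: when the shipper falls behind, the worst case is a gap in your logs and a dropped-entries counter, not stalled request handling.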

dpark 5 hours ago | parent

For sure. You can definitely tip over your logging pipeline and impact visibility.

I just wanted to make sure we weren’t still talking about “causing a cascading outage due to increased log volumes” as was mentioned above, which would indicate a significant architectural issue.