altairprime 7 hours ago
> During this incident, we discovered we had crossed a scale threshold where our log ingestion pipeline was being rate-limited and quietly discarding logs. Ironically, we ended up with less information as a result, which made it significantly harder to reconstruct what was actually happening.

Last year they posted about using New Relic, Datadog, and Grafana. Would this ‘silent deletion of log data due to quota’ problem be characteristic of any one of them in particular, or is it something we have to watch out for with all of them?
mmcclure 5 hours ago
We don't use New Relic or Datadog (and never have, afaik), so I'm not sure which post you're referring to for those two. We have talked publicly about our Grafana use, though, and about going from an in-house stack to their cloud product. Actual OP can probably hop in later with a better answer, but it was hitting rate limits on the logging agent, not the logging system.
drodman 3 hours ago
In general, you do need to be aware of any agent-level rate limits as well as any ingestion limits from the provider. We do some pretty careful sampling and aggregation for most of the metrics, logs, and traces we store, and as mmcclure said, in this case it was the rules on the node agents themselves throwing the errors. Logging volume on some of the service's critical paths got high enough that logs were dropped due to our configured rate limits.
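To make that failure mode concrete, here's a rough sketch of an agent-side token-bucket limiter (in Go; the limiter shape, the numbers, and all the names are hypothetical, not any real agent's actual config or code). The key property is that the drops land exactly during the bursts you most want to debug, and nothing downstream ever sees that they happened unless the agent counts them:

    package main

    import (
        "fmt"
        "time"
    )

    // tokenBucket is a minimal per-agent rate limiter: it refills
    // `rate` tokens per second up to `burst`, and each log line
    // consumes one token. Lines arriving with no token available
    // are dropped, not queued.
    type tokenBucket struct {
        rate   float64   // tokens added per second
        burst  float64   // maximum bucket size
        tokens float64   // current token count
        last   time.Time // last refill timestamp
    }

    func (b *tokenBucket) allow(now time.Time) bool {
        // Refill proportionally to elapsed time, capped at burst.
        b.tokens += b.rate * now.Sub(b.last).Seconds()
        if b.tokens > b.burst {
            b.tokens = b.burst
        }
        b.last = now
        if b.tokens >= 1 {
            b.tokens--
            return true
        }
        return false
    }

    func main() {
        bucket := &tokenBucket{rate: 100, burst: 200, tokens: 200, last: time.Now()}
        dropped := 0

        ship := func(line string) {
            // Stand-in for the real shipper; an actual agent would
            // forward the line to the ingestion endpoint here.
            _ = line
        }

        // Simulate a burst of 1000 lines arriving at once: only the
        // burst allowance gets through, and the other 800 vanish
        // silently unless the drop counter is exported somewhere.
        now := time.Now()
        for i := 0; i < 1000; i++ {
            if bucket.allow(now) {
                ship(fmt.Sprintf("log line %d", i))
            } else {
                dropped++
            }
        }
        fmt.Printf("dropped %d of %d lines\n", dropped, 1000)
    }

The practical takeaway is that a counter like `dropped` needs to be exported as a metric and alerted on; most agents expose some form of drop counter, and watching it is the only reliable way to notice this happening before an incident forces you to.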