Remix.run Logo
zbentley 2 days ago

My point is that just because those problems can be solved with better telemetry doesn’t mean that is actually done in practice. Most organizations do are much more aware of/sensitive to failures upstream/at the edge than they are in backend services. Once you account for alert fatigue, crappy accountability distribution, and organizational pressures, even the places that do this well often backslide over time.

In brief: drivers don’t obey the speed limit and backend service operators don’t prioritize monitoring. Both groups are supposed to do those things, but they don’t and we should assume they won’t change. As a result, it’s a good idea to wear seatbelts and treat downstream failures as urgent errors in the logs of consuming services.