nostrademons | 4 hours ago
Google solved most of these problems around 2005, with tools like LOG_EVERY_N (now part of absl [1]), Dapper [2], and several other tools that aren't public yet. You can trace an individual request through every internal system and view the request/response protobufs, every log line the server emitted, timing details, etc. More to the point, you can share this trace, which means one person can discover the bug and reproduce it, and another person in a completely different office/timezone/country can debug it, even if the latter can't reproduce the bug themselves. This has proved hugely useful; just last week I was tasked with reproducing a bug on sparsely-available prerelease hardware so that a distant team could diagnose what went wrong.

The key insight that this article hints at but doesn't quite get to: you should treat your logs as a product whose customers are the rest of the devs in your company. The way you log things is intimately connected with what you want to do with them, and you need to build systems that generate useful insights from the log statements. In some cases the logs literally are part of the product: many of the machine learning systems that generate recommendations, search results, spam filtering, abuse detection, traffic direction, etc. are built on the product's logs, and you need to treat those logs as first-class citizens that you absolutely cannot break while adding new features. Logs are not just for debugging. (Minimal sketches of these ideas follow the links below.)

[1] https://absl.readthedocs.io/en/latest/absl.logging.html

[2] https://research.google/pubs/dapper-a-large-scale-distribute...
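To make the LOG_EVERY_N idea concrete, here's a minimal sketch using the absl Python library from [1], whose `log_every_n` helper rate-limits a log statement per call site. `process_record` is a hypothetical workload name, not anything from the article:

    # Requires: pip install absl-py
    from absl import app, logging

    def process_record(i):
        # Emits this line only on every 1000th call from this call site,
        # so a hot loop doesn't flood the log but still leaves a trail.
        logging.log_every_n(logging.INFO, 'processed %d records', 1000, i)

    def main(argv):
        del argv  # unused
        for i in range(10_000):
            process_record(i)

    if __name__ == '__main__':
        app.run(main)

The C++ macro of the same name in absl behaves analogously: the counter is per statement, so each log site gets its own budget.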
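And here's a rough sketch of the Dapper-style trace propagation described in [2]: mint a trace ID once at the edge, stamp it on every log line, and carry it across service boundaries so all logs for one request can be joined into a single shareable trace. The names (`trace_id`, `handle_request`, `call_backend`) are illustrative, not Google's actual API:

    import contextvars
    import uuid

    trace_id = contextvars.ContextVar('trace_id', default='-')

    def log(msg):
        # Stamping every line with the trace ID is what makes a trace
        # shareable: anyone can pull all logs for one request later.
        print(f'trace={trace_id.get()} {msg}')

    def call_backend(payload):
        # In a real RPC stack the trace ID rides along in request
        # metadata/headers, so the backend's logs join the same trace.
        log(f'backend received {payload!r}')

    def handle_request(payload):
        trace_id.set(uuid.uuid4().hex[:16])  # minted once at the edge
        log('request received')
        call_backend(payload)
        log('request done')

    handle_request({'q': 'example'})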
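Finally, one way to picture "logs as a product": a log line as a structured event with a stable, versioned schema that downstream consumers (recommendation pipelines, abuse detection, etc.) depend on. All field names here are hypothetical:

    import json
    import sys
    import time

    def log_search_event(query, result_ids):
        event = {
            'schema_version': 1,       # bump deliberately; consumers key off it
            'ts': time.time(),
            'event': 'search_results_served',
            'query': query,
            'result_ids': result_ids,  # renaming this silently breaks ML jobs
        }
        sys.stdout.write(json.dumps(event) + '\n')

    log_search_event('flights to NYC', ['doc_17', 'doc_42'])

The point of the explicit schema version is exactly the "cannot break while adding new features" constraint: changing a field is an API change for every team consuming the logs, not a local refactor.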