joshuamorton | 7 hours ago
I will say that, with very few exceptions, this is how a lot of $BigCos manage, every day. When I run into an issue like this, I will do a few things:

- Roll back, and investigate the changelog between the current and prior version to see which code paths are relevant
- Use our observability infra, which is equivalent to `perf` but samples ~everything, all the time, again to see which code paths are relevant (see the sketch below)
- Potentially try to push additional logging or instrumentation
- Try to get a better repro in a non-prod/test env where I can do more aggressive forms of investigation (debugger, sanitizer, etc.), but where I'm not running on production data

I certainly can't strace or run raw CLI commands on a host in production.
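As a rough picture of that "samples ~everything, all the time" idea, here is a minimal sketch in Go using the standard `net/http/pprof` package as a stand-in for whatever in-house observability stack is actually meant above; the port and service shape are assumptions, not details from the comment.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// Expose CPU/heap/goroutine profiles alongside the service so they can be
	// pulled on demand (or scraped continuously by a profiling agent) without
	// shelling into the host.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... the actual service would run here ...
	select {}
}
```

With something like this running, `go tool pprof http://localhost:6060/debug/pprof/heap` pulls a heap profile and `/debug/pprof/profile?seconds=30` a CPU sample, which is roughly the workflow of checking "which code paths are relevant" without host access.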
reactordev | 6 hours ago | parent | next
Combined with stack traces of the events, this is the way. If you have a memory leak, wrap the suspect code in more instrumentation. Write unit tests that exercise that suspect code. Load test that suspect code. Fix that suspect code. (A rough sketch of that kind of test follows below.)

I'll also add that while I build clusters and throw away the ssh keys, there are still ways to gain access to a specific container to view the raw logs and execute commands, but like all container environments, it's ephemeral. There's SPICE access.
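As an illustration of "write unit tests and load tests that exercise the suspect code", here is a hedged sketch in Go; `process`, the iteration count, and the growth threshold are all hypothetical stand-ins for whatever code path is actually leaking.

```go
package suspect

import (
	"runtime"
	"testing"
)

// TestProcessDoesNotRetainMemory drives the suspect path in a loop and checks
// that live heap does not grow without bound across the run.
func TestProcessDoesNotRetainMemory(t *testing.T) {
	var before, after runtime.MemStats
	runtime.GC()
	runtime.ReadMemStats(&before)

	for i := 0; i < 100_000; i++ {
		process(make([]byte, 1024)) // hypothetical suspect code path
	}

	runtime.GC()
	runtime.ReadMemStats(&after)
	if growth := int64(after.HeapAlloc) - int64(before.HeapAlloc); growth > 10<<20 {
		t.Fatalf("heap grew by %d bytes after load; possible leak", growth)
	}
}

// BenchmarkProcess doubles as a cheap load test and reports allocations per op.
func BenchmarkProcess(b *testing.B) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		process(make([]byte, 1024))
	}
}

// process is a placeholder for the real suspect code.
func process(buf []byte) { _ = buf }
```

Running `go test -bench=. -benchmem` on something shaped like this is one way to pin down whether the suspect path really is the one retaining memory before attempting a fix.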
zinodaur | 4 hours ago | parent | prev
> I certainly can't strace or run raw CLI commands on a host in production.

Have you worked the other way before? Where you have ssh access to machines (lots of them, when you need to do something big) that have all of your secrets, can talk to all of your dbs, and you can just compile + rsync binaries onto them to debug/repro/repair? To me, being without those capabilities just feels crippling.
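For contrast, the "compile + rsync a binary onto a box" loop being described might look something like this sketch; the host name, remote path, and tool flags are all hypothetical, and in practice this is often just a shell one-liner rather than a Go wrapper.

```go
package main

import (
	"log"
	"os"
	"os/exec"
)

const host = "user@prod-host-42" // hypothetical target machine

// run executes a local command and streams its output, aborting on failure.
func run(name string, args ...string) {
	cmd := exec.Command(name, args...)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		log.Fatalf("%s %v: %v", name, args, err)
	}
}

func main() {
	// Build a debug binary locally (cross-compile with GOOS/GOARCH if the
	// remote architecture differs).
	run("go", "build", "-o", "/tmp/debugtool", "./cmd/debugtool")
	// Copy it onto the remote host...
	run("rsync", "-avz", "/tmp/debugtool", host+":/tmp/debugtool")
	// ...and run it there against the live process/data.
	run("ssh", host, "/tmp/debugtool")
}
```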