joshuamorton | 7 hours ago
I will say that, with very few exceptions, this is how a lot of $BigCos manage, every day. When I run into an issue like this, I will do a few things:

- Roll back, and investigate the changelog between the current and prior version to see which code paths are relevant
- Use our observability infra, which is equivalent to `perf` but samples ~everything, all the time, again to see which code paths are relevant (see the sketch below)
- Potentially try to push additional logging or instrumentation
- Try to get a better repro in a non-prod/test env where I can do more aggressive forms of investigation (debugger, sanitizer, etc.), but where I'm not running on production data

I certainly can't strace or run raw CLI commands on a host in production.
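As a rough picture of that "samples ~everything, all the time" idea, here is a minimal sketch in Go using the standard `net/http/pprof` package as a stand-in for whatever in-house observability stack is actually meant above; the port and service shape are assumptions, not details from the comment.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// Expose CPU/heap/goroutine profiles alongside the service so they can be
	// pulled on demand (or scraped continuously by a profiling agent) without
	// shelling into the host.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... the actual service would run here ...
	select {}
}
```

With something like this running, `go tool pprof http://localhost:6060/debug/pprof/heap` pulls a heap profile and `/debug/pprof/profile?seconds=30` a CPU sample, which is roughly the workflow of checking "which code paths are relevant" without host access.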
reactordev | 6 hours ago | parent | next
Combined with stack traces of the events, this is the way. If you have a memory leak, wrap the suspect code in more instrumentation. Write unit tests that exercise that suspect code. Load test that suspect code. Fix that suspect code. (A rough sketch of that kind of test follows below.)

I'll also add that while I build clusters and throw away the ssh keys, there are still ways to gain access to a specific container to view the raw logs and execute commands, but like all container environments, it's ephemeral. There's SPICE access.
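As an illustration of "write unit tests and load tests that exercise the suspect code", here is a hedged sketch in Go; `process`, the iteration count, and the growth threshold are all hypothetical stand-ins for whatever code path is actually leaking.

```go
package suspect

import (
	"runtime"
	"testing"
)

// TestProcessDoesNotRetainMemory drives the suspect path in a loop and checks
// that live heap does not grow without bound across the run.
func TestProcessDoesNotRetainMemory(t *testing.T) {
	var before, after runtime.MemStats
	runtime.GC()
	runtime.ReadMemStats(&before)

	for i := 0; i < 100_000; i++ {
		process(make([]byte, 1024)) // hypothetical suspect code path
	}

	runtime.GC()
	runtime.ReadMemStats(&after)
	if growth := int64(after.HeapAlloc) - int64(before.HeapAlloc); growth > 10<<20 {
		t.Fatalf("heap grew by %d bytes after load; possible leak", growth)
	}
}

// BenchmarkProcess doubles as a cheap load test and reports allocations per op.
func BenchmarkProcess(b *testing.B) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		process(make([]byte, 1024))
	}
}

// process is a placeholder for the real suspect code.
func process(buf []byte) { _ = buf }
```

Running `go test -bench=. -benchmem` on something shaped like this is one way to pin down whether the suspect path really is the one retaining memory before attempting a fix.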
zinodaur | 4 hours ago | parent | prev
> I certainly can't strace or run raw CLI commands on a host in production.

Have you worked the other way before? Where you have ssh access to machines (lots of them, when you need to do something big) that have all of your secrets, can talk to all of your dbs, and you can just compile + rsync binaries onto them to debug/repro/repair? To me, being without those capabilities just feels crippling.
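For contrast, the "compile + rsync a binary onto a box" loop being described might look something like this sketch; the host name, remote path, and tool flags are all hypothetical, and in practice this is often just a shell one-liner rather than a Go wrapper.

```go
package main

import (
	"log"
	"os"
	"os/exec"
)

const host = "user@prod-host-42" // hypothetical target machine

// run executes a local command and streams its output, aborting on failure.
func run(name string, args ...string) {
	cmd := exec.Command(name, args...)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		log.Fatalf("%s %v: %v", name, args, err)
	}
}

func main() {
	// Build a debug binary locally (cross-compile with GOOS/GOARCH if the
	// remote architecture differs).
	run("go", "build", "-o", "/tmp/debugtool", "./cmd/debugtool")
	// Copy it onto the remote host...
	run("rsync", "-avz", "/tmp/debugtool", host+":/tmp/debugtool")
	// ...and run it there against the live process/data.
	run("ssh", host, "/tmp/debugtool")
}
```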