Remix.run Logo
crawshaw 9 hours ago

The idea that an "observability stack" is going to replace shell access on a server does not resonate with me at all. The metrics I monitor with prometheus and grafana are useful, vital even, but they are always fighting the last war. What I need are tools for when the unknown happens.

The tool that manages all my tools is the shell. It is where I attach a debugger, it is where I install iotop and use it for the first time. It is where I cat out mysterious /proc and /sys values to discover exotic things about cgroups I only learned about 5 minutes prior in obscure system documentation. Take it away and you are left with a server that is resilient against things you have seen before but lacks the tools to deal with the future.

ValdikSS 9 hours ago | parent | next [-]

>It is where I attach a debugger, it is where I install iotop and use it for the first time. It is where I cat out mysterious /proc and /sys values to discover exotic things about cgroups I only learned about 5 minutes prior in obscure system documentation.

It is, SSH is indeed the tool for that, but that's because until recently we did not have better tools and interfaces.

Once you try newer tools, you don't want to go back.

Here's the example of my fairly recent debug session:

    - Network is really slow on the home server, no idea why
    - Try to just reboot it, no changes
    - Run kernel perf, check the flame graph
    - Kernel spends A LOT of time in nf_* (netfilter functions, iptables)
    - Check iptables rules
    - sshguard has banned 13000 IP addresses in its table
    - Each network packet travels through all the rules
    - Fix: clean the rules/skip the table for established connections/add timeouts
You don't need debugging facilities for many issues. You need observability and tracing.

Instead of debugging the issue for tens of minutes at least, I just used observability tool which showed me the path in 2 minutes.

johnisgood 5 minutes ago | parent | next [-]

Your example is a shell debugging session. You ran perf, checked iptables, inspected sshguard - all via SSH (or locally). The "observability tool" here is shell access to system utilities.

This proves the parent's point: when the unknown happens, you need a shell.

IgorPartola 8 hours ago | parent | prev | next [-]

See I would not reboot the server first before figuring out what is happening. You lose a lot of info by doing that and the worst thing that can happen is that the problem goes away for a little bit.

gerdesj 7 hours ago | parent | next [-]

To be fair, turning it off and on again is unreasonably effective.

I recently diagnosed and fixed an issue with Veeam backups that suddenly stopped working part way through the usual window and stopped working from that point on. This particular setup has three sites (prod, my home and DR), and five backup proxies. Anyway, I read logs and Googled somewhat. I rebooted the backup server - no joy, even though it looked like the issue was there. I restarted the proxies and things started working again.

The error was basically: there are no available proxies, even though they were all available (but not working but not giving off "not working" vibes).

I could bother with trying to look for what went wrong but life is too short. This is the first time that pattern has happened to me (I'll note it down mentally and it was logged in our incident log).

So, OK, I'll agree that a reboot should not generally be the first option. Whilst sciencing it or nerding harder is the purist approach, often a cheeky reboot gets the job done. However, do be aware that a Windows box will often decide to install updates if you are not careful 8)

rurban an hour ago | parent | next [-]

Turning it off and on again is risky. I recently upgraded a robot in Australia, had problems with systemd, so I turned it off. And had to wait a few weeks until it could be turned on again, because tailscaled was not setup persistently, the routing was not setup properly (over a phone), the machine had some problems,...

High risk, low reward. But of course the ultimate test if it's properly setup.

But on the other hand, with my tiny hard real-time embedded controllers, a power cycle is the best option. No persistent state, fast power up, reboot in milliseconds. Every little SW error causes a reboot, no problem at all.

rurban an hour ago | parent | prev | next [-]

Turning it off and on again is risky. I recently upgraded a robot in Australia, had problems with systemd, so I turned it off. And had to wait a few weeks until it could be turned on again, because tailscaled was not setup persistently, the routing was not setup properly (over a phone), the machine had some problems,...

High risk, low reward. But of course the ultimate test if it's properly setup.

akerl_ 6 hours ago | parent | prev [-]

No, you didn’t diagnose and fix an issue.

You just temporarily mitigated it.

abrookewood 5 hours ago | parent [-]

Sometimes that is enough - especially for home machines etc.

akerl_ 5 hours ago | parent [-]

I’ve got no problem with somebody choosing to mitigate something instead of fixing it. But it’s just incorrect to apply a blind mitigation and declare that you’ve diagnosed the problem.

butvacuum 2 hours ago | parent [-]

what's the ROI on that?

-- leadership

galleywest200 7 hours ago | parent | prev | next [-]

My job as a DevOps engineer is to ensure customer uptime. If rebooting is the fastest, we do that. Figuring out the why is the primary developers’ jobs.

This is also a good reason to log everything all the time in a human readable way. You can get services up and then triage at your own pace after.

My job may be different than other’s as I work at an ITSP and we serve business phone lines. When business phones do not work it is immediately clear to our customers. We have to get them back up not just for their business but for the ability for them to dial 911.

tolciho 3 hours ago | parent [-]

> This is also a good reason to log everything all the time in a human readable way. You can get services up and then triage at your own pace after.

Unless, hypothetically, the logging velocity tickles kernel bugs and crashes the system, but only when the daemon is started from cron and not elsewhere. Hypothetically, of course.

Or when the system stops working two weeks after launch because "logging everything" has filled up the disk, and took two weeks to so do. This also means important log messages (perhaps that the other end is down) might be buried in 200 lines of log noise and backtrace spam per transaction, which in turn might delay debugging and fixing or at isolating at which end of the tube the problem resides.

butvacuum 8 hours ago | parent | prev | next [-]

most failstates arent worth preserving in a SMB environment. In larger environments or ones equipped for it a snapshot can be taken before rebooting- should the issue repeat.

Once is chance, twice is coincidence, three times makes a pattern.

Ferret7446 6 hours ago | parent [-]

Alternatively, if it doesn't happen again it's not worth fixing, if it does happen again then you can investigate it when it happens again.

ValdikSS 7 hours ago | parent | prev [-]

I've debugged so many issues in my life that sometimes I'd prefer things to just work, and if reboot helps to at least postpone the problem, I'd choose that :D

butvacuum 2 hours ago | parent [-]

seriously, and sometimes it's just not worth investigating. which means its never going to get fixed, and I'd rather go home than create another ticket that'll just get stale and age out.

gerdesj 8 hours ago | parent | prev | next [-]

I fail to understand how your approach is different to your parent.

perf is a shell tool. iptables is a shell tool. sshguard is a log reader and ultimately you will use the CLI to take action.

If you are advocating newer tools, look into nft - iptables is sooo last decade 8) I've used the lot: ipfw, ipchains, iptables and nftables. You might also try fail2ban - it is still worthwhile even in the age of the massively distributed botnet, and covers more than just ssh.

I also recommend a VPN and not exposing ssh to the wild.

Finally, 13,000 address in an ipset is nothing particularly special these days. I hope sshguard is making a properly optimised ipset table and that you running appropriate hardware.

My home router is a pfSense jobbie running on a rather elderly APU4 based box and it has over 200,000 IPs in its pfBlocker-NG IP block tables and about 150,000 records in its DNS tables.

ValdikSS 8 hours ago | parent [-]

>perf is a shell tool. iptables is a shell tool. sshguard is a log reader and ultimately you will use the CLI to take action.

Well yes, and to be honest in this case I did that all over SSH: run `perf`, generate flame graph, copy the .svg to the PC over SFTP, open it in the file viewer.

What I really wanted is a web interface which will just show me EVERYTHING it knows about the system in a form of charts, graphs, so I can just skim through it and check if everything allright visually, without using the shell and each individual command.

Take a look at Netflix presentation, especially on their web interface screenshots: https://archives.kernel-recipes.org/wp-content/uploads/2025/...

>look into nft - iptables is sooo last decade

It doesn't matter in this context: iptables is using new netfilter (I'm not using iptables-legacy), and this exact scenario is 100% possible with native netfilter nft.

>Finally, 13,000 address in an ipset is nothing particularly special these days

Oh, the other day I had just 70 `iptables -m set --match-set` rules, and did you know how apparently inefficient source/destination address hashing algorithm for the set match is?! It was debugged with perf as well, but I wish I just had it as a dashboard picture from the start.

I'm talking about ~4Gbit/s sudden limitation on a 10Gbit link.

kalaksi 31 minutes ago | parent | next [-]

> What I really wanted is a web interface which will just show me EVERYTHING it knows about the system in a form of charts, graphs, so I can just skim through it and check if everything allright visually, without using the shell and each individual command.

For this reason, I've created Lightkeeper: https://github.com/kalaksi/lightkeeper. I made it mostly for my own needs to simplify repetitive tasks and provide an efficient view for monitoring. Also has graphs as a recent addition, but screenshots don't show it. You can also drop to a terminal with a hotkey any time.

Ironically, it works over SSH without any additional daemons.

gerdesj 7 hours ago | parent | prev | next [-]

"What I really wanted is a web interface which will just show me EVERYTHING it knows about the system in a form of charts, graphs, so I can just skim through it and check if everything allright visually, without using the shell and each individual command."

Yes, we all want that. I've been running monitoring systems for over 30 years and it is quite a tricky thing to get right. .1.3.1.4.1.33230 is my company enterprise number, which I registered a while back.

The thing is that even though we are now in 2026, monitoring is still a hard problem. There are, however, lots of tools - way more than we had in the day but just like a saw can rip your finger off instead of cutting a piece of wood, well I'm sure you can fill in the blanks.

Back in the day we had a thing called Ethereal which was OK and nearly got buried. However you needed some impressive hardware to use it. Wireshark is a modern marvel and we all have decent hardware. SNMP is still relevant too.

Although we have stonking hardware these days, you do also have to be aware of the effects of "watching". All those stats have to be gathered and stashed somewhere and be analysed etc. That requires some effort from the system that you are trying to watch. That's why things like snmp and RRD were invented.

Anyway, it is 2026 and IT is still properly hard (as it damn well should be)!

gerdesj 7 hours ago | parent | prev [-]

>Oh, the other day I had just 70 `iptables -m set --match-set` rules, and did you know how apparently inefficient source/destination address hashing algorithm for the set match is?! It was debugged with perf as well!

>I'm talking about ~4Gbit/s sudden limitation on a 10Gbit link.

I think you need to look into things if 70 IPs in a table are causing issues, such that a 10Gb link ends up at four Gb/s. I presume that if you remove the ipset, that 10Gb/s is restored?

Testing throughput and latency is also quite a challenge - how do you do it?

kelnos 3 hours ago | parent | prev | next [-]

That only works if the people who built the observability tool have thought of everything. They haven't, of course; no one can.

It's great that you were able to solve this problem with your observability tools. But nothing will ever be as comprehensive as what you can do with shell access.

I don't get what the big deal is here. Just... use shell access when you need it. If you have other things in place that let you easily debug and fix some classes of issues, great. But some things might be easier to fix with shell access, and you could very easily run into something you can't figure out without ssh.

Completely disabling shell access is just making things harder for you. You don't get brownie points or magical benefits from denying yourself that.

DANmode 3 hours ago | parent [-]

> That only works if the people who built the observability tool have thought of everything. They haven't, of course; no one can.

but the tool draws data from forums of people who have had the problem I’m having before.

crawshaw 8 hours ago | parent | prev [-]

How did you use tracing to check the current state of a machine’s iptables rules?

ValdikSS 7 hours ago | parent [-]

In this case I used `perf` utility, but only because the server does not have a proper observability tool.

Take a look at this Netflix presentation, especially on the screenshots of their web interface tool: https://archives.kernel-recipes.org/wp-content/uploads/2025/...

crawshaw 7 hours ago | parent [-]

That is a command line tool run over ssh. If you have invented a new way to run command line tools, that’s great (and very possible, writing a service that can fork+exec and map stdio), but it is the equivalent to using ssh. You cannot run commands using traces.

charcircuit 7 hours ago | parent [-]

With that mindset anything is equivalent to ssh. The command line is not the pinnacle of user interfaces and giving admins full control of the machine isn't the pinnacle of security either.

We need to accept that UNIX did not get things right decades ago and be willing to evolve UX and security to a better place.

crawshaw 7 hours ago | parent [-]

Happy to try an alternative. Traces I have tried, and it is not an alternative.

raggi 2 hours ago | parent | prev | next [-]

Yep.

Observability stacks are a similar blind alley to containers: They solve a handful of defined problems and immediately fall down on their own KPI's around events handled/prevented in-place, efficiency, easier to use than what came before.

cryptonector an hour ago | parent | prev | next [-]

The problem lies in surveillance and others understanding what you did. Say your security department records every shell interaction with prod services: how does one then review and understand what happened? This is a fairly tricky problem. Perhaps through it at an LLM, but it'd have to be well trained to look for malicious actions.

reactordev 9 hours ago | parent | prev | next [-]

Or… you build a container, that runs exactly what you specify. You print your logs, traces, metrics home so you can capture those stack traces and error messages so you can fix it and make another container to deploy.

You’ll never attach a debugger in production. Not going to happen. Shell into what? Your container died when it errored out and was restarted as a fresh state. Any “Sherlock Holmes” work would be met with a clean room. We have 10,000 nodes in the cluster - which one are you going to ssh into to find your container to attach a shell to it to somehow attach a debugger?

toast0 8 hours ago | parent | next [-]

> We have 10,000 nodes in the cluster - which one are you going to ssh into to find your container to attach a shell to it to somehow attach a debugger?

You would connect to any of the nodes having the problem.

I've worked both ways; IMHO, it's a lot faster to get to understanding in systems where you can inspect and change the system as it runs than in systems where you have to iterate through adding logs and trying to reproduce somewhere else where you can use interactive tools.

My work environment changed from an Erlang system where you can inspect and change almost everything at runtime to a Rust system in containers where I can't change anything and can hardly inspect the system. It's so much harder.

IgorPartola 8 hours ago | parent | prev [-]

Say you are debugging a memory leak in your own code that only shows up in production. How do you propose to do that without direct access to a production container that is exhibiting the problem, especially if you want to start doing things like strace?

joshuamorton 7 hours ago | parent [-]

I will say that, with very few exceptions, this is how a lot of $BigCo manage everyday. When I run into an issue like this, I will do a few things:

- Rollback/investigate the changelog between the current and prior version to see which code paths are relevant

- Use our observability infra that is equivalent to `perf`, but samples ~everything, all the time, again to see which codepaths are relevant

- Potentially try to push additional logging or instrumentation

- Try to better repro in a non-prod/test env where I can do more aggressive forms of investigation (debugger, sanitizer, etc.) but where I'm not running on production data

I certainly can't strace or run raw CLI commands on a host in production.

reactordev 6 hours ago | parent | next [-]

Combined with stack traces of the events, this is the way.

If you have a memory leak, wrap the suspect code in more instrumentation. Write unit tests that exercise that suspect code. Load test that suspect code. Fix that suspect code.

I’ll also add that while I build clusters and throw away the ssh keys, there are still ways to gain access to a specific container to view the raw logs and execute commands but like all container environments, it’s ephemeral. There’s spice access.

zinodaur 5 hours ago | parent | prev [-]

> I certainly can't strace or run raw CLI commands on a host in production.

Have you worked the other way before? Where you have ssh access to machines (lots of them, when you need to do something big) that have all of your secrets, can talk to all of your dbs, and you can just compile + rsync binaries on to them to debug/repro/repair?

To me, being without those capabilities just feels crippling

joshuamorton 4 hours ago | parent [-]

> Have you worked the other way before? Where you have ssh access to machines (lots of them, when you need to do something big) that have all of your secrets, can talk to all of your dbs, and you can just compile + rsync binaries on to them to debug/repro/repair?

A lot of the problems I enjoy solving specifically relate to consistently minimizing privilege, not from a security perspective (though there are obvious upsides to this), but from a debugging/clarity perspective. If you have a relatively small and statically verifiable set of (networked) dependencies, and minimize which resources which containers can access, reasoning about the system as a whole becomes a lot easier.

I can think of lots of cases where I've witnessed good outcomes from moving towards more fine-grained resource access, and very few cases where life has gotten better by saying "everyone has access to everything".

ValdikSS 9 hours ago | parent | prev | next [-]

>What I need are tools for when the unknown happens.

There are tools which show what happens per process/thread and inside the kernel. Profiling and tracing.

Check Yandex's Perforator, Google Perfetto. Netflix also has one, forgot the name.

7 hours ago | parent | prev | next [-]
[deleted]
cyberax 6 hours ago | parent | prev | next [-]

Because you're holding it wrong!

The dashboards are something that looks cool, but they usually are not really helpful for debugging. What you're looking for is per-request tracing and logging, so you can grab a request ID and trace it (get log messages associated with it) through multiple levels of the stack. Even maybe across different services.

Debuggers are great, but they are not a good option for production traffic.

jeffbee 7 hours ago | parent | prev | next [-]

I guess the question is why your observability stack isn't exposing proc and sys for you.

crawshaw 6 hours ago | parent [-]

Mine (prometheus) doesn’t because there are a lot of high-dimensional values to track in /proc and /sys that would blow out storage on a time-series database. Even if they did though, they could not let me actively inject changes to a cgroup. What do you suggest I try that does?

jeffbee 6 hours ago | parent [-]

Experience from another company where I (and you) worked suggests that having the endpoints to expose the system metrics, without actually collecting and storing them, is the way to go.

crawshaw 6 hours ago | parent [-]

Years of debugging in that company’s restricted environments solidified my desire for shell access to production environments. I was there a month before I was hunting for breadcrumbs in a BINARY_INFO log that I had five minutes to grab before it was deleted.

jeffbee 6 hours ago | parent [-]

Well that's funny you mentioned it because one of my projects was a service that let users temporarily install binary info logs collectors triggered by predicates, remotely, which at least I thought was a better model than ssh into the host or, for the advanced caveman, pdsh into many hosts. I don't really see a reason why I can't do that for gRPC, either ...

But, anyway, remote command and control of observability really is a thing in the industry, not just at one company.

gear54rus 9 hours ago | parent | prev [-]

Agreed, this sounds like some complicated ass-backwards way to do what k8s already does. If it's too big for you, just use k3s or k0s and you will still benefit from the absolutely massive ecosystem.

But instead we go with multiple moving parts all configured independently? CoreOS, Terraform and a dependence on Vultr thing. Lol.

Never in a million years I would think it's a good idea to disable SSH access. Like why? Keys and non-standard port already bring China login attempts to like 0 a year.