BuildTheRobots 2 days ago

> I've often thought about that when there's a work crisis: If I'm the second on the scene, what can I do to support those fighting the fire right now, before jumping in.

A lot of it depends on the size and skill-set of your team and the escalation routes available to you, but in general (and off the top of my head):

- Get the first people on scene to give a summary of the problem as they know it. Make sure everyone actually agrees on what the problem is and what symptoms have been observed. Understand what areas people are currently investigating and make sure they aren't trampling over each other or actually making the situation worse. [1]

- Make sure the situation hasn't evolved whilst the first on scene have been investigating the initial symptoms. It's easy to get lost in the weeds digging into a handful of monitoring alerts only to look up and realise there are now 300 and the original problem is only a small part of what's going on.

- If there isn't one already and you're not better off doing something else, become incident commander. When done right it's an extremely important and useful role.

  - Take over external communication and protect the team from distractions
  - Start assessing escalation options
  - Take copious notes and keep a timeline 
  - Act as a shared memory and keep people honest 
  - Have a less panicked, wider (non-minutiae) view of the problem
  - Start collating and pulling up documentation/schematics so the people at the coalface can quickly query it rather than getting distracted searching for it.
  - Be ready to jump, for when someone inevitably asks "can someone check..." or "does anyone know"
  - Keep track of the "shared truth" of the incident as it evolves. What have we witnessed, what do we believe is the cause, _why_ do we believe that? Have we invalidated anything, do we need to reassess, are we sure logical lynchpins aren't confirmation bias or dyslexia?
  - Onboard new people and hand over if appropriate.
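
For the note-taking, timeline and "shared truth" points above, here's a rough sketch of what that shared memory could look like if you wanted it as more than a scratch document. Everything in it (`IncidentLog`, `Belief`, the method names, the "commander" label) is made up for illustration; it isn't from any real incident tool, and a shared doc or chat channel does the same job.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Belief:
    """A working hypothesis plus the evidence it rests on (hypothetical type)."""
    claim: str
    evidence: list[str]
    invalidated_by: Optional[str] = None


@dataclass
class IncidentLog:
    """Shared memory for an incident: a timeline plus the current beliefs."""
    timeline: list[tuple[datetime, str, str]] = field(default_factory=list)
    beliefs: list[Belief] = field(default_factory=list)

    def note(self, who: str, what: str) -> None:
        # Every entry is timestamped in UTC so the timeline can be replayed later.
        self.timeline.append((datetime.now(timezone.utc), who, what))

    def believe(self, claim: str, evidence: list[str]) -> Belief:
        # Record what we think is going on and, crucially, why we think it.
        belief = Belief(claim, list(evidence))
        self.beliefs.append(belief)
        self.note("commander", f"hypothesis: {claim}")
        return belief

    def invalidate(self, belief: Belief, reason: str) -> None:
        # Keep the belief on record, but mark it dead and say what killed it.
        belief.invalidated_by = reason
        self.note("commander", f"invalidated: {belief.claim} ({reason})")

    def shared_truth(self) -> list[Belief]:
        # Beliefs that nobody has invalidated yet.
        return [b for b in self.beliefs if b.invalidated_by is None]

    def unsupported(self) -> list[Belief]:
        # Live beliefs with no recorded evidence: worth reassessing first.
        return [b for b in self.shared_truth() if not b.evidence]


log = IncidentLog()
log.note("first responder", "server x flagged on monitoring; ssh hangs before the login prompt")
oom = log.believe("server x is out of memory", ["ssh hangs before the login prompt"])
bulb = log.believe("the bulb is broken", [])  # no evidence recorded yet
print([b.claim for b in log.unsupported()])   # ['the bulb is broken']
log.invalidate(bulb, "there is a power cut")
print([b.claim for b in log.shared_truth()])  # ['server x is out of memory']
```

The one design choice that matters here is that invalidated beliefs stay in the log rather than being deleted, so the timeline still shows what was believed at the time and why it was dropped.
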

Being at the coalface when it's on fire is a very different view of the world to watching other people panic and singe their fingers. It's also very easy to get lost in a chain of technical problems [2] when it's mostly irrelevant to the wider picture.

If you get a moment, it can also be a good time to assess how useful your monitoring is during an actual event.

[1] "Hey, server x has flagged on monitoring and my ssh session is hung waiting for a login prompt!" I've been round the houses enough to know this is probably OOM and if I just wait, I'm likely to finally get in. I also know that saying this in a room of 20 technical people means the server is now processing 22 new ssh sessions and now no one is getting anywhere.

[2] The famous Malcolm in the Middle intro where Hal is tasked with changing a lightbulb and ends up repairing the car. Except in my example the bulb is actually fine and there's a power cut we missed. https://www.youtube.com/watch?v=AbSehcT19u0