rdoherty 9 hours ago

This is probably one of the best summarizations of the past 10 years of my career in SRE. Once your systems get complex enough, something is always broken and you have to prepare for that. Detection & response become just as critical as pre-deploy testing.

I do worry about all the automation being another failure point, along with the IaC stuff. That is all software too! How do you update that safely? It's turtles all the way down!

wparad 8 hours ago | parent | next [-]

Thank you!

One of the questions I frequently get is "do you automatically roll back?" And I have to hide in the corner and say "not really". Often, if you knew a rollback would work, you probably could also have known not to roll out in the first place. I've seen a lot of failures that only got worse when automation attempted to turn the thing on and off again.

Luckily from an automation roll-out standpoint, it's not that much harder to test in isolation. The harder parts to validate are things like "Does a Route 53 Failover Record really work in practice at the moment we actually need it to work?"

Usually the answer is yes, but then there's always the "but it too could be broken", and as you said, it's turtles all the way down.

The nice part is that, realistically, the automation for dealing with rollout and IaC is small and simple. We've split up our infrastructure by individual service, so each piece of infra is also straightforward.

In practice, our infra is less DRY and more repeated, which has the benefit of avoiding the complexity that often comes from attempting to reduce code duplication. The ancillary benefit is that simple stuff changes less frequently, and less frequent changes mean less opportunity for issues.

Not surprisingly, most incidents come from changes humans make. The second largest share comes from assumptions humans make about how a system operates in edge conditions. If you know these two things to be 100% true, you spend more time designing simple systems and you avoid making changes unless they are absolutely required.

evanmoran 9 hours ago | parent | prev [-]

IaC is definitely a failure point, but the manual alternative is much worse! I’ve had a lot of benefit from using Pulumi, simply because the code can be more compact than the Terraform HCL was.

For example, for the failover regions (from the article) you could make a Pulumi function that parameterizes only the n things that differ per failover env and guarantee / verify that the scripts are nearly identical. Of course, many people use modules / Terragrunt for similar reasons, but it ends up being quite powerful.
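
A minimal sketch of that idea, written as plain TypeScript rather than real Pulumi code so it runs on its own. The names `RegionOverrides`, `makeEnv`, and `diffKeys`, along with the regions and values, are made up for illustration; in practice the config would feed actual resource constructors.

```typescript
// Hypothetical shape: only the fields allowed to differ between regions.
interface RegionOverrides {
  region: string;
  instanceCount: number;
}

// The full environment config: shared settings plus the overrides.
interface EnvConfig extends RegionOverrides {
  service: string;
  instanceType: string;
  healthCheckPath: string;
}

// Shared values live in exactly one place; callers supply only overrides.
function makeEnv(overrides: RegionOverrides): EnvConfig {
  return {
    service: "api",
    instanceType: "t3.medium",
    healthCheckPath: "/health",
    ...overrides,
  };
}

// Returns the keys whose values differ between two environments.
function diffKeys(a: EnvConfig, b: EnvConfig): string[] {
  const keys = Object.keys(a) as (keyof EnvConfig)[];
  return keys.filter((k) => a[k] !== b[k]);
}

const primary = makeEnv({ region: "us-east-1", instanceCount: 4 });
const failover = makeEnv({ region: "us-west-2", instanceCount: 2 });

// Verify that the two environments differ only in the allowed fields.
const allowed = new Set(["region", "instanceCount"]);
const unexpected = diffKeys(primary, failover).filter((k) => !allowed.has(k));
if (unexpected.length > 0) {
  throw new Error(`Unexpected drift between regions: ${unexpected.join(", ")}`);
}
```

The type system handles the "guarantee" half (a caller cannot override `instanceType` by accident), and the diff check handles the "verify" half.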

wparad 8 hours ago | parent | next [-]

I think some people are going to scream when I say this, but we're using mostly CloudFormation templates.

We don't use the CDK because it introduces complexity into the system.

However, to make CloudFormation usable, our templates are written in TypeScript and generated on the fly. I know that sounds like the CDK, but given the size of our stacks, adding an additional technology doesn't make things simpler, and there is a lot of waste that can be removed by using a programming language rather than JSON/YAML.
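
To give a rough idea of what "TypeScript that generates the template" can look like, here is a small sketch. It is not our actual code: `serviceBucket`, `buildTemplate`, and the service names are hypothetical, and the types cover only the handful of template fields used here.

```typescript
// Minimal types for the parts of a CloudFormation template used here.
interface Resource {
  Type: string;
  Properties?: Record<string, unknown>;
}

interface Template {
  AWSTemplateFormatVersion: "2010-09-09";
  Description: string;
  Resources: Record<string, Resource>;
}

// Hypothetical helper: one versioned, encrypted bucket per service.
function serviceBucket(service: string): Resource {
  return {
    Type: "AWS::S3::Bucket",
    Properties: {
      VersioningConfiguration: { Status: "Enabled" },
      BucketEncryption: {
        ServerSideEncryptionConfiguration: [
          { ServerSideEncryptionByDefault: { SSEAlgorithm: "AES256" } },
        ],
      },
      Tags: [{ Key: "service", Value: service }],
    },
  };
}

function buildTemplate(services: string[]): Template {
  const resources: Record<string, Resource> = {};
  for (const service of services) {
    // Logical IDs must be alphanumeric, so strip everything else.
    const logicalId = service.replace(/[^A-Za-z0-9]/g, "") + "Bucket";
    resources[logicalId] = serviceBucket(service);
  }
  return {
    AWSTemplateFormatVersion: "2010-09-09",
    Description: "Generated from TypeScript",
    Resources: resources,
  };
}

// The JSON string is what would be handed to CloudFormation.
const templateJson = JSON.stringify(buildTemplate(["auth-api", "billing"]), null, 2);
```

The output is still an ordinary template, so anything that reads CloudFormation JSON keeps working; the loop and the helper just replace copy-pasted YAML.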

There are cases where we have some OpenTofu, but for infrastructure resources that are customer specific, we have deployments that are run in TypeScript using the AWS SDK for JavaScript.

It would be nice if we could make a single change and have it roll out everywhere. But the reality is that there are many more states in play than what is represented by a single state file, especially when it comes to interactions between our infra, our customer's configuration, and the history of requests to change the configuration, as well as resources with mutable states.

One example of that is AWS certificates. They expire. We need them to expire. But expiring certs don't magically update state files or stacks. It's really bad to make assumptions about a customer's environment based on what we thought we knew the last time a change was rolled out.
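
As an illustration only, the check could look something like the sketch below: decide based on what is live right now, and flag anything that no longer matches what the last rollout recorded. `CertInfo`, `classify`, `findDrift`, and the 30-day renewal window are assumptions for the example, and fetching the live data (for instance through the AWS SDK) is deliberately left out.

```typescript
interface CertInfo {
  domain: string;
  notAfter: Date; // expiry as reported by the source it was read from
}

type Action = "ok" | "renew" | "expired";

const DAY_MS = 24 * 60 * 60 * 1000;

// Classify a certificate from its live expiry, not from a stored guess.
function classify(cert: CertInfo, now: Date, renewWindowDays = 30): Action {
  const remainingMs = cert.notAfter.getTime() - now.getTime();
  if (remainingMs <= 0) return "expired";
  if (remainingMs <= renewWindowDays * DAY_MS) return "renew";
  return "ok";
}

// Domains where the recorded state no longer matches the live environment,
// either because the cert is gone or because its expiry has changed.
function findDrift(recorded: CertInfo[], live: CertInfo[]): string[] {
  const liveByDomain = new Map(live.map((c) => [c.domain, c] as const));
  return recorded
    .filter((r) => {
      const current = liveByDomain.get(r.domain);
      return !current || current.notAfter.getTime() !== r.notAfter.getTime();
    })
    .map((r) => r.domain);
}
```

Anything `findDrift` returns is a place where the state file is telling a story that is no longer true.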

xyzzy123 8 hours ago | parent | prev | next [-]

I actually like terraform for its LACK of power (tho yeah these days when I have a choice I use a lot of small states and orchestrate with tg).

Pulumi or CDK are for sure more powerful (and great tools) but when I need to reach for them I also worry that the infra might be getting too complex.

wparad 8 hours ago | parent | next [-]

Agreed, it is much too easy to fall into bad habits. The whole goal of OpenTofu is declarative infrastructure. With CDK and Pulumi, it's very easy to end up in a place where you lose that.

But if you need to do something in a particular way, the tools should never be an obstacle.

yearolinuxdsktp 8 hours ago | parent | prev [-]

IMO Pulumi and CDK are an opportunity to simplify your infra by capturing what you’re working with using higher-level abstractions and by allowing you to refactor and extract reusable pieces at any level. You can drive infra definitions easily from typed data structures, you can add conditionals using the language's native syntax, and you can stop trying to program in a configuration language (Terraform HCL with surprises like non-short-circuited AND evaluation).

You still end up having IaC. You can still have a declarative infrastructure.
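
A rough sketch of what I mean, in plain TypeScript so it stands alone (no Pulumi or CDK imports). `ServiceSpec`, `ResourceDecl`, and `declare` are hypothetical names; the point is only that typed input plus an ordinary `if` still produces a plain declarative list of resources.

```typescript
// Typed input: what a team actually wants to say about a service.
interface ServiceSpec {
  name: string;
  isPublic: boolean;
  replicas: number;
}

// Declarative output: a flat description of the desired resources.
interface ResourceDecl {
  kind: "service" | "loadBalancer";
  name: string;
  props: Record<string, unknown>;
}

function declare(spec: ServiceSpec): ResourceDecl[] {
  const decls: ResourceDecl[] = [
    { kind: "service", name: spec.name, props: { replicas: spec.replicas } },
  ];
  // An ordinary conditional instead of a count or ternary workaround.
  if (spec.isPublic) {
    decls.push({
      kind: "loadBalancer",
      name: `${spec.name}-lb`,
      props: { target: spec.name },
    });
  }
  return decls;
}

const specs: ServiceSpec[] = [
  { name: "web", isPublic: true, replicas: 3 },
  { name: "worker", isPublic: false, replicas: 2 },
];

// The result is just data describing the desired end state.
const desired: ResourceDecl[] = [];
for (const spec of specs) {
  desired.push(...declare(spec));
}
```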

andrewaylett 7 hours ago | parent | next [-]

That's how we use CDK. Our CDK (in general) creates CloudFormation which we then deploy. As far as the tooling which we have for IaC is concerned, it's indistinguishable from hand-written CloudFormation — but we're able to declare our intent at a higher level of abstraction.

xyzzy123 7 hours ago | parent | prev [-]

Absolutely, the best case is that it's much better, safer, more readable, etc. However, the worst case is also worse. From the perspective of someone who provides devops support to multiple teams, Terraform is more "predictable".

spyspy 8 hours ago | parent | prev [-]

If you do use Terraform, for the love of god do NOT use Terraform Cloud. It's up there with GitHub in the list of least reliable cloud vendors. I always have a "break glass" method of deploying from my work machine for that very reason.