dev_l1x_be 4 days ago

> But I struggle to be as productive

You should factor TCO into your productivity calculation. Can you write Python/Go etc. faster? Sure! Can you operate them in production with the same TCO as Rust? Absolutely not. Most of the time, the person debugging production issues and data races is different from the one who wrote the code. This creates the illusion that productivity is better with Python/Go.

After spending 20+ years around production systems as both a systems and a software engineer, I think Rust is here to reduce TCO by moving the mental burden of writing data-race-free software from production to development.

zelphirkalt 3 days ago

TCO? Tail call optimization? But that doesn't make sense in this context. hm.

bodski 3 days ago

https://en.wikipedia.org/wiki/Total_cost_of_ownership

jchw 4 days ago

With the notable inclusion of Google, where the SRE team is usually separate from the SWE team (though it wasn't in my particular case), I was actually doing operations and code at all of my jobs, at least at points and usually for most of the job. This is partly by my own choice. I do mean all of them, though; I just don't love listing my work history publicly everywhere, to keep some separation.

So, my first job actually started as a pure Python gig. Operations for Python/Django absolutely sucked ass. Deploying Django code reliably was a serious challenge. We got better over time by using tools like Vagrant and Docker and eventually Kubernetes, so the differences between production and dev/testing faded and became less notable. But frankly, no matter what we did, not causing production issues with Django/Python was a true-to-life nightmare. Accidental type errors that tests didn't catch were easy to introduce, MyPy couldn't easily cover much of the code, and the Django ORM made it very easy to accidentally cause horrible production behavior (that, of course, would look okay locally with tiny amounts of data). This is actually the original reason I switched to Go in the first place, at my first job, around 2016. The people I worked with are still around to attest to this; if you want, I can probably get them to chime in on this thread, since I still talk to some of them.

Go was a totally different story. Yes, we did indeed have some concurrency pains, which really didn't exist in Python for obvious reasons, but holy shit, we could really eke a lot of performance out of Go code compared to Python. We had previously been afraid we might have to move data-heavy workloads from Twisted (not related to the Django stuff) to something like C++ or maybe even optimized Java, but Go handled them easily and let us saturate the network interfaces on our EC2 boxes. (A lot of communication went over WebSockets, and the standards for WebSocket compression took a long time to settle and become universally supported, so we actually experimented with implementing the LZ4 compression scheme in JS. I ended up writing my own LZ4 implementation based, I believe, on the algorithms from the C version. It turned out to cost too much compute, though. But we had to try anyway.)

So how many reliability problems did we end up having doing all this? Honestly, not a whole lot on the Go side of things. The biggest production issue I ever ran into was one where the Kubernetes AWS integration blew up because we had too many security groups. I had to make an emergency patch to kubelet in the early hours to solve that one :) We did run into at least one serious Go-related issue over time, and it was indeed concurrency related: when Go 1.6 came out, the runtime started detecting concurrent misuse of maps. And guess what? We had one! It didn't trigger very often, but in some cases we could hit a fairly trivial concurrent map access. Before Go 1.6 it didn't seem to crash, though it could cause some weird behavior when it did trigger; now it was a crash that we could debug. It was a dumb mistake, and it definitely underscores the value of borrow checking; "just don't mess up" will never prevent all mistakes, obviously. I will never tell you that I think borrow checking is useless, and really, I would love to write 100% correct software all the time.
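For readers who haven't hit this class of bug: a plain Go map is not safe for concurrent use, and since Go 1.6 the runtime may abort with "fatal error: concurrent map writes" when it happens to notice two goroutines touching one unsynchronized (the check is best-effort; `go run -race` catches far more). A minimal sketch of the usual fix is below, guarding the map with a `sync.Mutex`. The `SafeCounter` type and its methods are hypothetical illustrations, not the code from the system described above.

```go
package main

import (
	"fmt"
	"sync"
)

// SafeCounter is a hypothetical example type: a map guarded by a mutex so
// that concurrent goroutines never read and write it at the same time.
type SafeCounter struct {
	mu     sync.Mutex
	counts map[string]int
}

func NewSafeCounter() *SafeCounter {
	return &SafeCounter{counts: make(map[string]int)}
}

// Inc increments the count for key while holding the lock.
func (c *SafeCounter) Inc(key string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[key]++
}

// Get reads the count for key while holding the lock.
func (c *SafeCounter) Get(key string) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[key]
}

func main() {
	c := NewSafeCounter()
	var wg sync.WaitGroup
	// 100 goroutines each increment the same key 10 times. Without the mutex,
	// this is exactly the access pattern the runtime may abort on.
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 10; j++ {
				c.Inc("events")
			}
		}()
	}
	wg.Wait()
	fmt.Println(c.Get("events")) // prints 1000
}
```

Nothing in Go's type system forces you to remember the lock, which is the gap the borrow-checking argument above is pointing at; `sync.Map` is an alternative for some read-heavy access patterns.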

That said though, that really is most of the extent of the production issues we had with Go. Go was a serious workhorse and we were doing reasonably non-trivial things in Go. (I had essentially built out a message queue system for unreliable delivery of very small events. We had a firehose of data coming in with many channels of information and needed to route those to the clients that needed them and handle throttling/etc. Go was just fantastic at this task.) Over time things got easier too, as Go kept updating and improving, helping us catch more bugs.
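The post doesn't describe how that event router was built, so the following is only a sketch of one common way to do unreliable fan-out in Go, assuming per-client buffered channels: publishing never blocks, and a client that falls too far behind simply loses events. The `Broker`, `Event`, `Subscribe`, and `Publish` names are hypothetical, and unsubscribing and channel cleanup are omitted for brevity.

```go
package main

import (
	"fmt"
	"sync"
)

// Event is a hypothetical small message tagged with the channel it belongs to.
type Event struct {
	Channel string
	Payload string
}

// Broker routes events to subscribers by channel name. Delivery is
// deliberately unreliable: if a subscriber's buffer is full, the event is
// dropped for that subscriber instead of blocking the publisher.
type Broker struct {
	mu   sync.RWMutex
	subs map[string][]chan Event
}

func NewBroker() *Broker {
	return &Broker{subs: make(map[string][]chan Event)}
}

// Subscribe registers interest in a channel and returns a buffered receive
// channel; bufSize bounds how far a slow client may fall behind.
func (b *Broker) Subscribe(channel string, bufSize int) <-chan Event {
	ch := make(chan Event, bufSize)
	b.mu.Lock()
	b.subs[channel] = append(b.subs[channel], ch)
	b.mu.Unlock()
	return ch
}

// Publish fans an event out to every subscriber of its channel and returns
// how many subscribers actually received it (the rest were dropped).
func (b *Broker) Publish(ev Event) int {
	b.mu.RLock()
	defer b.mu.RUnlock()
	delivered := 0
	for _, ch := range b.subs[ev.Channel] {
		select {
		case ch <- ev:
			delivered++
		default:
			// Subscriber is too slow; drop rather than block everyone else.
		}
	}
	return delivered
}

func main() {
	b := NewBroker()
	fast := b.Subscribe("prices", 10)
	slow := b.Subscribe("prices", 1)
	for i := 0; i < 3; i++ {
		b.Publish(Event{Channel: "prices", Payload: fmt.Sprint(i)})
	}
	fmt.Println(len(fast), len(slow)) // prints 3 1
}
```

The non-blocking `select` with a `default` case is what makes this a throttle: one slow consumer can't stall the firehose for everyone else, which only makes sense when, as described here, the data is small, frequent, and tolerant of loss.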

I can only come to one conclusion: people who put Go and Python in the same class are just ignorant of the realities of the situation. There are cases where Rust will be immensely valuable because you really can't tolerate a correctness problem, but here's the thing about that Go concurrent map access issue: while it could cause some buggy behavior and eventually caused some crashes, it never caused any serious downtime or customer issues. The event delivery system was inherently dealing with unreliable data streams, and we had multiple instances. If there was a blip, clients would just reconnect and people would barely notice anything, even if they were actively logged in. (In fact, we didn't do anything special for rolling deployments to this service, because the frontend component was built to handle a disconnection gracefully. If it reconnected quickly enough, there was no visual disturbance.)

That's where the cost/benefit analysis gets tricky, though. Python, Django, and even Twisted are actually pretty nice, and I'm sure they're even better now than when we originally left them (to be clear, we did still have some minor things in Django after that, but they were mostly internal-only services). Python and Django had great things like the built-in admin panel, which, while it couldn't meet everyone's needs, was pretty extensible and usable on its own. It took us a while to outgrow it for various use cases. Go has no equivalent to many Django conveniences, so if you haven't fully outgrown, e.g., the Django admin panel and ORM, it's hard to give up those features entirely.

Throughout all of this, we had a lot more issues with our JS frontend code than we ever did with either Python/Django or Go, though. We went through trying so many things to fix that, including Elm and Flow, and eventually the thing that really did fix it, TypeScript. But that is another story. (Boy, I sure learned a lot on my first real career job.)

At later jobs, Go continued not to be at the center of most of the production issues I faced running Go software. That's probably partly because Go wasn't doing a lot of the most complicated work; oftentimes the most complicated bits were message queues, databases, and to some degree even memory caches, and the Go bits mostly acted as glue (albeit glue with application logic, to be sure).

So is the TCO of Go higher than that of Rust? I dunno. You can't easily measure it, since you don't get to explore parallel universes where you made different choices.

What I can say is that Go has been a choice I never regretted making all the way from the very first time and I would choose it again tomorrow.