jchw | 4 days ago
Look, this is pointless. I'm not learning anything new when you tell me that it can and will happen. How will it happen, and how often? Hence linking to Uber's case study on the issue. The answer? Not that much. Uber ran race detection in production over a 6-month period and found 2,000 distinct race conditions. Ouch, that sounds horrible! But wait: we're talking about 50 million lines of Go code and 2,100 services at the time of that writing. That works out to roughly 1 race condition per 25,000 lines of code and about 1 race condition per service.

That lines up pretty well with my experience. I haven't had a production outage or serious correctness issue caused by a race condition in Go, but I have seen roughly one or two race conditions make it to production per service. Those codebases were probably somewhere between 10,000 and 25,000 lines of code, so not far off that scale. But again, a data race doesn't always lead to a serious production outage. It can be worse (corrupting data and polluting your production database, in the worst case), but it's usually better: wonky behavior with no long-term effects, or a service that periodically crashes and restarts, dropping some requests but causing no long-term downtime.

Uber has no doubt seen at least some Go data races cause actual production outages, but they've also seen at least 2,000 that didn't; otherwise those would likely have been caught before the race detector found them, since Go dumps stack traces on crash. That has to tell you something about the actual probability of a data race causing a production outage. Again, you do you, but I won't be losing sleep over this. It's something to be wary of when working on Go services, but it's manageable.
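To make the "wonky but not an outage" failure mode concrete, here's a minimal toy sketch (my own example, not from the Uber study) of the kind of data race the Go race detector flags. Run it with `go run -race main.go` and you get a race report; run it without `-race` and it usually "works", just with a slightly wrong count:

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	var wg sync.WaitGroup
	counter := 0 // shared across goroutines with no mutex or atomic: data race

	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				counter++ // racy read-modify-write on the shared counter
			}
		}()
	}

	wg.Wait()
	// Without -race this prints something, often less than 2000, and the
	// process exits cleanly: no crash, no stack trace, just a wrong number.
	fmt.Println("counter =", counter)
}
```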
zozbot234 | 4 days ago | parent
Identifiable "wonky" behavior and periodic crashes seem like a very real issue to me. This wouldn't fly for any mission-critical service, it's something that demands a root cause analysis. Especially since it's hard to be sure after the fact that no data has been corrupted somehow or that security invariants have not been violated due to the "wonky" behavior. | ||||||||||||||||||||||||||||||||||||||||||||