vintagedave 2 hours ago

Hi Jake. Appreciate your presence here on HN.

This affected a seemingly random set of services across three of my accounts (pro and hobby, depending on whether it's for work or just myself). That ranges from WordPress to static site hosting to a custom Python server. All of the deployments showed as Online, even after receiving a SIGTERM.

While 3% is 'good', that's an awfully wide range of things across multiple accounts for me, so it doesn't feel like 3% ;) Please publish the post mortem. I am a big fan of Railway but have really struggled with the number of issues recently. You don't want to get GitHub's growing rep. Some people are already requesting I move one key service away, since this is not the first issue.

Finally, can I make a request re communication:

> If you are experiencing issues with your deployment, please attempt a re-deploy.

Why can't Railway restart or redeploy any affected service? This _sounds_ like you're requiring 3% of your users to manually fix the issue. I don't know if that's a communication problem or the actual solution, but I certainly had to do it manually, server by server.

justjake 2 hours ago

Totally! People who see the impact will likely see more than, say, 3% of their services impacted. Not all disruption is created equal.

We rolled out a change to update our fraud model, and that uses workload fingerprinting

Since, in all likelihood, your projects are similarly structured, there will be more impacted workloads if the shape of your workloads was in the "false positive" set.

Will have more information soon but very valid (and astute) feelings!

vintagedave an hour ago

> We rolled out a change to update our fraud model, and that uses workload fingerprinting

> Since, in all likelihood, your projects are similarly structured...

Thanks for the info. For what it's worth and to inform your retrospective, this included:

* A WordPress frontend, with just a few posts, minimal traffic -- but one that had been posted to LinkedIn yesterday

* A Docusaurus-generated static site. Completely static.

* A Python server where the workload would show OpenAI API usage, with consistent behavioural patterns for at least two months (and I strongly doubt its patterns differ from those of any other hosted service that calls OpenAI).

These all seem pretty different to me. Some that _are_ similarly structured (eg a second Python OpenAI-using server) were not killed.

Some things come to mind for your post-mortem:

* If 3% of your services were affected, does that match your expected fraud rate? That is an awful lot of customers to take down in one go, and you'd want to be very accurate in your modeling. I can't see how you'd plan to kill that many without false positives and negative media.

* I'm speaking only for myself, but I cannot understand what these three services have in common, nor how at least 2/3 of them (WordPress, static HTML) could seem anything other than completely normal.

* How or why were customers not notified? I have used services before where, if something seemed dodgy, they would proactively reach out and say 'tell us if it's legit or in 24 hours it will be shut down', or for something truly bad, eg massive CPU usage affecting other services, they'd kill it right away but would _tell you_. Invisible SIGTERMs to random containers, which we find out about the hard way, seem the exact opposite of sensible handling of supposedly questionable clients.
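
To make that last point concrete, here is a rough sketch of the "warn first, kill only after a grace period, and always tell the customer" handling I mean. Everything in it is hypothetical: the `Flag` fields, the `decide` function and the 24-hour window are my own illustration of what other hosts have done, not anything I know about Railway's system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class Action(Enum):
    NONE = "none"                        # leave the workload alone
    WARN = "warn"                        # notify the owner, start the grace period
    KILL_AND_NOTIFY = "kill_and_notify"  # stop the workload AND tell the owner


@dataclass
class Flag:
    """A hypothetical fraud/abuse flag raised against one workload."""
    severe: bool                          # e.g. actively harming other tenants
    warned_at: Optional[datetime] = None  # when the owner was first notified


# Assumed grace period, mirroring the '24 hours' example in the comment above.
GRACE = timedelta(hours=24)


def decide(flag: Optional[Flag], now: datetime) -> Action:
    """Pick the next step for a workload; no path kills silently."""
    if flag is None:
        return Action.NONE
    if flag.severe:
        # Truly bad behaviour: stop immediately, but still notify.
        return Action.KILL_AND_NOTIFY
    if flag.warned_at is None:
        # First sighting of a merely suspicious workload: warn only.
        return Action.WARN
    if now - flag.warned_at >= GRACE:
        # Owner had the full grace period and the flag is still open.
        return Action.KILL_AND_NOTIFY
    return Action.NONE
```

The point of the shape is that a false positive on something like a static site costs the customer an email, not an outage.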