| ▲ | mariopt 6 hours ago |
| Every time I see this kind of article, no one really bothers about db/server redundancy, load balancers, etc. Are we OK with just 1 big server that may fail and bring several services down? You saved a lot of money, but you'll spend a lot of time on maintenance and future headaches. |
|
| ▲ | grey-area 6 hours ago | parent | next [-] |
| It depends on the service and how critical that website is. Sometimes it's completely acceptable that a server will run for 10 years with say 1 week or 1 month of downtime spread over those 10 years, yes. That's the sort of uptime you can see with single servers that are rarely changed and over-provisioned, as many on Hetzner are. Some examples: small businesses where the website is not core to operations and is more of a shop-front or brochure for their business. Hobby websites don't really matter either if they go down for short periods of time occasionally. Many forums and blogs just aren't very important, and downtime is no big deal. There are a lot of these websites, and they are at the lower end of the market for obvious reasons, but they are probably the majority of websites in fact: the long tail of low-traffic sites. Not everything has to be high availability, and if you do want that, these providers usually provide load balancers etc too. I think people forget here sometimes that there is a huge range in hosting, from Squarespace to cheap shared hosting to more expensive self-hosted and provisioned clouds like AWS. |
| |
| ▲ | rzz3 5 hours ago | parent | next [-] | | What struck me though is that OP did so much work to migrate the server with zero downtime. The _single_ big server. Something’s off here. | | |
| ▲ | grey-area 5 hours ago | parent | next [-] | | Well, why have downtime if you can avoid it with a bit of work? But I do agree the poster should think about this. I don't think it's 'off' or misleading; they just haven't encountered a hardware error before. If they had one on this single box with 30 databases and 34 Nginx sites it would probably be a bad time, and yes, they should think about that a bit more perhaps. They describe a db follower for cutover, for example, but could also have one for backups, plus rolling backups offsite somewhere (perhaps they do and it just didn't make it into this article). That would reduce risk a lot. Then of course they could put all the services on several boxes behind a load balancer. But perhaps if the services aren't really critical it's not worth spending money on that; it depends partly on what these services/apps are. | | |
| ▲ | nine_k 3 hours ago | parent [-] | | Besides, "Migrated 34 websites in one go with zero downtime" looks good on a resume, and is actually a useful skill. |
| |
| ▲ | anticorporate 5 hours ago | parent | prev | next [-] | | I run internal services on DO that I've considered moving to Hetzner for cost savings. Could I take it down for the afternoon? Sure. Or could I wait and do it after hours? Also sure. But would I rather not have to deal with complaints from users that day and still go home by 5pm? Of course! | |
| ▲ | BorisMelnik 5 hours ago | parent | prev | next [-] | | to be fair a lot of ppl still run this way and just have really good backups, or have an offline / truly on-prem server where they can flip the DNS switch in case of a true outage. | | |
| ▲ | grey-area 5 hours ago | parent [-] | | Yes and for many services that is totally fine. As long as you have backups of data and can redeploy easily. It's not how I personally do things usually but there is definitely a place for it. |
| |
| ▲ | chairmansteve 5 hours ago | parent | prev | next [-] | | Good point. I run single big servers. But I can bring them down every weekend for the entire weekend if I need to. | |
| ▲ | j45 5 hours ago | parent | prev [-] | | There is software that can help a lot. Also, in general, you can architect your application to be more friendly to migration. It used to be a normal thing to think about and plan for. VMware has a conversion tool that converts bare metal into images. One could image, then do regular snapshots, maybe centralize a database being accessed. Sometimes it's possible to create a migration script that you run over and over to the new environment for each additional step. Others can put a backup server in between to not put a load on the drive. Digital Ocean makes it impossible to download your disk image backups which is a grave sin they can never be forgiven for. They used to have some amount of it. Still, a few commands can back up the running server to an image, and stream it remotely to another server, which in turn can be updated to become bootable. This is the tip of the iceberg in the number of tasks that can be done. Someone with experience can even instruct LLMs to do it and build it, and someone skilled with LLMs could probably work to uncover the steps and strategies for their particular use case. |
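The "back up the running server to an image and stream it remotely" step described above can be sketched roughly as follows. This is a hedged example, not the commenter's exact method: the hostnames and device paths are placeholders, and it's safest to run from a rescue system so the source filesystem isn't changing mid-copy.

```shell
# On the old server (ideally booted into a rescue system so the
# disk is quiescent). Streams a compressed raw image of the whole
# disk to the new box and writes it to its disk over SSH.
# root@new-host and /dev/sda are assumptions for illustration.
dd if=/dev/sda bs=64K status=progress \
  | gzip -c \
  | ssh root@new-host 'gunzip -c | dd of=/dev/sda bs=64K'
```

After the copy, the new server typically still needs its bootloader reinstalled and its network config or /etc/fstab adjusted before it boots cleanly.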
| |
| ▲ | wild_egg 4 hours ago | parent | prev | next [-] | | A week of downtime every decade I think still works out to a higher uptime than I've been getting from parts of GitHub lately. So I'd consider that a win. | |
| ▲ | j45 5 hours ago | parent | prev | next [-] | | Respectfully, this type of "high availability" strawman is a dated take. This is a general response to it. I have run hosting on bare metal for millions of users a day. Tens of thousands of concurrent connections. It can scale way up by doing the same thing you do in a cloud: provision more resources. For "downtime" you do the same thing with metal as you do with Digital Ocean: just get a second server and have them fail over. You can run hypervisors to split and manage a metal server just like Digital Ocean. Except you're not vulnerable to shared memory and CPU exploits on shared hosting like Digital Ocean. When Intel CPU or memory flaws or kernel exploits come out, as they have, one VM user can read the memory and data of all the other processes belonging to other users. Both Digital Ocean and IaaS/PaaS providers are still running similar Linux technologies to do the failover. There are tools that even handle it automatically, like Proxmox. This level of production-grade failover and simplicity was point and click 10 years ago. Except no one's kept up with it. The cloud is convenient. Convenience can make anyone comfortable. Comfort always costs way more. It's relatively trivial to put the same web app on a metal server, with a hypervisor/IaaS/PaaS, behind the same Cloudflare to access "scale". Digital Ocean and cloud providers run on metal servers just like Hetzner. The software to manage it all is becoming more and more trivial. | | |
| ▲ | nh2 an hour ago | parent | next [-] | | While I generally agree, this is an exaggeration: > This level of production grade fail over and simplicity was point and click, 10 years ago. While some of the tools are _designed_ for point and click, they don't always work. Mostly because of bugs. We run Ceph clusters under our product, and have seen a fair share of non-recoveries after temporary connection loss [1], kernel crashes [2], performance degradations on many small files, and so on. Similarly, we run HA Postgres (Stolon), and found bugs in its Go error checking that cause failure to recover from crashes and full-disk conditions [3] [4]. This week, we found that full-disk situations will not necessarily trigger failovers. We also found that if DB connections are exhausted, the daemon that's supposed to trigger Postgres failover cannot connect to do so (currently testing the fix). I believe that most of these things will be more figured out with hosted cloud solutions. I agree that self-hosting HA with open-source software is the way to go. This software is good, and the more people use it, the fewer bugs it will have. But I wouldn't call it "trivial". If you have large data, it is also brutally cheaper: we could hire 10 full-time sysadmins for the cost of hosting on AWS, vs doing our own Hetzner HA with Free Software, and we only need ~0.2 sysadmins. And it still has higher uptime than AWS. It is true that Proxmox is easy to set up and operate. For many people it will probably work well for a long time. But when things aren't working, it's not so easy anymore. [1]: "Ceph does not recover from 5 minute network outage because OSDs exit with code 0" - https://tracker.ceph.com/issues/73136 [2]: "Kernel null pointer dereference during kernel mount fsync on Linux 5.15" - https://tracker.ceph.com/issues/53819 [3]: https://github.com/sorintlab/stolon/issues/359#issuecomment-... [4]: https://github.com/sorintlab/stolon/issues/247 | |
| ▲ | grey-area 5 hours ago | parent | prev [-] | | I'm not arguing for cloud or against bare metal hosting, just saying there is a broad range of requirements in hosting and not everyone needs or wants load balancers etc - it clearly will cost more than this particular poster wants to pay as they want to pay the bare minimum to host quite a large setup. |
| |
| ▲ | jijijijij 5 hours ago | parent | prev [-] | | I feel like 95% of the web falls into this category. Like, have you ever said "That's it, I am never gonna visit this page again!" because of temporary downtime? Unless you are Amazon and every minute costs you bazillions, you are likely gonna get the better deal by not worrying about availability and scalability. That 250€/m root server is a behemoth. Complete overkill for most anything. As a bonus, you won't go down along with half the internet when someone at AWS or Cloudflare touches DNS. | | |
| ▲ | coryrc 5 hours ago | parent | next [-] | | Exactly. I've never not bought something because the website was temporarily down. I've even bought from B&H Photo! Even if Amazon was down, if I was planning to buy, I'd wait. Heck, I've got a bunch of crap in my cart right now I haven't checked out. Intentional downtime lets everyone plan around it, and reduces costs by not needing N layers of marginal utility which are all fragile and prone to weird failures at times you don't intend. | | |
| ▲ | jijijijij 4 hours ago | parent [-] | | For me at least, the only thing where availability really matters is main personal communication services. If Signal was down for an hour, I'd be a little stressed. Maybe utilities like public transportation, too, but that's because I now have to do that online. > Intentional downtime lets everyone plan around it, reduces costs by not needing N layers of marginal utility which are all fragile and prone to weird failures at times you don't intend. Quite frankly, I would manage if things were run "on-supply" with solar and would just go dark at night. |
| |
| ▲ | Aurornis 5 hours ago | parent | prev | next [-] | | > Like, have you ever said "That's it, I am never gonna visit this page again!", because of temporary downtime? That's a strawman version of what happens. There have been times when I've tried to visit a webshop to buy something but the site was broken or down, so I gave up and went to Amazon and bought an alternative. I've also experienced multiple business situations where one of our services went down at an inconvenient time, a VP or CEO got upset, and they mandated that we migrate away from that service even if alternatives cost more. If you think of your customers or visitors as perfectly loyal with infinite patience then downtime is not a problem. > Unless you are Amazon and every minute costs you bazillions, you are likely gonna get the better deal not worrying about availability and scalability. That 250€/m root server is a behemoth. Complete overkill for most anything. You don't need every minute of downtime to cost "bazillions" to justify a little redundancy. If you're spending 250 euros/month on a server, spending a little more to get a load balancer and a pair of servers isn't going to change your spend materially. Having two medium size servers behind a load balancer isn't usually much more expensive than having one oversized server handling it all. There are additional benefits to having the load balancer set up for future migrations, or to scale up if you get an unexpected traffic spike. If you get a big traffic spike on a single server and it goes over capacity you're stuck. If you have a load balancer and a pair of servers you can easily start a 3rd or 4th to take the extra traffic. | | |
| ▲ | jijijijij 5 hours ago | parent [-] | | > There have been times when I've tried to visit a webshop to buy something but the site was broken or down, so I gave up and went to Amazon and bought an alternative. Great. So how much did the webshop lose in that hour of maintenance (which realistically would be in the middle of the night for their main audience), and how much would they have paid for redundancy? It's also a bit hard to believe you repeatedly ran into the situation of an item being sold at a self-hosted webshop and on Amazon alike. Are you sure they hadn't just messed up the web dev biz? You could totally do that with AWS too... > If you're spending 250 euros/month on a server, spending a little more to get a load balancer and a pair of servers isn't going to change your spend materially. Of course, but that's not the argument. It's implied you can just double the 250€/m server for redundancy, as you would still get an offer at a fraction of cloud prices. But really, that server needs no more optimization in terms of hardware diversification. As I said, it's complete overkill. Blogs and forums could easily run on a 30€/m recycled machine. |
| |
| ▲ | thelastgallon 5 hours ago | parent | prev [-] | | > Like, have you ever said "That's it, I am never gonna visit this page again!" Spot on! People still go to Chick-fil-A, even if they are closed on Sundays! |
|
|
|
| ▲ | Aurornis 5 hours ago | parent | prev | next [-] |
| These articles are popular where there's a mismatch between application requirements and the solution chosen. When someone over-engineers their architecture to be enterprise-grade (substitute your own definition of enterprise-grade) for what is really a hobby project or a small business, where a day of downtime every once in a while just means your customers come back the next day, going all-out on cloud architecture is maybe not necessary. That's why you see so many comments from people arguing that downtime isn't always a big deal or that risking an outage is fine: there are a lot of applications where this is kind of true. The confusing part about this article is the emphasis on a zero-downtime migration toward a service that isn't really ideal for uptime. It wouldn't be that expensive to add a little bit of architecture on the Hetzner side to help with this. I guess if you're doing a migration and you're paid salary or your time is free-ish, doing the migration in a zero-downtime way is smart. It's a little funny to see the emphasis on zero downtime juxtaposed with an architecture where uptime depends on nothing ever failing. |
| |
| ▲ | j45 5 hours ago | parent [-] | | Downtime is a strawman. Clever architecture will always beat cleverly trying to pick only one cloud. Being cloud agnostic is best. This means setting up a private cloud. Hosted servers and managed servers are perfectly capable of near-zero downtime. This is because it's the same equipment (or often more consumer-grade) that the "cloud" runs on, and the cloud plans for even more failure. Digital Ocean definitely does not guarantee zero downtime. That's a lot of 9's. It's simple to run well-established tools like Proxmox on bare metal that will do everything Digital Ocean promises, and it's not susceptible to attacks or exploits where shared memory and CPU usage leak what customers believe is their private VPS. "Nothing ever failing" in the case of a tool like Proxmox means: install it on two servers, join them as nodes, have the VM exist on both, click high availability, and it's generally up and running. Put Cloudflare in front of it, per today's best practices. If you're curious about this, there are some pretty eye-opening and short videos on Proxmox available on YouTube that are hard to unsee. | | |
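The two-node Proxmox flow the comment describes maps roughly onto a few CLI steps. This is a sketch under assumptions: the cluster name, IPs, and VM ID 100 are placeholders, and a two-node cluster also needs an external quorum device so the survivor can keep running when one node dies.

```shell
# On the first server: create the cluster.
pvecm create mycluster

# On the second server: join it as a node (IP of the first node).
pvecm add 10.0.0.1

# Two nodes alone can't break a tie on which side survived a split;
# add an external quorum device running on a third small machine.
pvecm qdevice setup 10.0.0.3

# Replicate VM 100's disks to the other node every 5 minutes
# (requires ZFS-backed storage), then mark the VM highly available.
pvesr create-local-job 100-0 node2 --schedule "*/5"
ha-manager add vm:100 --state started
```

With that in place, Proxmox restarts the VM on the surviving node after a failure, losing at most the last replication interval of writes; shared storage (e.g. Ceph) removes even that window at the cost of more complexity.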
| ▲ | nine_k 3 hours ago | parent [-] | | Sadly, hardware breaks. You still need a working backup and a working failover plan, even if it's just setting up a new server and running your Terraform / Pulumi / Saltstack scripts. | | |
| ▲ | j45 an hour ago | parent [-] | | I'm not sure if you read my post. When you have 2 nodes running, both are mirrored and running, so one can have hardware break. Also, hardware can provide failure notifications before it breaks, and experience teaches you to just update and upgrade before hard drives break. Since tools like Proxmox let you just add a node, you add new hardware, mark the VM for that node to mirror, and it is taken care of. Terraform etc. can sit below Proxmox and alleviate what you're speaking about. Some examples: https://www.youtube.com/watch?v=dvyeoDBUtsU |
|
|
|
|
| ▲ | grebc 16 minutes ago | parent | prev | next [-] |
| Downtime happens in so many different contexts of life that a web site/service being knocked offline is so far down the priority list for most people. It's amusing that the US government can shut down for days/weeks/months over budget reasons and there are no adult discussions about fixing the cause. Yet the latest HN demo that 100 people will use needs all-nines reliability and gets hundreds of responses. |
|
| ▲ | chillfox 5 hours ago | parent | prev | next [-] |
| A lot of things don't need that. Also, don't underestimate the reliability of simplicity. I was a Linux sysadmin for many years, and I have never seen as much downtime from simple systems as I routinely see from the more complicated setups. Somewhere between theory and reality, simpler systems just come out ahead most of the time. |
|
| ▲ | daneel_w 6 hours ago | parent | prev | next [-] |
| They may be making this decision based on a long history of, in fact, never really having run into "a lot of time in maintenance and future headaches". |
| |
| ▲ | VorpalWay 5 hours ago | parent [-] | | To be fair, I migrated a VPS from Linode to Hetzner a few years ago. Minor downtime is a non-issue: personal website and email server. I approximately halved the monthly cost, and I haven't had any downtime except what I caused myself when rebooting to upgrade the kernel every now and then. As a bonus, Hetzner is European. |
|
|
| ▲ | wiether 5 hours ago | parent | prev | next [-] |
| To be fair, they were using a single VM on DigitalOcean, so they didn't have the perks of a cloud provider, except maybe the fact that a VM is probably more fault-tolerant than a bare metal server. Usually those articles describe two situations: - they were "on the cloud" for the wrong reasons and migrating to something more physical is the right approach
- they were "on the cloud" for the right reasons and migrating to something more physical is going to be a disaster
Here they appear to be in the first situation.
If their setup was running fine on DO and they put the right DR policies in place at Hetzner, they should be fine. |
|
| ▲ | ahofmann 5 hours ago | parent | prev | next [-] |
| In 20 years of hosting all kinds of web services, some of them serving over 200m requests per month, a crashing single server was a problem exactly twice. Dealing with over-engineered bullshit that behaved in strange ways and disrupted the service was a problem far more often. So, yes, redundancy is something that can be left out, if you're comfortable being responsible for fixing things on a Saturday morning. |
| |
| ▲ | jijijijij 3 hours ago | parent [-] | | People also tend to underestimate how much compute these dedicated servers have compared to cloud offerings, and what that feels like without 100 layers of management abstraction in between. You are likely never going to choke a plenty-cored, funny-RAMed root server at a fraction of your cloud costs. This overkill resource estate can be the answer to a lot of scalability worries. It's always there, no sharing at all. |
|
|
| ▲ | pier25 2 hours ago | parent | prev | next [-] |
| I don't know about Hetzner but with Upcloud and Vultr my single VPS setups have been more reliable than multiregion with redundancy setups with other providers like Fly. |
| |
| ▲ | stephenhuey 2 hours ago | parent [-] | | A few weeks ago, I tested deploying Rails apps to Hetzner and Vultr for the first time, using Hatchbox to deploy onto them. I'm still supporting clients on Heroku, but there are potential new projects in the coming months that I might deploy elsewhere. Render is decent in some cases, but you can get a lot of bang for your buck deploying on Vultr, and Hatchbox makes it easy to do, whether you have one instance or a cluster. Hatchbox also helps with putting multiple apps/domains on a single server, a concept I had to give up long ago on Heroku. I've thought about deploying to DO plenty of times over the years, but there was always Heroku, and if I had to find a new home for Rails 8, I think I'd skip it in favor of a more powerful Vultr server. Hatchbox can provision Postgres for you, but Vultr has managed Postgres, which is appealing to me. Or if you're just using SQLite with Rails 8, that's easy to do with Hatchbox but not on Render, since Render has an ephemeral file system. |
|
|
| ▲ | neya 5 hours ago | parent | prev | next [-] |
| I was thinking the same. A managed database is pretty much set and forget. I do NOT miss the old times of monitoring my email during routine security checkups, hoping my database hadn't been hacked by some script kiddie, accompanied by blackmail over email. |
|
| ▲ | chalmovsky 5 hours ago | parent | prev | next [-] |
| What you are running on it is the only question that matters. Obviously you don't want air traffic control to go down, but some app… so what if it goes down? The backup is somewhere else, if you even need it anyway. GitHub has uptime of less than 90% according to this: https://mrshu.github.io/github-statuses/ . And the world keeps turning. Obviously we should strive for better, but also let's please stop making a fetish out of uptime; for the vast majority of apps it absolutely doesn't fucking matter. |
|
| ▲ | littlecranky67 5 hours ago | parent | prev | next [-] |
| To be fair, modern dedicated servers at Hetzner have two power supply units and come with a redundant SSD/HDD RAID-1 config. AFAIK both the SSDs and the power units have hotplug capability, so in case either fails they can be replaced with zero downtime. Given the downtimes we saw in the past year(s) (AWS, Cloudflare, Azure - the latter down several times), I would argue moving to any of the big cloud providers doesn't give you much of a better guarantee. I myself am a Hetzner customer with a dedicated vServer, meaning it is a shared virtual server but with dedicated CPUs (read: still oversubscribed, but some performance guarantee), and I have had zero hardware-based downtime for years [0]. I would guess their vServers are on similarly redundant hardware where the failing components can be hotswapped. [0] = They once within the last 3 years sent me an email that they had to update a router, which would affect network connectivity for the vServer, but the notification came weeks in advance and the work lasted about 15 minutes. No reboot/hardware failure on my vServer though. |
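As a sketch of what replacing a failed member of such a RAID-1 looks like when it's Linux software RAID (an assumption; some setups use hardware RAID instead, and the `/dev/sdX` device names here are placeholders):

```shell
# Check array health; a failed member shows up as (F) or a missing
# slot like [U_] in the status line.
cat /proc/mdstat

# Mark the dying disk's member as failed and remove it from the array.
mdadm --manage /dev/md0 --fail /dev/sdb1
mdadm --manage /dev/md0 --remove /dev/sdb1

# After the drive is hot-swapped: copy the partition layout from the
# surviving disk, then re-add the partition. md resyncs while the
# system stays up.
sfdisk -d /dev/sda | sfdisk /dev/sdb
mdadm --manage /dev/md0 --add /dev/sdb1

# Watch the rebuild progress.
cat /proc/mdstat
```

The whole procedure happens with the server online; the only real cost is degraded redundancy until the resync finishes.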
|
| ▲ | wg0 5 hours ago | parent | prev | next [-] |
| If you have the setup within the server fully scripted and automated (bash, pyinfra or ansible etc.) and backups are in place, then recovery isn't that hard. Downtime, sure: maybe a couple of hours, during which you can point your DNS entries to a static page while you're restoring everything. Not a bad tradeoff for 99.8% of shops out there. |
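A minimal sketch of the backup half of that tradeoff (all paths and function names are made up for illustration; a real setup would also ship the archive offsite and back up service configs):

```shell
#!/usr/bin/env bash
# Minimal backup/restore pair: tar up a data directory, and later
# unpack it onto a freshly provisioned server.
set -euo pipefail

# make_backup <data_dir> <archive> - archive a directory with its name
make_backup() {
  tar czf "$2" -C "$(dirname "$1")" "$(basename "$1")"
}

# restore_backup <archive> <target_parent_dir> - unpack it back out
restore_backup() {
  mkdir -p "$2"
  tar xzf "$1" -C "$2"
}
```

Pair this with a provisioning script (ansible/pyinfra) that rebuilds packages and configs, and "recovery" becomes: run two scripts, repoint DNS.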
|
| ▲ | supermatt 5 hours ago | parent | prev | next [-] |
| They were already on "1 big server" (a single VPS at Digital Ocean) and moved to another "1 big server" (a managed server at Hetzner). They saved money and lost nothing. Now, if they so wish, they could use a portion of that to increase redundancy - but that wasn't the point of the article. |
|
| ▲ | jdboyd 5 hours ago | parent | prev | next [-] |
| DO doesn't do high-availability droplets, and their migration policy amounts to "we will try, if we detect poor health of the server before it fails". If someone starts thinking about redundancy and load balancers, then DO's solution is to rent a second similarly sized droplet and add their load balancing service. If you do those things with Hetzner instead, you would still be spending less than you did with Digital Ocean. Personally, what is keeping me on DO is that no single droplet I have is large enough to justify moving on its own, and I'm not prepared to deal with moving everything. |
|
| ▲ | ozim 5 hours ago | parent | prev | next [-] |
| If you can restore from a snapshot to a new instance on your cloud provider, having a second copy running is a waste of money. I know people like FAANG LARPing. Not everyone has the budget or the need to run four nines with 24/7 support and FAANG-level traffic. |
|
| ▲ | timwis 6 hours ago | parent | prev | next [-] |
| I wondered the same! FWIW I'm currently migrating from managed postgres to self-managed on hetzner with [autobase](https://autobase.tech/). Though of course for high availability it requires more than one server. |
|
| ▲ | BorisMelnik 5 hours ago | parent | prev | next [-] |
| I agree with you. Even for the servers I am responsible for, I always make decisions like putting the db on Supabase instead of local, hosting files on S3 with versioning/multi-region etc., and then of course coming up with a backup and snapshot system. |
|
| ▲ | Gud 5 hours ago | parent | prev | next [-] |
| What time in maintenance? Hetzner has been rock solid for me. |
|
| ▲ | pinkgolem 5 hours ago | parent | prev | next [-] |
| Tbh, my one-server Paperless deployment has a higher uptime than most services. If your scaling need is not that high, you can get very far with a single server |
|
| ▲ | PunchyHamster 5 hours ago | parent | prev | next [-] |
| Their original setup also ran on a single server. If you can tolerate a few hours of downtime and some data rollback/loss, a single server + robust backups can be a viable strategy |
|
| ▲ | jgalt212 5 hours ago | parent | prev | next [-] |
| Hetzner has cheap load balancers and VMs. |
|
| ▲ | izacus 3 hours ago | parent | prev | next [-] |
| I had like... less than 10 minutes downtime on Hetzner in years (funny enough, that makes my personal containers more reliable than productionized AWS and GCP deployments with their constant partial outages).
So perhaps all that complexity (beyond maybe a backup container) isn't really necessary for companies where a bit of downtime doesn't really affect revenue? Like, I know Leetcode tells otherwise, but most companies really don't need full FAANG stack with 99.999% uptime. A day of outage in a few years isn't going affect bottom lines. |
|
| ▲ | surgical_fire 5 hours ago | parent | prev | next [-] |
| The vast majority of services are actually alright with a little downtime here and there. In exchange, maintenance is a lot simpler with fewer moving parts. People underestimate how far you can go with one or two servers. In fact, what I have seen in my career is many examples of services that should have been running on one or two servers and instead went for a hugely complex microserviced approach, all-in on cloud providers, with crazy reliability requirements for a scale that never came. |
|
| ▲ | NicoJuicy 5 hours ago | parent | prev [-] |
| Depends on the app and how long downtime would take. Deploying a new Docker instance, or just restoring the app from a snapshot and restoring the latest db, is in most cases enough. |