breakingcups 2 days ago:
Is there a better solution for self-healing S3 storage that you could recommend? I'm also curious what will make a Rook cluster croak after some time, and what kind of maintenance is required in your experience.
__turbobrew__ 2 days ago:
I have unfortunately gotten a Ceph cluster into a bad enough state that I just had to delete the pools and start from scratch. It was due to improper sequencing when removing OSDs, but that is kind of the point: you have to know what you are doing to do things safely. For the most part I have learned by blundering into things and taking hard lessons. When mistreated, Ceph clusters can get into death spirals that only an experienced practitioner can avert, by very carefully modifying cluster state through things like upmaps. You also need to understand your failure domains and how to spread mons and OSDs across them to handle failures properly. Lots of people don't think about this, and then one day a rack goes poof, the data wasn't replicated across racks, and there's data loss. Same thing with mons: you should deploy them across at least 3 failure domains (ideally 3 different datacenters) to maintain quorum during an outage.
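A rough sketch of the kind of sequencing and CRUSH setup this points at, using standard ceph CLI commands; the OSD id (12), pool name (mypool), and rule name (rack-replicated) here are just placeholders:

    # Safe-ish OSD removal: drain first, verify, then purge
    ceph osd ok-to-stop 12                     # would stopping this OSD leave PGs unavailable?
    ceph osd out 12                            # mark it out so data migrates off it
    ceph -s                                    # wait until no PGs are degraded/misplaced
    ceph osd safe-to-destroy osd.12            # confirm no PG still depends on this OSD
    ceph osd purge 12 --yes-i-really-mean-it   # remove it from CRUSH, auth, and the OSD map

    # Rack-level failure domain: replicate across racks, not just hosts
    ceph osd crush rule create-replicated rack-replicated default rack
    ceph osd pool set mypool crush_rule rack-replicated

Mon placement across failure domains is a deployment concern rather than a single command: pin the mons to hosts in different racks or datacenters so losing one domain can't break quorum.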
adamcharnock 2 days ago:
I haven't used it yet, but RustFS sounds like it has self-healing.
adastra22 2 days ago:
Ceph?