| ▲ | fabian2k 3 days ago |
| Looks interesting for something like local development. I don't intend to run production object storage myself, but some of the stuff in the guide to the production setup (https://garagehq.deuxfleurs.fr/documentation/cookbook/real-w...) would scare me a bit: > For the metadata storage, Garage does not do checksumming and integrity verification on its own, so it is better to use a robust filesystem such as BTRFS or ZFS. Users have reported that when using the LMDB database engine (the default), database files have a tendency of becoming corrupted after an unclean shutdown (e.g. a power outage), so you should take regular snapshots to be able to recover from such a situation. It seems like you can also use SQLite, but a default database that isn't robust against power failure or crashes seems surprising to me. |
|
| ▲ | lxpz 3 days ago | parent | next [-] |
| If you know of an embedded key-value store that supports transactions, is fast, has good Rust bindings, and does checksumming/integrity verification by default such that it almost never corrupts upon power loss (or at least, is always able to recover to a valid state), please tell me, and we will integrate it into Garage immediately. |
| |
| ▲ | agavra 2 days ago | parent | next [-] | | Sounds like a perfect fit for https://slatedb.io/ -- it's just that (an embedded Rust KV store that supports transactions). It's built specifically to run on object storage, and currently relies on the `object_store` crate but we're considering OpenDAL instead, so if Garage works with those crates (I assume it does if it's S3 compatible) it should just work OOTB. | | |
| ▲ | evil-olive 2 days ago | parent [-] | | for Garage's particular use case I think SlateDB's "backed by object storage" would be an anti-feature. their usage of LMDB/SQLite is for the metadata of the object store itself - trying to host that metadata within the object store runs into a circular dependency problem. |
| |
| ▲ | johncolanduoni 2 days ago | parent | prev | next [-] | | I’ve used RocksDB for this kind of thing in the past with good results. It’s very thorough from a data corruption detection/rollback perspective (this is naturally much easier to get right with LSMs than B+ trees). The Rust bindings are fine. It’s worth noting too that B+ tree databases are not a fantastic match for ZFS - they usually require extra tuning (block sizes, other stuff like how WAL commits work) to get performance comparable to XFS/ext4. LSMs on the other hand naturally fit ZFS’s CoW internals like a glove. | |
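As a rough illustration of the parent's suggestion, here is a minimal sketch of opening RocksDB with corruption checks enabled, assuming the rust-rocksdb crate; the path, key, and value are placeholders, not anything Garage actually uses:

    use rocksdb::{Options, DB};

    fn main() -> Result<(), rocksdb::Error> {
        let mut opts = Options::default();
        opts.create_if_missing(true);
        // Surface any detected corruption as a hard error instead of
        // silently continuing (placeholder settings for illustration).
        opts.set_paranoid_checks(true);

        let db = DB::open(&opts, "/tmp/metadata-sketch")?;
        db.put(b"bucket/object-key", b"metadata blob")?;
        assert!(db.get(b"bucket/object-key")?.is_some());
        Ok(())
    }

The LSM structure means writes land in a WAL plus immutable SST files, which is why detection and rollback after a crash is comparatively straightforward, as the parent notes.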
| ▲ | fabian2k 2 days ago | parent | prev | next [-] | | I don't really know enough about the specifics here. But my main point isn't about checksums, it's more about something like the WAL in Postgres. For an embedded KV store this is probably not the solution, but my understanding is that there are data structures like LSM trees that would result in similar robustness. But I don't actually understand this topic well enough. Checksumming detects corruption after it happened. A database like Postgres will simply notice it was not cleanly shut down and put the DB into a consistent state by replaying the write-ahead log on startup. So that is kind of my default expectation for any DB that handles data that isn't ephemeral or easily regenerated. But I also likely have the wrong mental model of what Garage does with the metadata, as I wouldn't have expected that to ever be limited by SQLite. | | |
| ▲ | lxpz 2 days ago | parent [-] | | So the thing is, different KV stores have different trade-offs, and for now we haven't yet found one that has the best of all worlds. We do recommend SQLite in our quick-start guide to set up a single-node deployment for small/moderate workloads, and it works fine. The "real world deployment" guide recommends LMDB because it gives much better performance (with the current status of Garage, not to say that this couldn't be improved), and the risk of critical data loss is mitigated by the fact that such a deployment would use multi-node replication, meaning that the data can always be recovered from another replica if one node is corrupted and no snapshot is available. Maybe this should be worded better; I can see that the alarmist wording of the deployment guide is creating quite a debate, so we probably need to make these facts clearer. We are also experimenting with Fjall as an alternate KV engine based on LSM, as it theoretically has good speed and crash resilience, which would make it the best option. We are just not recommending it by default yet, as we don't have much data to confirm that it lives up to these expectations. |
| |
| ▲ | BeefySwain 3 days ago | parent | prev | next [-] | | (genuinely asking) why not SQLite by default? | | |
| ▲ | lxpz 3 days ago | parent [-] | | We were not able to get good enough performance compared to LMDB. We will work on this more though; there are probably many ways performance can be increased by reducing load on the KV store. | |
| ▲ | srcreigh 2 days ago | parent | next [-] | | Did you try WITHOUT ROWID? Your sqlite implementation[1] uses a BLOB primary key. In SQLite, this means each operation requires 2 b-tree traversals: The BLOB->rowid tree and the rowid->data tree. If you use WITHOUT ROWID, you traverse only the BLOB->data tree. Looking up lexicographically similar keys gets a huge performance boost since sqlite can scan a B-Tree node and the data is contiguous. Your current implementation is chasing pointers to random locations in a different b-tree. I'm not sure exactly whether on disk size would get smaller or larger. It probably depends on the key size and value size compared to the 64 bit rowids. This is probably a well studied question you could find the answer to. [1]: https://git.deuxfleurs.fr/Deuxfleurs/garage/src/commit/4efc8... | | |
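To make the suggestion concrete, here is a minimal sketch of a WITHOUT ROWID table with a BLOB primary key, assuming the rusqlite crate (the table and column names are made up, not Garage's actual schema):

    use rusqlite::{params, Connection};

    fn main() -> rusqlite::Result<()> {
        let conn = Connection::open("meta-sketch.sqlite")?;
        // The table is stored as a single clustered B-tree keyed on the
        // BLOB primary key, so lookups and range scans traverse one tree
        // instead of going BLOB -> rowid -> data.
        conn.execute_batch(
            "CREATE TABLE IF NOT EXISTS kv (
                 k BLOB PRIMARY KEY,
                 v BLOB NOT NULL
             ) WITHOUT ROWID;",
        )?;
        conn.execute(
            "INSERT OR REPLACE INTO kv (k, v) VALUES (?1, ?2)",
            params![&b"some-key"[..], &b"some-value"[..]],
        )?;
        Ok(())
    }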
| ▲ | lxpz 2 days ago | parent | next [-] | | Very interesting, thank you. It would probably make sense for most tables but not all of them because some are holding large CRDT values. | |
| ▲ | asa400 a day ago | parent | prev [-] | | Other than knowing this about SQLite beforehand, is there any way one could discover that this is happening through tracing? |
| |
| ▲ | tensor 2 days ago | parent | prev | next [-] | | Keep in mind that write safety comes with performance penalties. You can turn off write protections and many databases will be super fast, but easily corrupt. | |
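SQLite exposes that trade-off directly through its synchronous pragma; a small sketch assuming rusqlite (not how Garage actually configures its SQLite backend):

    use rusqlite::Connection;

    fn main() -> rusqlite::Result<()> {
        let conn = Connection::open("bench-sketch.sqlite")?;
        // FULL: fsync on every transaction commit; crash-safe but slower.
        conn.execute_batch("PRAGMA synchronous = FULL;")?;
        // OFF: much faster writes, but a power loss or OS crash can
        // corrupt the database file.
        // conn.execute_batch("PRAGMA synchronous = OFF;")?;
        Ok(())
    }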
| ▲ | rapnie 2 days ago | parent | prev | next [-] | | I learned that Turso apparently has plans for a rewrite of libsql [0] in Rust, and to create a more 'hackable' SQLite alternative altogether. It was apparently discussed in this Developer Voices [1] video, which I haven't yet watched. [0] https://github.com/tursodatabase/libsql [1] https://www.youtube.com/watch?v=1JHOY0zqNBY | |
| ▲ | skrtskrt 2 days ago | parent | prev [-] | | Could you use something like Fly's Corrosion to shard and distribute the SQLite data?
It uses CRDT-based reconciliation, which is familiar territory for Garage. | |
| ▲ | lxpz 2 days ago | parent [-] | | Garage already shards data by itself if you add more nodes, and it is indeed a viable path to increasing throughput. |
|
|
| |
| ▲ | __padding a day ago | parent | prev | next [-] | | I’ve not looked at them in a while, but sled/rio were interesting up-and-coming options
https://github.com/spacejam/sled | |
| ▲ | ndyg a day ago | parent | prev | next [-] | | Fjall https://github.com/fjall-rs/fjall | |
| ▲ | __turbobrew__ 2 days ago | parent | prev | next [-] | | RocksDB possibly. Used in high throughput systems like Ceph OSDs. | |
| ▲ | patmorgan23 2 days ago | parent | prev | next [-] | | Valkey? | |
| ▲ | VerifiedReports 2 days ago | parent | prev [-] | | It's "key/value store", FYI | | |
| ▲ | kqr 2 days ago | parent | next [-] | | It's not a store of "keys or values", no. It's a store of key-value pairs. | | |
| ▲ | VerifiedReports 2 days ago | parent [-] | | A key-value store would be a store of one thing: key values. A hyphen combines two words to make an adjective, which describes the word that follows: A used-car lot
A value-added tax
A key-based access system
When you have two exclusive options, two sides to a situation, or separate things, you separate them with a slash: An on/off switch
A win/win situation
A master/slave arrangement
Therefore a key-value store and a key/value store are quite different. | | |
| ▲ | kqr 2 days ago | parent [-] | | All of your slash examples represent either–or situations. A switch turns it on or off; the situation is a win in the first outcome or a win in the second outcome, etc. It's true that key–value store shouldn't be written with a hyphen. It should be written with an en dash, which is used "to contrast values or illustrate a relationship between two things [... e.g.] Mother–daughter relationship" https://en.wikipedia.org/wiki/Dash#En_dash I just didn't want to bother with typography at that level of pedantry. | |
| ▲ | VerifiedReports 2 days ago | parent [-] | | No, they don't. A master/slave configuration (of hard drives, for example) involves two things. I specifically included it to head off the exact objection you're raising. "...the slash is now used to represent division and fractions, as a date separator, in between multiple alternative or related terms" -Wikipedia And what is a key/value store? A store of related terms. And if you had a system that only allowed a finite collection of key values, where might you put them? A key-value store. | | |
| ▲ | kqr 2 days ago | parent [-] | | The hard drives are either master or slave. A hard drive is not a master-and-slave. | | |
| ▲ | VerifiedReports a day ago | parent [-] | | Exactly. And an entry in a key/value store is either a key or a value. Not both. | | |
| ▲ | kqr a day ago | parent [-] | | No, an entry is a key-and-value pair. Are you seriously suggesting it is possible to add only keys without corresponding values, or vice versa? |
|
|
|
|
|
| |
| ▲ | abustamam 2 days ago | parent | prev | next [-] | | Wikipedia seems to find "key-value store" an appropriate term. https://en.wikipedia.org/wiki/Key%E2%80%93value_database | | | |
| ▲ | DonHopkins 2 days ago | parent | prev [-] | | Which is infinite if the value is zero. |
|
|
|
| ▲ | yupyupyups 2 days ago | parent | prev | next [-] |
| Depending on the underlying storage being reliable is far from unique to Garage. This is what most other services do too, unless we're talking about something like Ceph, which manages the physical storage itself. Standard filesystems such as ext4 and XFS don't have data checksumming, so you'll have to rely on another layer to provide integrity. Regardless, that's not Garage's job imo. It's good that they're keeping their design simple and focusing their resources on implementing the S3 spec. |
|
| ▲ | moffkalast 3 days ago | parent | prev | next [-] |
| That's not something you can do reliably in software; datacenter-grade NVMe drives come with power loss protection and additional capacitors to handle that gracefully. Otherwise, if power is cut at the wrong moment, the partition may not be mountable afterwards. If you really live somewhere with frequent outages, buy an industrial drive that has a PLP rating. Or get a UPS, they tend to be cheaper. |
| |
| ▲ | crote 3 days ago | parent [-] | | Isn't that the entire point of write-ahead logs, journaling file systems, and fsync in general? A roll-back or roll-forward due to a power loss causing a partial write is completely expected, but surely consumer SSDs wouldn't just completely ignore fsync and blatantly lie that the data has been persisted? As I understood it, the capacitors on datacenter-grade drives are to give it more flexibility, as it allows the drive to issue a successful write response for cached data: the capacitor guarantees that even with a power loss the write will still finish, so for all intents and purposes it has been persisted, so an fsync can return without having to wait on the actual flash itself, which greatly increases performance. Have I just completely misunderstood this? | | |
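For reference, this is the contract under discussion, sketched with Rust's standard library (the file name and record are placeholders): sync_all() maps to fsync, and the whole question is whether the drive honors it.

    use std::fs::OpenOptions;
    use std::io::Write;

    fn main() -> std::io::Result<()> {
        let mut f = OpenOptions::new()
            .create(true)
            .append(true)
            .open("wal-segment.log")?;
        f.write_all(b"record")?; // may still sit in the OS page cache
        f.sync_all()?;           // fsync: ask the OS and the drive to persist it
        // Only after sync_all() returns Ok is the record supposed to survive
        // a power loss -- unless the drive lies about its write cache.
        Ok(())
    }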
| ▲ | unsnap_biceps 3 days ago | parent | next [-] | | You actually don't need capacitors for rotating media: Western Digital has a feature called "ArmorCache" that uses the rotational energy in the platters to power the drive long enough to sync the volatile cache to non-volatile storage. https://documents.westerndigital.com/content/dam/doc-library... | | |
| ▲ | Aerolfos 2 days ago | parent | prev | next [-] | | > but surely consumer SSDs wouldn't just completely ignore fsync and blatantly lie that the data has been persisted? That doesn't even help if fsync() doesn't do what developers expect: https://danluu.com/fsyncgate/ I think this was the blog post that had a bunch more stuff that can go wrong too: https://danluu.com/deconstruct-files/ But basically fsync itself (sometimes) has dubious behaviour, then the OS layers on top of the kernel handle it dubiously, and then even on top of that most databases can ignore fsync errors (and lie that the data was written properly). So... yes. | |
| ▲ | Nextgrid 3 days ago | parent | prev [-] | | > ignore fsync and blatantly lie that the data has been persisted Unfortunately they do: https://news.ycombinator.com/item?id=38371307 | | |
| ▲ | btown 3 days ago | parent [-] | | If the drives continue to have power, but the OS has crashed, will the drives persist the data once a certain amount of time has passed? Are datacenters set up to take advantage of this? | | |
| ▲ | Nextgrid 3 days ago | parent | next [-] | | > will the drives persist the data once a certain amount of time has passed Yes, otherwise those drives wouldn't work at all and would have a 100% warranty return rate. The reason they get away with it is that the misbehavior is only a problem in a specific edge-case (forgetting data written shortly before a power loss). | |
| ▲ | unsnap_biceps 3 days ago | parent | prev [-] | | Yes, the drives are unaware of the OS state. |
|
|
|
|
|
| ▲ | igor47 3 days ago | parent | prev | next [-] |
| I've been using minio for local dev but that version is unmaintained now. However, I was put off by the minimum requirements for garage listed on the page -- does it really need a gig of RAM? |
| |
| ▲ | dsvf 2 days ago | parent | next [-] | | I always understood this requirement as "Garage will run fine on hardware with 1GB RAM total" - meaning the 1GB includes the RAM used by the OS and other processes. I think that most current consumer hardware that is a potential Garage host, even on the low end, has at least 1GB of total RAM. | |
| ▲ | archon810 3 days ago | parent | prev | next [-] | | The current latest Minio release that is working for us for local development is now almost a year old and soon enough we will have to upgrade. Curious what others have replaced it with that is as easy to set up and has a management UI. | | |
| ▲ | mbreese 2 days ago | parent [-] | | I think that's part of the pitch here... swapping out Minio for Garage. Both scale a lot more than for just local development, but local dev certainly seems like a good use-case here. |
| |
| ▲ | lxpz 3 days ago | parent | prev [-] | | It does not, at least not for a small local dev server. I believe RAM usage should be around 50-100MB, increasing if you have many requests with large objects. |
|
|
| ▲ | nijave 2 days ago | parent | prev [-] |
| The assumption is that nodes are in different fault domains, so it'd be highly unlikely to ruin the whole cluster. LMDB mode also runs with flushing/syncing disabled |