mike_hearn | 7 months ago
What about LSM trees? Something like RocksDB is very efficient at deleting. A delete operation is a tiny write (a tombstone) and then the actual deletion is done via compaction in the background, with entire tablets being freed at once when the live data is evacuated. It's actually most efficient when deleting large ranges - when just replacing data it's not so efficient due to the write amplification.

That said, I agree with you in general. An under-used technique is to simply encrypt everything and then use a key store that is guaranteed capable of deleting data. This makes it easy to comply with legal deletion requests even across backups, though of course, you need to manage the keys very carefully and ensure that they are also backed up.
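A minimal sketch of that crypto-shredding idea, assuming Python's `cryptography` package; the in-memory dicts standing in for the key store and the database are illustrative assumptions, not how any particular system lays this out:

    # Crypto-shredding sketch: each user's records are encrypted with a
    # per-user key held in a separate key store. "Deleting" the user means
    # destroying the key, which renders every copy of the ciphertext
    # (including backups) unreadable.
    from cryptography.fernet import Fernet

    key_store = {}          # hypothetical key store; in practice an HSM/KMS
    encrypted_records = {}  # stand-in for the database and its backups

    def write_record(user_id: str, record_id: str, plaintext: bytes) -> None:
        if user_id not in key_store:
            key_store[user_id] = Fernet.generate_key()
        encrypted_records[(user_id, record_id)] = Fernet(key_store[user_id]).encrypt(plaintext)

    def read_record(user_id: str, record_id: str) -> bytes:
        return Fernet(key_store[user_id]).decrypt(encrypted_records[(user_id, record_id)])

    def forget_user(user_id: str) -> None:
        # Legal deletion request: drop the key; the ciphertext everywhere becomes garbage.
        del key_store[user_id]

    write_record("alice", "r1", b"alice's data")
    print(read_record("alice", "r1"))
    forget_user("alice")
    # read_record("alice", "r1") now fails: the data is effectively gone,
    # even in backups that still contain the ciphertext.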
jandrewrogers | 7 months ago
Encryption that allows precise deletion of records in databases is quite famously pathological for performance and cost. Databases that work this way have existed for many decades and almost no one uses them because the performance is unacceptably terrible. The reason may not be obvious. At the limit, database data structures and algorithms are a collection of data compression algorithms, though we typically don’t think of databases this way. Using encryption in front of that compression renders it worthless, and most database performance is predicated on the ability to use compressive representations of the data. Encrypting the data forces the most naive and inefficient data structures imaginable. | ||||||||
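A quick way to see the effect being described: the redundancy that storage engines exploit (run-length encoding, prefix compression, dictionary encoding, etc.) simply isn't there once the data has been encrypted. A toy sketch, assuming only the standard library plus the `cryptography` package:

    import zlib
    from cryptography.fernet import Fernet

    # Highly redundant, column-like data: exactly what databases compress well.
    plaintext = b"status=OK,region=us-east-1;" * 10_000

    ciphertext = Fernet(Fernet.generate_key()).encrypt(plaintext)

    print(len(zlib.compress(plaintext)))   # tiny: the repeated values compress away
    print(len(zlib.compress(ciphertext)))  # barely shrinks: ciphertext has essentially no
                                           # exploitable redundancy (any small gain is just
                                           # the base64 encoding Fernet happens to use)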
| ||||||||
latency-guy2 | 7 months ago
Well, tombstoning fundamentally punts the operation: the data is still there, taking up space and computation, until the flagged entries are removed from the various levels that query plans touch. I agree that it meets the requirements for batched DELETE, and that's likely about as good as we can make it. But I wonder if there is a better way. I know there are research DBs out there that experimented with reusing tombstone entries for new INSERT/UPDATE operations, but those are poor for batched INSERT/UPDATE over a range, since the reusable slots are scattered all about the table and you lose ordering + monotonic properties.
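To make the "punting" concrete, here is a toy in-memory sketch of the LSM behaviour under discussion (an illustration of the technique, not RocksDB's actual implementation): a delete is just a tombstone appended to the newest segment, reads must still step over it, and the space is only reclaimed when compaction merges segments and drops the shadowed entries.

    # Toy LSM sketch: newest segment first. A DELETE is a tombstone record;
    # the old value keeps occupying space (and read effort) until compaction.
    TOMBSTONE = object()

    segments = []  # list of dicts; index 0 = newest (memtable-ish), higher = older

    def put(key, value):
        segments[0][key] = value

    def delete(key):
        segments[0][key] = TOMBSTONE  # tiny write; nothing is freed yet

    def get(key):
        # Read path scans newest-to-oldest, which is where tombstones cost you.
        for seg in segments:
            if key in seg:
                return None if seg[key] is TOMBSTONE else seg[key]
        return None

    def compact():
        # Merge everything; shadowed values and tombstones are finally dropped.
        merged = {}
        for seg in reversed(segments):  # oldest first, so newer entries win
            merged.update(seg)
        segments[:] = [{k: v for k, v in merged.items() if v is not TOMBSTONE}]

    segments.append({})        # start with an empty newest segment
    put("a", 1); put("b", 2)
    segments.insert(0, {})     # "flush": freeze the old segment, open a new one
    delete("a")
    assert get("a") is None and get("b") == 2
    compact()                  # only now are "a" and its tombstone physically gone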
| ||||||||
selecsosi | 7 months ago
+1. This was our strategy at TempoIQ for our time-series storage engine (built on top of RocksDB). Very efficient and effective at managing tons of data ingestion (and deletion) at scale. It's not an easy technology to build on out of the box, though, since you have to build all the analytics/management pieces that something like PG gives you for free, so I get the lack of public examples.