jandrewrogers 7 months ago

DELETE is expensive at a deep fundamental level that we don’t think about much in computer science because we are more worried about losing data. The article is about Postgres but it generalizes. We don’t actually have any computer science for DELETE optimized databases. I’ve idly looked into delete-optimization in databases as a thought experiment, since there isn’t much in the way of literature on it, and it is far more difficult than I think many people intuit. The article is partly a manifestation of this reality.

I think the nature of DELETE is one of the more interesting open problems in computer science. It is one of those things that, when we require precision, turns out to be very difficult to define.

mpweiher 7 months ago | parent | next [-]

Yes...but it goes even deeper.

For example, in physics, the paradox of Maxwell's Demon is resolved when you consider the cost of deleting data:

"In 1982, Charles Bennett showed that, however well prepared, eventually the demon will run out of information storage space and must begin to erase the information it has previously gathered.[8][12] Erasing information is a thermodynamically irreversible process that increases the entropy of a system."

https://en.wikipedia.org/wiki/Maxwell's_demon#Recent_progres...

It is also difficult for humans to delete information. In my humble and only slightly facetious opinion, this is one of the main drivers behind the constant stream of new "To Do" apps: the existing app gets full because deleting is too hard, so we start fresh with a new app. The app isn't the point, the starting fresh is.

The underlying reason is that the cost of maintaining small (to do) notes can be greater than the value of the note, which is one of the reasons we still use little scraps of paper and other mechanisms that effectively auto-delete.

Understanding the micronote lifecycle: improving mobile support for informal note taking

https://dl.acm.org/doi/10.1145/985692.985779

lukevp 7 months ago | parent | next [-]

Every day I’d have more and more state accumulation on my machine - open apps, unsaved changes, open tabs. I’ve tried many methods for preventing this from happening over the years, but the only truly effective solution, which I’ve been using for the last year, is a script I wrote that quits every browser tab and every open app (leaving apps with unsaved changes still running) every evening. I wake up and the machine is new and fresh. It’s amazing and has benefited my productivity so much. It changes the state game: if I have a specific task I want to resume tomorrow, I have to make myself a clear note to read the next day, and if I have tabs open that I care about, I have to bookmark them. What I’ve found is that the actual info that needs to be transferred between days is very small.
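
The script itself can be tiny. A minimal sketch, assuming macOS and AppleScript via osascript (the keep-list and app names are illustrative):

    #!/usr/bin/env python3
    # Nightly cleanup: politely quit every visible app. Apps with unsaved
    # changes refuse or prompt, so they keep running, as described above.
    import subprocess

    KEEP = {"Finder", "Terminal"}  # apps to leave alone (illustrative)

    def visible_apps():
        # Ask System Events for the names of all non-background apps.
        out = subprocess.run(
            ["osascript", "-e",
             'tell application "System Events" to get name of every '
             'process whose background only is false'],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        return [name.strip() for name in out.split(",") if name.strip()]

    for app in visible_apps():
        if app not in KEEP:
            subprocess.run(["osascript", "-e", f'tell application "{app}" to quit'])

Scheduled nightly via cron or launchd, it wipes the slate; anything that matters has to survive as a note or a bookmark.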

I have an endless reminders app and todo list. I wonder if something similar (items expire automatically unless you flag them as permanent or something) would help keep a clearer list. Sometimes ephemerality is best!

wenc 7 months ago | parent [-]

> Every day I’d have more and more state accumulation on my machine - open apps, unsaved changes, open tabs.

Your comment resonates with me.

I grew up during a time when PC state was ephemeral (the DOS days). Unsaved changes essentially meant lost data. Open apps were gone once the computer was shut down (and the computer was shut down daily - a 250W power supply had an expensive power draw; in contrast, a Mac Mini sips only 1W when sleeping). This helped me develop habits of keeping only necessary things open, bookmarking what I want to keep, and habitually pressing Ctrl+S to save my data. I never keep any tabs open (my browser loses all tabs upon close anyway), and my inbox is at zero.

The cost I pay for this is context recovery -- every day, I have to set everything up again. I do write notes or leave comments in code to remind myself where I left off, but I essentially start fresh each day. There is an upside to this, though: I start each day from an uncluttered slate, which leads to clarity in my head. When context is ephemeral, I'm less likely to be beholden to existing state.

This actually helps me increase the quality of my writing and my code. It's akin to the heuristic of "throwing away your first draft and rewriting" to achieve higher quality. Write your code once, delete it, and write it again from scratch. The next version takes a quarter of the time to write but can be twice the quality because you've prototyped it in your head but you are not bound to your crufty first attempt. There was an HN discussion on this a while ago:

https://grantslatton.com/software-pathfinding

If you save state, you can resume more quickly, which is great if you're on a roll, but it's also an accumulation of more clutter that blocks certain more creative thoughts.

alexwasserman 7 months ago | parent | prev | next [-]

I like the idea of a todo list that comes with built-in auto-delete.

You either do your to-dos, or it auto-deletes them for you. No worry about it getting full, but also some pressure to actually get them done before they're wiped.

And if you're happy they're wiped, then you probably didn't need to do it at all.

I wonder if there's something like that already.

teitoklien 7 months ago | parent | next [-]

You can already do this with Apple’s native Shortcuts and Reminders apps: set an automation to delete all reminders after a fixed time has passed with the ‘Remove Reminders’ action.

Otherwise you can just use cron jobs with a small Python snippet that decides which reminders, by label, to delete after X, Y, or Z days, and hits the APIs of most major todo apps like Todoist, Asana, etc. to delete those tasks. Heck, it works with your local markdown-based todo lists too.
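
For Todoist, for example, the snippet could be as small as this (a sketch assuming the REST v2 API - check endpoints and field names against the current docs; the label name and expiry window are illustrative):

    #!/usr/bin/env python3
    # Run nightly from cron: delete tasks labeled "ephemeral" once they
    # are older than MAX_AGE_DAYS.
    import os
    from datetime import datetime, timedelta, timezone

    import requests

    API = "https://api.todoist.com/rest/v2"
    HEADERS = {"Authorization": f"Bearer {os.environ['TODOIST_TOKEN']}"}
    MAX_AGE_DAYS = 14
    LABEL = "ephemeral"

    cutoff = datetime.now(timezone.utc) - timedelta(days=MAX_AGE_DAYS)
    for task in requests.get(f"{API}/tasks", headers=HEADERS, timeout=10).json():
        created = datetime.fromisoformat(task["created_at"].replace("Z", "+00:00"))
        if LABEL in task.get("labels", []) and created < cutoff:
            # Expired and flagged as ephemeral: delete it.
            requests.delete(f"{API}/tasks/{task['id']}", headers=HEADERS, timeout=10)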

perryizgr8 7 months ago | parent | prev [-]

This was a pretty nice feature for me in the now-defunct Arc browser. It simply deleted tabs that you didn't interact with for 12 hours. If it's important, I'll open it again!

MetaWhirledPeas 7 months ago | parent | prev | next [-]

> The app isn't the point, the starting fresh is.

This is the crux of what made Google Inbox so good. The UX encouraged "deleting" and starting fresh. This was made possible not just through the controls, but also through the promise that undelete would be possible. People want to start fresh, but they also don't want to lose anything; that's the conundrum.

loa_in_ 7 months ago | parent | prev | next [-]

Just add a __delete all__ button.

mycall 7 months ago | parent | next [-]

My solution to most task list items is to just wait long enough. Eventually the items are often no longer needed.

mpweiher 7 months ago | parent | prev [-]

Yeah, that should work.

Alas, it doesn't.

rpier001 7 months ago | parent | prev [-]

We need a to-do app following these principles... then another for product management.

mike_hearn 7 months ago | parent | prev | next [-]

What about LSM trees? Something like RocksDB is very efficient at deleting. A delete operation is a tiny write (a tombstone) and then the actual deletion is done via compaction in the background with entire tablets being freed at once when the live data is evacuated. It's actually most efficient when deleting large ranges - when just replacing data it's not so efficient due to the write amplification.
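
To make the mechanics concrete, here's a toy dict-based sketch of the idea (illustrative only, nothing like RocksDB's actual implementation): a delete is a constant-size tombstone write, and the space only comes back at compaction.

    TOMBSTONE = object()  # sentinel meaning "this key was deleted"

    class TinyLSM:
        def __init__(self):
            self.memtable = {}   # newest writes
            self.levels = []     # older immutable runs, newest first

        def put(self, key, value):
            self.memtable[key] = value

        def delete(self, key):
            # O(1) marker write; the old value isn't touched yet.
            self.memtable[key] = TOMBSTONE

        def flush(self):
            # Freeze the memtable into an immutable run.
            self.levels.insert(0, dict(self.memtable))
            self.memtable = {}

        def compact(self):
            # Merge runs oldest-first so newer entries shadow older ones,
            # then drop tombstones: this is where space is reclaimed.
            merged = {}
            for run in reversed(self.levels):
                merged.update(run)
            merged.update(self.memtable)
            self.memtable = {}
            self.levels = [{k: v for k, v in merged.items() if v is not TOMBSTONE}]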

That said, I agree with you in general. An under-used technique is to simply encrypt everything and then use a key store that is guaranteed capable of deleting data. This makes it easy to comply with legal deletion requests even across backups, though of course, you need to manage the keys very carefully and ensure that they are also backed up.
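
A minimal sketch of that encrypt-then-shred pattern, assuming the `cryptography` package and an in-memory stand-in for a real KMS (names are illustrative):

    from cryptography.fernet import Fernet

    key_store = {}  # stand-in for a key store that guarantees deletion

    def encrypt_for(user_id, plaintext: bytes) -> bytes:
        key = key_store.setdefault(user_id, Fernet.generate_key())
        return Fernet(key).encrypt(plaintext)

    def decrypt_for(user_id, ciphertext: bytes) -> bytes:
        return Fernet(key_store[user_id]).decrypt(ciphertext)

    def forget(user_id):
        # Destroying the key makes every copy of the user's ciphertext,
        # including copies in backups, permanently unreadable.
        del key_store[user_id]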

jandrewrogers 7 months ago | parent | next [-]

Encryption that allows precise deletion of records in databases is quite famously pathological for performance and cost. Databases that work this way have existed for many decades and almost no one uses them because the performance is unacceptably terrible.

The reason may not be obvious. At the limit, database data structures and algorithms are a collection of data compression algorithms, though we typically don’t think of databases this way. Using encryption in front of that compression renders it worthless, and most database performance is predicated on the ability to use compressive representations of the data. Encrypting the data forces the most naive and inefficient data structures imaginable.
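
The effect is easy to demonstrate: ciphertext is computationally indistinguishable from random bytes, so everything a compressor could exploit in the plaintext is gone. A stdlib-only sketch, modeling ciphertext as random bytes:

    import os
    import zlib

    # A repetitive column of row data: the kind of redundancy that
    # database storage formats exploit heavily.
    column = b"2024-01-01,order,shipped\n" * 10_000

    plain_ratio = len(zlib.compress(column)) / len(column)
    cipher = os.urandom(len(column))  # good ciphertext looks random
    cipher_ratio = len(zlib.compress(cipher)) / len(cipher)

    print(f"plaintext compresses to {plain_ratio:.1%} of original size")
    print(f"ciphertext compresses to {cipher_ratio:.1%} of original size")
    # Typically ~1% vs. ~100%: the compressive representation is lost.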

mike_hearn 7 months ago | parent [-]

Yes, encrypted databases aren't that useful for fine-grained data, especially if you get into academic computable encryption schemes.

For document DBs where you're storing files like big JSON structures or PDFs, or for archival data, it can work out. Though mostly it's not worth it because key management is too hard.

latency-guy2 7 months ago | parent | prev | next [-]

Well, tombstoning fundamentally punts the operation: the data is still there, taking up space and computation at various levels of the query plan, until the flagged entry actually gets removed.

I agree that it meets the requirements for batched DELETE, and that's likely the best we can do.

But I wonder if there is a better way. I know there are research DBs out there that experimented with reusing tombstone entries for new INSERT/UPDATE operations, but these suck when you want to do batched INSERT/UPDATE on a range, since the reused slots are scattered all about the table and you lose ordering and monotonicity.

mike_hearn 7 months ago | parent [-]

The way tombstones work in a sorted KV store like RocksDB is that queries walk up the levels, and the moment a tombstone is hit the walk stops, because it's known that the keys are deleted. Then when the levels are compacted, the live data is evacuated into a new tablet, so it's like generational GC. The cost scales with live data, not with how much data there is in total. The problem, of course, is that you pay that cost over and over again.
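
As a rough illustration, the read path for the toy TinyLSM sketched earlier in the thread would look like this (again, not RocksDB's actual code):

    def get(self, key):
        # Walk newest -> oldest; the first hit wins. Hitting a tombstone
        # ends the walk immediately: the key is known to be deleted, so
        # older levels need not be consulted.
        for run in [self.memtable, *self.levels]:
            if key in run:
                value = run[key]
                return None if value is TOMBSTONE else value
        return None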

selecsosi 7 months ago | parent | prev [-]

+1. This was our strategy at TempoIQ for our time-series storage engine (built on top of RocksDB).

Very efficient and effective at managing tons of data ingestion (and deletion) at scale.

It's not an easy out-of-the-box tech to build on top of, though, when you have to build all the analytics/management pieces that something like PG gets you, so I get the lack of public examples.

epistasis 7 months ago | parent | prev | next [-]

There was a common thread through all of my algorithms and data structures classes: hash maps, B-trees, etc. are all beautiful structures until you add a delete operation and have to start dealing with all those little holes... Removing data complicates everything.

toast0 7 months ago | parent | prev | next [-]

> We don’t actually have any computer science for DELETE optimized databases.

Depending on how much deleting and when, there might be engineering if not science for this.

If everything is deleted on a schedule, partitioning the database and dropping whole partitions as they expire is a well-worn path. Soft delete and compaction also work pretty well if most, but not all, things will be deleted - a generational garbage collection kind of thing.
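
The scheduled case in Postgres, sketched with psycopg2 (table and partition names are illustrative): expiring a month of data is a metadata operation, with no per-row DELETE and nothing for vacuum to chew on.

    import psycopg2

    conn = psycopg2.connect("dbname=app")  # illustrative DSN
    with conn, conn.cursor() as cur:
        # New month's partition; incoming rows route here automatically.
        cur.execute("""
            CREATE TABLE IF NOT EXISTS events_2025_07 PARTITION OF events
            FOR VALUES FROM ('2025-07-01') TO ('2025-08-01')
        """)
        # Expiring the oldest month: drop the whole partition at once.
        cur.execute("DROP TABLE IF EXISTS events_2025_01")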

As others said, fixed-size records are easier to manage deletion/replacement with, too.

kccqzy 7 months ago | parent | prev | next [-]

It's absolutely true that we don't think about it much. When I was first taught balanced binary trees, deletion wasn't even a topic that needed to be learned. The same thing happened later with hash tables. It's an operation that's overlooked in CS education.

akira2501 7 months ago | parent | prev | next [-]

> We don’t actually have any computer science for DELETE optimized databases.

There is actually a fair amount if you consider databases with fixed-length records, which used to be the dominant form.
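
For instance, with fixed-length records a delete can just push the slot onto a free list and let the next insert reuse it, so deletion is O(1) and leaves no holes to compact. A toy sketch (the record layout is illustrative):

    import struct

    REC = struct.Struct("16si")  # fixed 20-byte record: name + value

    class FixedFile:
        def __init__(self):
            self.slots = bytearray()
            self.free = []  # indices of deleted slots

        def insert(self, name: bytes, value: int) -> int:
            rec = REC.pack(name.ljust(16, b"\0"), value)
            if self.free:  # reuse a hole
                i = self.free.pop()
                self.slots[i * REC.size:(i + 1) * REC.size] = rec
            else:
                i = len(self.slots) // REC.size
                self.slots += rec
            return i

        def delete(self, i: int):
            self.free.append(i)  # O(1): just remember the hole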

redox99 7 months ago | parent | prev | next [-]

I think garbage-collected memory management can be thought of as a delete-optimized database.

tybit 7 months ago | parent | next [-]

Runtimes with garbage collectors typically optimize for allocation, not deletion.

mike_hearn 7 months ago | parent [-]

Generational GC optimizes for both. It assumes that most objects die young, so it relocates the live objects and just marks the entire evacuated region as empty, which is a very efficient way to delete data.
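
The trick, in a toy sketch: the collector's work is proportional to the survivors it copies out, while everything else in the region is deleted for free.

    # Evacuation: copy out the few live objects, then reclaim the whole
    # region at once. Dead objects cost nothing to "delete".
    nursery = [("req-1", False), ("req-2", False), ("session", True)]  # (obj, live?)

    old_gen = [obj for obj, live in nursery if live]  # work ~ survivors only
    nursery.clear()  # the entire region is reclaimed in one step
    print(old_gen)   # ['session']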

actionfromafar 7 months ago | parent | prev | next [-]

I have long day-dreamed of what a “use it (soon) or lose it” runtime would mean. Allocated blocks would just expire after a set time.

okasaki 7 months ago | parent [-]

So kind of like browsers work now with the tab unloading and Android killing apps?

Personally I find it really obnoxious and disruptive.

sitkack 7 months ago | parent [-]

You would prefer something a little more persistent?

dotancohen 7 months ago | parent | next [-]

I would prefer a 48 hour day, so that I could get everything done that needs to be done.

Or maybe a 72 hour day, so I'd have time for things that I'd just enjoy.

okasaki 7 months ago | parent | prev [-]

I prefer to do my own app lifecycle management.

SgtBastard 7 months ago | parent | prev [-]

Only if you profoundly misunderstand what GC is.

hinkley 7 months ago | parent | prev | next [-]

If someone dumped that project in my lap and said "fix it" (and if I were more used to low-level programming), I'd probably start by refreshing myself on the last 10+ years of GC advances since I stopped reading SIGPLAN, particularly multithreaded sweep. Essentially, you want to decouple delete from free so you don't have to do 100% of the bookkeeping work in the middle of time-sensitive operations - but without falling as far behind as Postgres can with its vacuuming albatross.

In a way, deletion is a form of eventual consistency. The user loses access to the data but the system still knows about it for a while.

Just off the top of my head: for LSM systems, I would think you'd resort to snapshotting as the edit history becomes much larger than the retained row count; as you delete old snapshots (two GC roots), you could compare the old and the new and drop everything that didn't survive. You only have to finish well before the next snapshot interval, and if you maintain a queue you only have to process them, on average, as fast as the snapshot interval.

And for BTree systems you can amortize the deletes across every insert, the way some realtime systems clean up a few free pointers on every allocation.
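
That amortization could look like this toy sketch (dict-backed store, illustrative constant): deletes only queue work, and each insert retires a couple of pending frees, so cleanup never stalls a time-sensitive operation.

    from collections import deque

    CLEANUP_PER_INSERT = 2

    class AmortizingStore:
        def __init__(self):
            self.data = {}
            self.pending_free = deque()

        def delete(self, key):
            self.data[key] = None          # logical delete: user-visible
            self.pending_free.append(key)  # physical free happens later

        def insert(self, key, value):
            self.data[key] = value
            # Retire a little deferred cleanup on every insert.
            for _ in range(min(CLEANUP_PER_INSERT, len(self.pending_free))):
                stale = self.pending_free.popleft()
                if self.data.get(stale) is None:
                    del self.data[stale]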

brightball 7 months ago | parent | prev | next [-]

Yep, it’s tough. One of the more unexpectedly complicated aspects of GDPR compliance is the strain that full deletes put on a system.

You have to break it into small batches across the cascade of associated data.
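
The batching usually ends up shaped like this (sketched with psycopg2; table and column names are illustrative): small chunks, one short transaction each, so locks stay brief and vacuum and replication can keep up.

    import psycopg2

    BATCH = 1000

    def purge_user(conn, user_id):
        # Repeat small DELETEs until nothing is left; each `with conn`
        # block commits its own short transaction.
        while True:
            with conn, conn.cursor() as cur:
                cur.execute(
                    """DELETE FROM events
                       WHERE id IN (SELECT id FROM events
                                    WHERE user_id = %s LIMIT %s)""",
                    (user_id, BATCH),
                )
                if cur.rowcount < BATCH:
                    return

    conn = psycopg2.connect("dbname=app")  # illustrative DSN
    purge_user(conn, 42)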
