Only by a weird definition of "scalable". The first sentence says:

> Counterintuitively, large DELETEs add work to the database.

There is nothing counterintuitive about this. It takes just as much work to delete a row as it takes to insert a row. Why wouldn't it? Obviously you have to do almost all the same operations: write a log, write the deletion, update indices, replicate it, etc.

And yes, it's a well-known trick for all major relational databases (not just Postgres) that if you want to delete 90% of rows from a large table, it's much faster to just copy the rows you want to keep to a new table, run DROP TABLE on the old table, and rename the new table to the old table. Since DROP TABLE is ~instantaneous, mainly involving table-level metadata.

DELETE scales just fine, in the sense that if you are constantly inserting and deleting individual rows, DELETE scales the same as INSERT.

Basic database functionality is designed around the assumption of lots of small transactions. Whenever you have to do something involving millions of rows at once, you generally need to investigate solutions that work well in "bulk". E.g. loading rows directly from a file rather than with SQL, adding indices only after the data has been loaded rather than before, disabling foreign key checks on large operations (if you know by design that the keys are valid)... and yes, taking advantage of DROP TABLE instead of DELETE. This doesn't mean small transactions aren't scalable, it just means bulk operations are qualitatively different and benefit from their own solutions. And DELETE is no different from INSERT in this regard.

▲

setr 7 hours ago | parent | next [-]

> It takes just as much work to delete a row as it takes to insert a row. Why wouldn't it? Obviously you have to do almost all the same operations: write a log, write the deletion, update indices, replicate it, etc.

It takes far more work to delete/update than insert. My recent example is updating ~2TB of text data was about 40x slower than inserting 12TB (was trying to correct some large text truncation that occurred during migration into PG, ended up being faster to redo).

▲

crazygringo 7 hours ago | parent [-]

> It takes far more work to delete/update than insert.

Updating rows of text data is going to be more work, because variable-length text can't be updated in-place. So in terms of allocating space, it's more like a delete plus an insert. That's not surprising. (An in-place update that doesn't touch indices is generally going to be faster than an insert, though.)

I'm not aware of instances where a delete is "far more work" than an equivalent insert though. That's not the general case, and I'm having a hard time thinking of any situations where that would be true.

▲

gopalv 6 hours ago | parent | next [-]

> So in terms of allocating space, it's more like a delete plus an insert.

Unless you're using zHeap, you have a narrow Heap-only-Tuples scenario where the indexes stay the same. TOAST kinda helps there, if the update is off the tuple area itself. The original zHeap docs have a lot of detail about why an UNDO log can help with long running transactions from the past etc.

That is a postgresql specific thing though. Mysql indexes were created with the idea of different storage engines in mind, so Mysql doesn't suffer from the index update ovehead on update/delete the same way.

Uber had a long blog post about switching to Mysql from Postgres for wide tables with hundreds of indexes. The HN entry is still there[1], but I can't read the original post now.

As a side note, I've used postgres partitions to the same effect to drop old data periodically - detach and then drop the partition instead of a direct DELETE (similar tricks in HBase existed).

[1] - https://news.ycombinator.com/item?id=10894047

▲

fidotron 6 hours ago | parent | prev | next [-]

> I'm not aware of instances where a delete is "far more work" than an equivalent insert though. That's not the general case, and I'm having a hard time thinking of any situations where that would be true.

Transactionally across related items with constraints it can explode fast.

If you've ever used FoundationDB this rapidly becomes the defining PITA due to the transaction size limits. Adding/inserting/updates are all far more predictably bounded.

▲

tremon 4 hours ago | parent [-]

But in that case, you need to compare like-for-like with the situation where you need to insert all the prerequisite rows too. You can't just compare a delete cascade with a single insert where all the foreign keys are already satisfied.

▲

fidotron 2 hours ago | parent [-]

The whole problem with the delete cascade is you can't tell how big it will be until you have entered the transaction to do it. An insert you either know or it will fail and you can retry.

▲

tremon an hour ago | parent [-]

That's true, but now you have moved the goalposts. The original claim upthread was "it takes just as much work to delete a row as it takes to insert a row", not "it's hard to predict the performance of a delete with cascade effects". And the obvious rebuttal to that is that it's equally hard to provide an upper bound for the runtime of a single insert: an application cannot control the other processes running on the database, some of which may delay, interfere with or even invalidate your query and you must account for that. A delete operation is just as much "it might fail and you can retry" as an insert, or the database you're working with isn't ACID-compliant.

▲

fidotron an hour ago | parent [-]

> And the obvious rebuttal to that is that it's equally hard to provide an upper bound for the runtime of a single insert

This is precisely where you're going wrong. The insert is upper boundable in advance (you know the set of everything you might potentially have to insert), the delete isn't because you don't know what's in the db until you look.

I strongly recommend poking around with Foundation for this, because it becomes clear that this problem is the defining flaw with the way they tried to architect with layers, to the point they have a queuing system for processing large jobs of this type.

	▲	ElectricalUnion 16 minutes ago \| parent [-]
		> The insert is upper boundable in advance A concurrent DML happening then suddenly your MERGE INTO WHEN NOT MATCHED INSERT/INSERT INTO SELECT is way larger that you thought? I thought "some workloads can suddenly be way larger that I expected" was supposed to be a thing in all non-trivial DML.

▲

pixl97 6 hours ago | parent | prev [-]

Not directly database related, but when it comes to writing files on disks, deletes on SSDs can be rather expensive because of the delete block size vs a simple write.

	▲	2 hours ago \| parent [-]
		[deleted]

▲

echoangle 7 hours ago | parent | prev | next [-]

> And yes, it's a well-known trick for all major relational databases (not just Postgres) that if you want to delete 90% of rows from a large a table, it's much faster to just copy the rows you want to keep to a new table, run DROP TABLE on the old table, and rename the new table to the old table.

Dumb question but why does the optimizer not just do that in secret then? Seems like something that should be detectable with some heuristics.

▲

crazygringo 7 hours ago | parent | next [-]

Because what do you do if rows are being inserted in the original table, while the new table is having rows copied over? You'll get missing rows.

You can only do the DROP TABLE trick if you know nothing else is writing to the table at the same time. You know if that's the case, according to your business logic. The database has no idea.

The DROP TABLE trick effectively bypasses all the normal guarantees of data consistency. This is why it's so fast. But you have to know that that's a safe thing to do for your data.

▲

nostrademons 6 hours ago | parent [-]

There are ways the DB could recover the data consistency guarantees, eg. keeping a log of operations that came in while the table was being copied over and then applying the relevant ones afterwards.

The tricky part is that the latency characteristics of these operations would be pretty surprising and unintuitive. It has the same problems as virtual memory and mark/sweep GC; sometimes, depending on system state and things that other threads are doing, an unrelated operation might block for very long time periods and give you huge user-visible pauses. It's often better to force these expensive operations to be explicit so that the developer has to think through the latency & consistency implications and make the tradeoffs they want.

	▲	convolvatron 5 hours ago \| parent [-]
		except in this particular case, as long as you don't exhaust resources, mvcc kind of lets you get away with making a transactionally consistent copy under the covers without blocking anyone, since its a big-ol read.

▲

sgarland 7 hours ago | parent | prev | next [-]

I assume partly because that would be extremely surprising behavior, and depending on the RDBMS and version, could introduce unexpected stalls. For example, MySQL < 8.0.23 scans the entire buffer pool to clear pages that were dropped, which can take a long time on large instances. There is / was a similar issue with its adaptive hash index, which AFAIK wasn’t ever fixed, though AHI’s default being shifted to OFF in 8.4 is a workaround, in a very hacky way.

▲

Retr0id 7 hours ago | parent | prev | next [-]

Maintaining the expected observable behaviours would get complicated if queries (especially other updates) against the same table are happening concurrently.

▲

layer8 5 hours ago | parent | prev | next [-]

Because dropping a table effectively requires an exclusive lock on the table during that whole operation, affecting parallel transactions.

▲

7 hours ago | parent | prev | next [-]

[deleted]

▲

mordae 7 hours ago | parent | prev [-]

It drops dependents.

▲

coldtea 2 hours ago | parent | prev | next [-]

>There is nothing counterintuitive about this. It takes just as much work to delete a row as it takes to insert a row. Why wouldn't it?

Because e.g. DROP also effectively deletes the rows but takes way way less work.

▲

Retr0id 7 hours ago | parent | prev | next [-]

> if you are constantly inserting and deleting individual rows, DELETE scales the same as INSERT

Technically correct, but for a small table with a high churn rate, the performance characteristics may be surprising in that the "n" in most big-O calculations includes all inserts since the last VACUUM, not the actual number of resident rows.

▲

winterbloom 7 hours ago | parent | prev [-]

how does that solution work if the table that is dropped has foreign key constraints?

	▲	crazygringo 7 hours ago \| parent [-]
		That's why I said: > disabling foreign key checks on large operations And you have to know that, according to your business logic, what you're doing is safe.