> And yes, it's a well-known trick for all major relational databases (not just Postgres) that if you want to delete 90% of rows from a large table, it's much faster to just copy the rows you want to keep to a new table, run DROP TABLE on the old table, and rename the new table to the old table.

Dumb question but why does the optimizer not just do that in secret then? Seems like something that should be detectable with some heuristics.

▲

crazygringo 5 hours ago | parent | next [-]

Because what do you do if rows are being inserted in the original table, while the new table is having rows copied over? You'll get missing rows.

You can only do the DROP TABLE trick if you know nothing else is writing to the table at the same time. You know if that's the case, according to your business logic. The database has no idea.

The DROP TABLE trick effectively bypasses all the normal guarantees of data consistency. This is why it's so fast. But you have to know that that's a safe thing to do for your data.
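
To make the failure mode concrete, here's a minimal sketch using Python's built-in `sqlite3` as a stand-in (not Postgres, and a single connection, so the "concurrent" write is just simulated by running it between the copy and the swap). The `events` table, the `keep` column, and all the function names are made up for illustration.

```python
import sqlite3


def make_events(conn, n_rows=10):
    # Hypothetical table: only rows with keep = 1 should survive the purge.
    conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, keep INTEGER)")
    conn.executemany(
        "INSERT INTO events (id, keep) VALUES (?, ?)",
        [(i, 1 if i % 10 == 0 else 0) for i in range(1, n_rows + 1)],
    )


def copy_kept_rows(conn):
    # Step 1: copy only the rows we want to keep into a new table.
    conn.execute("CREATE TABLE events_new AS SELECT * FROM events WHERE keep = 1")


def swap_tables(conn):
    # Steps 2 and 3: drop the old table, then rename the new one into place.
    conn.execute("DROP TABLE events")
    conn.execute("ALTER TABLE events_new RENAME TO events")


def row_ids(conn):
    return [r[0] for r in conn.execute("SELECT id FROM events ORDER BY id")]


def demo(write_during_copy):
    # isolation_level=None puts the connection in autocommit mode.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    make_events(conn)
    copy_kept_rows(conn)
    if write_during_copy:
        # Simulated writer: lands in the old table after the copy finished.
        conn.execute("INSERT INTO events (id, keep) VALUES (11, 1)")
    swap_tables(conn)
    return row_ids(conn)
```

Both `demo(False)` and `demo(True)` end up with only row 10. In the second case row 11 was a perfectly valid insert, and it disappears along with the old table.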

▲

nostrademons 4 hours ago | parent [-]

There are ways the DB could recover the data consistency guarantees, e.g. keeping a log of operations that came in while the table was being copied over and then applying the relevant ones afterwards.

The tricky part is that the latency characteristics of these operations would be pretty surprising and unintuitive. It has the same problems as virtual memory and mark/sweep GC; sometimes, depending on system state and things that other threads are doing, an unrelated operation might block for very long time periods and give you huge user-visible pauses. It's often better to force these expensive operations to be explicit so that the developer has to think through the latency & consistency implications and make the tradeoffs they want.
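
A rough sketch of the log-and-replay idea, in plain Python with a dict standing in for the table. Everything here (`LoggedTable`, `rebuild_keeping`, the `concurrent_writes` parameter) is a made-up toy, not how any real engine does it, and it assumes writes that arrive during the copy should all be applied as-is on top of the kept rows.

```python
class LoggedTable:
    """Toy key/value table that can record writes while a rebuild is running."""

    def __init__(self, rows):
        self.rows = dict(rows)
        self.log = None  # None means "not capturing"

    def insert(self, key, value):
        self.rows[key] = value
        if self.log is not None:
            self.log.append(("insert", key, value))

    def delete(self, key):
        self.rows.pop(key, None)
        if self.log is not None:
            self.log.append(("delete", key, None))


def rebuild_keeping(table, keep, concurrent_writes=()):
    # Start capturing before the copy so no write can slip through the gap.
    table.log = []
    # The "copy": only the rows we want to keep.
    kept = {k: v for k, v in table.rows.items() if keep(v)}
    # Simulated writers that show up while the copy is in flight.
    for write in concurrent_writes:
        write(table)
    # Replay the captured writes onto the copy. Replay by key is idempotent,
    # so a write seen by both the copy and the log does no harm.
    for kind, key, value in table.log:
        if kind == "insert":
            kept[key] = value
        else:
            kept.pop(key, None)
    table.log = None
    table.rows = kept  # the "rename" step
    return table.rows
```

Note that the replay loop is exactly where the surprising latency comes from: its cost depends on how many writes happened to arrive during the copy, not on anything the caller asked for.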

▲

convolvatron 3 hours ago | parent [-]

Except in this particular case, as long as you don't exhaust resources, MVCC kind of lets you get away with making a transactionally consistent copy under the covers without blocking anyone, since it's a big ol' read.
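
For anyone who hasn't looked at how that works, here's a toy sketch of snapshot reads over versioned rows. It's loosely in the spirit of created/deleted transaction ids, but `VersionedTable` and its methods are invented for illustration and leave out nearly everything a real engine does (aborts, vacuuming, concurrent transaction visibility).

```python
class VersionedTable:
    """Toy MVCC store: writes add versions, readers pick a snapshot txid."""

    def __init__(self):
        self.txid = 0
        # Each version: [key, value, created_txid, deleted_txid or None]
        self.versions = []

    def insert(self, key, value):
        self.txid += 1
        self.versions.append([key, value, self.txid, None])

    def delete(self, key):
        self.txid += 1
        for version in self.versions:
            if version[0] == key and version[3] is None:
                # Mark the version dead instead of removing it, so older
                # snapshots can still see it.
                version[3] = self.txid

    def snapshot(self):
        return self.txid

    def read(self, snap):
        # A version is visible if it was created at or before the snapshot
        # and not deleted at or before it.
        return {
            key: value
            for key, value, created, deleted in self.versions
            if created <= snap and (deleted is None or deleted > snap)
        }
```

A long-running copy that reads at a fixed `snap` sees a stable view no matter what writers do in the meantime, and the writers never wait on it. What the snapshot doesn't give you is the writes that landed after it, which still have to reach the new table somehow.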

▲

sgarland 5 hours ago | parent | prev | next [-]

I assume partly because that would be extremely surprising behavior, and depending on the RDBMS and version, could introduce unexpected stalls. For example, MySQL < 8.0.23 scans the entire buffer pool to evict the dropped table's pages, which can take a long time on large instances. There is / was a similar issue with its adaptive hash index, which AFAIK wasn’t ever fixed, though AHI’s default being shifted to OFF in 8.4 is a workaround, in a very hacky way.

▲

Retr0id 5 hours ago | parent | prev | next [-]

Maintaining the expected observable behaviours would get complicated if queries (especially other updates) against the same table are happening concurrently.

▲

layer8 3 hours ago | parent | prev | next [-]

Because dropping a table effectively requires an exclusive lock on the table during that whole operation, affecting parallel transactions.

▲

mordae 5 hours ago | parent | prev [-]

It drops dependents.
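
One way to read this, sketched with Python's built-in `sqlite3` as a stand-in (the exact set of dependents and the CASCADE rules differ per database): the copy made with `CREATE TABLE ... AS SELECT` carries the data but not the indexes, and dropping the old table takes its indexes with it. The `users` table, the index name, and the helper functions are made up for illustration.

```python
import sqlite3


def index_names(conn, table):
    # sqlite_master lists schema objects; tbl_name is the table an index is on.
    rows = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'index' AND tbl_name = ? ORDER BY name",
        (table,),
    )
    return [r[0] for r in rows]


def rebuild(conn):
    # The copy / drop / rename trick from the top of the thread.
    conn.execute("CREATE TABLE users_new AS SELECT * FROM users WHERE active = 1")
    conn.execute("DROP TABLE users")
    conn.execute("ALTER TABLE users_new RENAME TO users")


def demo():
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute(
        "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, active INTEGER)"
    )
    conn.execute("CREATE INDEX users_email_idx ON users (email)")
    before = index_names(conn, "users")
    rebuild(conn)
    after = index_names(conn, "users")
    return before, after
```

`demo()` returns `(['users_email_idx'], [])`: the index is gone after the swap and nothing recreated it. An optimizer doing this behind your back would have to rebuild every dependent object too.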