Remix.run Logo
msdrigg 11 hours ago

The most interesting thing for me in this article was the mention of `MERGE` almost in passing at the end.

> I'm not a big fan of using the constraint names in SQL, so to overcome both limitations I'd use MERGE instead:

``` db=# MERGE INTO urls t USING (VALUES (1000004, 'https://hakibenita.com')) AS s(id, url) ON t.url = s.url WHEN MATCHED THEN UPDATE SET id = s.id WHEN NOT MATCHED THEN INSERT (id, url) VALUES (s.id, s.url); MERGE 1 ```

I use `insert ... on conflict do update ...` all the time to handle upserts, but it seems like merge may be more powerful and able to work in more scenarios. I hadn't heard of it before.

gshulegaard 10 hours ago | parent | next [-]

IIRC `MERGE` has been part of SQL for a while, but Postgres opted against adding it for many years because it's syntax is inherently non-atomic within Postgres's MVCC model.

https://pganalyze.com/blog/5mins-postgres-15-merge-vs-insert...

This is somewhat a personal preference, but I would just use `INSERT ... ON CONFLICT` and design my data model around it as much as I can. If I absolutely need the more general features of `MERGE` and _can't_ design an alternative using `INSERT ... ON CONLFICT` then I would take a bit of extra time to ensure I handle `MERGE` edge cases (failures) gracefully.

kbolino 8 hours ago | parent | next [-]

It's kinda hard to handle MERGE failures gracefully. You generally expect the whole thing to succeed, and the syntax deceptively makes it seem like you can handle all the cases. But because of MVCC, you get these TOCTOU-style spurious constraint violations, yet there's no way to address them on a per-row basis, leading to the entire statement rolling back even for the rows that had no issues. If you are designing for concurrent OLTP workloads against the table, you should probably just avoid MERGE altogether. It's more useful for one-off manual fixups.

da_chicken 23 minutes ago | parent [-]

I'm not sure why you'd expect partial updates of a single statement in the first place. I mean, if I run `UPDATE Account SET Status = 'Closed' WHERE LastAccess < NOW() - INTERVAL '90 days';`, I'm not going to be happy if there's 50 records that match, the DB updates 30 successfully, and then error on 20. Atomic isn't just about rows. Do all the work or none of it.

If you're experiencing things that smell like TOCTOU, first you need to be sure you don't have oddball many-to-one issues going on (i.e., a cardinality violation error), and then you're going to have to increase your transaction isolation level to eliminate non-repeatable reads and phantom reads.

Like, the alternative to a MERGE is writing a few UPDATE statements followed by an INSERT and wrapping the entire batch in a transaction. And you should likely still wrap the whole thing in a transaction. If it breaks, you just replay the whole thing. Re-run the whole job.

awesome_dude 8 hours ago | parent | prev [-]

That reference - my initial gut feeling was that `MERGE` felt more readable, but then I read this paragraph

> If you want the generality of MERGE, you have to accept the fact that you might get unique constraint violations, when there are concurrent inserts, versus with INSERT ON CONFLICT, the way it's designed with its speculative insertions, guarantees that you either get an INSERT or an UPDATE and that is true even if there are concurrent inserts. You might want to choose INSERT ON CONFLICT if you need the guarantee.

Basically, `MERGE` is susceptible to a concurrent process also writing `INSERT` where that `INSERT` and `MERGE` are unaware of one another, causing a duplicate value to be used.

philjohn 10 hours ago | parent | prev [-]

If you're doing large batch inserts, I've found using the COPY INTO the fastest way, especially if you use the binary data format so there's no overhead on the postgres server side.

sirfz 9 hours ago | parent [-]

That doesn't work well with conflicts tho iirc