munk-a 6 hours ago

We've leveraged the atomicity of transactions, with a fail-safe approach to external service interactions, for client email sending. This could certainly be done with a formal queue, though it'd operate very similarly and achieve the same guarantees we have today (ours was built when we were too small to justify such an infra spend). Internally we have jobs that execute complex logic to transform data from a pending state to a computed state. They lean on the DB's atomicity to guarantee that the data is successfully transitioned, and those tasks are all incredibly resilient. But when a secondary persistence store is involved, transactional guarantees need to be compromised in some manner. In our email sending example, we hold that guaranteeing a client receives every notification matters more than guaranteeing each notification is sent exactly once. So our sending mechanism confirms that the email was sent successfully and only then closes a transaction that removes the message from the pending list. If we die between the send and the commit, the message goes out again on the next run rather than being dropped.
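For anyone who wants to see the shape of that, here's a minimal sketch of the send-then-commit ordering using Python's built-in `sqlite3`. It's not our actual code: the `pending_email` table, its columns, and the `send` callable are all made up for illustration, and a real system would add retry limits and locking for concurrent workers.

```python
import sqlite3


def make_db(rows):
    """Create an in-memory DB with a hypothetical pending_email table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE pending_email ("
        "id INTEGER PRIMARY KEY, recipient TEXT, body TEXT)"
    )
    with conn:  # commits the inserts as one transaction
        conn.executemany(
            "INSERT INTO pending_email (recipient, body) VALUES (?, ?)", rows
        )
    return conn


def drain_pending(conn, send):
    """Send each pending email; delete its row only after send() succeeds.

    `send` is an assumed callable (recipient, body) that raises on failure.
    Returns the number of messages removed from the pending list.
    """
    rows = conn.execute(
        "SELECT id, recipient, body FROM pending_email ORDER BY id"
    ).fetchall()
    removed = 0
    for msg_id, recipient, body in rows:
        try:
            send(recipient, body)  # irreversible external action
        except Exception:
            continue  # row stays pending and is retried on the next run
        # A crash right here means a duplicate send later, never a lost one.
        with conn:  # commits on success, rolls back on error
            conn.execute("DELETE FROM pending_email WHERE id = ?", (msg_id,))
        removed += 1
    return removed
```

The trade-off is visible in the one comment between `send` and the `DELETE`: that gap is the only place a duplicate can come from, so the goal is to keep it as small as possible.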

There will always be a window for potential loss due to solar flares/whatever, but the key in designing a system like this is to make sure you're aware of how the system can fail, accept that outcome, and then work to shrink, as much as possible, the distance in cycles/logic between each persistence commit. Logic should be front-loaded to do as much prep work as possible before any irreversible action happens. Those irreversible actions should then be ordered to your preference and dispatched as quickly and cheaply as possible in a safe manner.
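A rough sketch of what front-loading looks like, again with hypothetical names (`render`, `send`, and `mark_sent` are placeholders, not anything from our codebase): everything that can fail cheaply runs before the first irreversible call, and the irreversible loop does nothing but send and commit.

```python
def dispatch(messages, render, send, mark_sent):
    """Prepare everything first, then run only the irreversible steps.

    render(m)    -> payload; fallible but side-effect free (assumed)
    send(p)      -> irreversible external action (assumed)
    mark_sent(m) -> the persistence commit for that message (assumed)
    """
    # Phase 1: all fallible, reversible prep. A failure here aborts
    # before anything has been sent.
    prepared = [(m, render(m)) for m in messages]

    # Phase 2: irreversible actions, with as little logic as possible
    # between each send and its commit.
    done = []
    for message, payload in prepared:
        send(payload)
        mark_sent(message)
        done.append(message)
    return done
```

If `render` blows up on the last message, nothing has gone out yet, so the whole batch can be retried without producing duplicates.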