abelanger a day ago
I'm not sure of the exact threshold, but the pathological case seemed to be (1) many tasks in the backlog, (2) many workers, (3) workers long-polling the task tables at approximately the same time. This would consistently lead to very high CPU spikes and a runaway deterioration of the database: high CPU leads to slower queries and more contention, which leads to higher connection overhead, which leads to higher CPU, and so on. There are a few threads online which document very similar behavior, for example: https://postgrespro.com/list/thread-id/2505440.

Those other points are mostly unrelated to the core queue, and more related to helper tables for monitoring, tracking task statuses, etc. But it was important to optimize these tables, because unrelated spikes on other tables in the database could start getting us into a deteriorated state as well.

To be more specific about the solutions here:

> buffered reads and writes

To run a task through the system, we need to write the task itself, write the instance of that retry of the task to the queue, and write an event each time the task is queued, started, completed | failed, etc. Generally one task corresponds to many writes along the way, not all of which need to be extremely latency sensitive. So we started buffering items coming from our internal queues and flushing them once every 10ms, which helped considerably.

> switching all high-volume tables to use identity columns

We had originally combined some of our workflow tables with our monitoring tables. This table was called `WorkflowRun`, and it was both used for concurrency queues and queried when serving the API. It used a UUID as the primary key, because we wanted to expose UUIDs over the API instead of auto-incrementing IDs. The UUIDs caused some headaches down the line when trying to delete batches of data and prevent index bloat.
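To make the buffering idea above concrete, here is a minimal sketch of a time-based write buffer. This is not our actual implementation, just an illustration under assumptions: `BufferedWriter`, `flush_fn`, `interval_s` and `max_items` are hypothetical names, and `flush_fn` stands in for whatever performs the single batched write (for example a multi-row INSERT).

```python
import threading
from typing import Callable, Generic, List, TypeVar

T = TypeVar("T")


class BufferedWriter(Generic[T]):
    """Collects items in memory and hands them to flush_fn in batches.

    Hypothetical helper for illustration: a batch is flushed every
    interval_s seconds, or as soon as max_items items have accumulated.
    """

    def __init__(
        self,
        flush_fn: Callable[[List[T]], None],
        interval_s: float = 0.010,  # 10ms, the interval mentioned above
        max_items: int = 1000,      # assumed size cap, not from the thread
    ) -> None:
        self._flush_fn = flush_fn
        self._interval_s = interval_s
        self._max_items = max_items
        self._items: List[T] = []
        self._lock = threading.Lock()        # guards self._items
        self._flush_lock = threading.Lock()  # serializes calls to flush_fn
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def add(self, item: T) -> None:
        """Buffer one item; flush right away if the buffer is full."""
        with self._lock:
            self._items.append(item)
            full = len(self._items) >= self._max_items
        if full:
            self.flush()

    def flush(self) -> None:
        """Swap out the current buffer and write it as one batch."""
        with self._flush_lock:
            with self._lock:
                batch, self._items = self._items, []
            if batch:
                self._flush_fn(batch)

    def _run(self) -> None:
        # Event.wait returns False on timeout and True once close() is called.
        while not self._stop.wait(self._interval_s):
            self.flush()

    def close(self) -> None:
        """Stop the background thread and flush whatever is left."""
        self._stop.set()
        self._thread.join()
        self.flush()
```

A caller would create one writer per kind of write, call `add()` from the hot path, and call `close()` on shutdown. The tradeoff is up to one interval of extra latency per item in exchange for far fewer round trips, which is why it only suits writes that are not latency sensitive.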
chaz6 9 hours ago
Out of interest, did you try changing the value of commit_delay? This parameter adds a short delay before a WAL flush, so that multiple transactions can be committed together under heavy load.
diarrhea 18 hours ago
Thank you! Very insightful, especially the forum link and the observation around UUIDs bloating indexes.