Remix.run Logo
abelanger a day ago

I'm not sure of the exact threshold, but the pathological case seemed to be (1) many tasks in the backlog, (2) many workers, (3) workers long-polling the task tables at approximately the same time. This would consistently lead to very high spikes in CPU and result in a runaway deterioration on the database, since high CPU leads to slower queries and more contention, which leads to higher connection overhead, which leads to higher CPU, and so on. There are a few threads online which documented very similar behavior, for example: https://postgrespro.com/list/thread-id/2505440.

Those other points are mostly unrelated to the core queue, and more related to helper tables for monitoring, tracking task statuses, etc. But it was important to optimize these tables because unrelated spikes on other tables in the database could start getting us into a deteriorated state as well.

To be more specific about the solutions here:

> buffered reads and writes

To run a task through the system, we need to write the task itself, write the instance of that retry of the count to the queue, write an event that the task has been queued, started, completed | failed, etc. Generally one task will correspond to many writes along the way, not all of which need to be extremely latency sensitive. So we started buffering items coming from our internal queues and flushing them once every 10ms, which helped considerably.

> switching all high-volume tables to use identity columns

We originally had combined some of our workflow tables with our monitoring tables -- this table was called `WorkflowRun` and it was used for both concurrency queues and queried when serving the API. This table used a UUID as the primary key, because we wanted UUIDs over the API instead of auto-incrementing IDs. The UUIDs caused some headaches down the line when trying to delete batches of data and prevent index bloat.

chaz6 9 hours ago | parent | next [-]

Out of interest, did you try changing the value of commit_delay? This parameter allows multiple transactions to be written together under heavy load.

nyrikki 6 hours ago | parent [-]

IMHO, with this type of issue is often more likely blowing through the multixact cache or the query planner reverting to SEQSCAN due to the number of locks or mxact id exaustion etc.. It is most likely not a WAL flush problem that commit_delay would help with.

From the above link:[1]

> I found that performing extremely frequent vacuum analyze (every 30 minutes) helps a small amount but this is not that helpful so problems are still very apparent.

> The queue table itself fits in RAM (with 2M hugepages) and during the wait, all the performance counters drop to almost 0 - no disk read or write (semi-expected due to the table fitting in memory) with 100% buffer hit rate in pg_top and row read around 100/s which is much smaller than expected.

Bullet points 2 and 3 from here [2] are what first came to mind, due to the 100% buffer hit rate.

Note that vacuuming every 30min provided "minor improvements" but the worst case of:

     25000 tps * 60sec *30min * 250rows == 11,250,000,000 ID's (assuming worst case every client locking conflicting rows)
Even: 25000tps 60sec 30min

Is only two orders of magnitude away from blowing through the 32bit transaction ID's.

    45,000,000
    4,294,967,296
But XID exhaustion is not as hidden as the MXID exhaustion and will block all writes, while the harder to see MXID exhaustion will only block some writes.

IMHO, if I was writing this, and knowing that you are writing an orchestration platform, getting rid of the long term transactions with just a status column would be better, row level locks are writing to the row anyways, actually twice.

    tuple lock -> write row lock to xmax column -> release tuple lock.
Long lived transactions are always problematic for scaling, and that status column would allow for more recovery options etc...

But to be honest, popping off the left of a red black tree like the linux scheduler does is probably so much better than fighting this IMHO.

This opinion is assuming I am reading this right from the linked to issue [1]

> SELECT FOR UPDATE SKIP LOCKED executes and the select processes wait for multiple minutes (10-20 minutes) before completing

There is a undocumented command pg_get_multixact_members() [3] that can help troubleshoot as many people are using hosted Postgres, the tools too look into the above problems can be limited.

It does appear that Amazon documents a bit about the above here [4].

[1] https://postgrespro.com/list/thread-id/2505440 [2] https://www.postgresql.org/docs/current/routine-vacuuming.ht... [3] https://doxygen.postgresql.org/multixact_8c.html#adf3c97f22b... [4] https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide...

diarrhea 18 hours ago | parent | prev [-]

Thank you! Very insightful, especially the forum link and the observation around UUIDs bloating indexes.