| ▲ | TacticalCoder 16 hours ago |
| AIUI in that thread they're saying "0.51x" the perf on a 96-core arm64 machine, and they're also saying they cannot reproduce it on a 96-core amd64 machine. So it's not going to affect everybody who is both running PostgreSQL and upgrading to the latest kernel. The conditions seem to be: arm64, shitloads of cores, kernel 7.0, current version of PostgreSQL. That is not going to be 100% of the installed PostgreSQL DBs out there in the wild when 7.0 lands in a few weeks. |
|
| ▲ | torginus 10 hours ago | parent | next [-] |
| It's a huge issue with ARM-based systems that hardly anyone uses or tests things on them (in production). Yes, Macs going ARM has been a huge boon, but I've also seen crazy regressions on AWS Graviton (compared to how it's supposed to perform), on .NET (and Node as well), which frankly I have no expertise or time to dig into. That was the main reason we ultimately cancelled our migration. I'm sure this is the same reason why it's important to AWS. |
| |
| ▲ | p_l 7 hours ago | parent [-] | | Macs are actually part of the pain point with ARM64 Linux, because Linux arm64 server distros tend to use 64 kB pages while Apple hardware supports only 4 and 16 kB, and it causes non-trivial bugs at times (funnily enough, I first encountered that at a database company...) |
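The page-size mismatch described above is easy to check on any Linux box; a quick sketch (assumes nothing beyond a POSIX shell on Linux):

```shell
# Print the page size the running kernel uses. Linux arm64 distro
# kernels are often built for 64 kB pages, while Apple silicon
# hardware only supports 4 kB and 16 kB translation granules.
getconf PAGESIZE
```

On typical amd64 systems this prints 4096; on arm64 distro kernels it is often 65536.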
|
|
| ▲ | zamalek 13 hours ago | parent | prev | next [-] |
| It was later reproduced on the same machine without huge pages enabled. PICNIC? |
| |
| ▲ | anarazel 13 hours ago | parent [-] | | Yes, I did reproduce it (to a much smaller degree, but it's just a 48c/96t machine). But it's an absurd workload in an insane configuration. Not using huge pages hurts way more than the regression due to PREEMPT_LAZY does. With what we know so far, I expect that there are just about no real world workloads that aren't already completely falling over that will be affected. | | |
| ▲ | pgaddict 5 hours ago | parent [-] | | So why does it happen only with hugepages? Is the extra overhead / TLB pressure enough to trigger the issue in some way? Or is it because the regular pages get swapped out (which hugepages can't be)? | | |
| ▲ | anarazel 5 hours ago | parent [-] | | I don't fully know, but I suspect it's just that due to the minor faults and TLB misses there is terrible contention on the spinlock, regardless of PREEMPT_LAZY, when using 4k pages (that's easily reproducible). Which is then made worse by preempting more with the lock held. |
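The minor-fault side of that hypothesis is cheap to observe; a sketch reading the minflt counter (field 10 of /proc/&lt;pid&gt;/stat), shown here for the current shell rather than an actual PostgreSQL backend:

```shell
# Field 10 of /proc/<pid>/stat is minflt, the cumulative count of
# minor page faults for that process. With 4k pages this grows far
# faster than with huge pages for the same shared-memory footprint.
awk '{print "minor faults so far:", $10}' /proc/self/stat
```

Sampling this for a busy backend before and after a kernel upgrade gives a rough first signal before reaching for perf.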
|
|
|
|
| ▲ | MBCook 14 hours ago | parent | prev | next [-] |
| So perhaps this is a regression specifically in the arm64 code, or said differently maybe it’s a performance bug that has been there for a long time but covered up by the scheduler part that was removed? |
| |
| ▲ | adrian_b 10 hours ago | parent | next [-] | | The following messages concluded that using huge pages mitigates the regression, while not using huge pages reproduces it. | |
| ▲ | db48x 13 hours ago | parent | prev [-] | | Could be either of those, or something else entirely. Or even measurement error. | | |
| ▲ | jeltz 9 hours ago | parent [-] | | Turns out the amd machine had huge pages enabled, and after disabling those the regression was there on amd too. So arm vs amd was a red herring. Of course it's not a nice regression, but you should not run PostgreSQL on large servers without huge pages enabled, so this regression will only hurt people who have a bad configuration. That said, I think these bad configurations are common out there, especially in containerized environments where the one running PostgreSQL may not have the ability to enable huge pages. | | |
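For reference, checking (and, as root, reserving) explicit huge pages looks roughly like this; the 4096-page figure below is purely illustrative — in practice you size the reservation to shared_buffers:

```shell
# Show the current explicit huge page reservation and page size.
grep -E 'HugePages_(Total|Free)|Hugepagesize' /proc/meminfo
# As root, an admin would reserve pages, e.g. 4096 x 2 MB = 8 GB
# (left commented out here because it needs privileges):
#   sysctl -w vm.nr_hugepages=4096
# PostgreSQL then uses them with huge_pages = on in postgresql.conf;
# the default "try" silently falls back to regular 4k pages.
```

The silent fallback of `huge_pages = try` is exactly how the misconfiguration above goes unnoticed, especially in containers.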
| ▲ | whizzter 9 hours ago | parent | next [-] | | Still, a regression that huge affecting multiple platforms doesn't sound too neat; did they narrow down the root cause? | | | |
| ▲ | db48x 8 hours ago | parent | prev [-] | | Yes, I had a good laugh at that. It might technically be a regression, but not one that most people will see in practice. Pretty weird that someone at Amazon is bothering to run those tests without hugepages. | | |
| ▲ | scottlamb 3 hours ago | parent [-] | | I doubt they explicitly said "I'll run without huge pages, which is an important AWS configuration". They probably just forgot a step. And "someone at Amazon" describes a lot of people; multiply your mental probability tables accordingly. | | |
| ▲ | db48x an hour ago | parent [-] | | The number of people at Amazon is pretty much irrelevant; the org is going to ensure that someone is keeping an eye on kernel performance, but also that the work isn’t duplicative. Surely they would be testing the configuration(s) that they use in production? They’re not running RDS without hugepages turned on, right? |
|
|
|
|
|
|
| ▲ | master_crab 15 hours ago | parent | prev [-] |
| For production Postgres, I would assume it's close to almost no effect? If someone is running Postgres in a serious backend environment, I doubt they are using Ubuntu or even touching 7.x for months (or years). It'll be some flavor of Debian or Red Hat still on 6.x (maybe even 5?). Those same users won't touch 7.x until there have been months of testing by distros. |
| |
| ▲ | crcastle 15 hours ago | parent | next [-] | | Ubuntu is used in many serious backend environments. Heroku runs tens of thousands (if not more) instances of Ubuntu on its fleet. Or at least it did through the teens and early 2020s. https://devcenter.heroku.com/articles/stack | | |
| ▲ | rixed 11 hours ago | parent | next [-] | | There is serious as in "corporate-serious" and serious as in "engineer-serious". | | |
| ▲ | zbentley an hour ago | parent [-] | | I’ve seen more 5k+-core fleets running Ubuntu in prod than not, in my career. Industries include healthcare, US government, US government contractor, marketing, finance. |
| |
| ▲ | nine_k 15 hours ago | parent | prev [-] | | Do they upgrade to the new LTS the day it is released? | | |
| ▲ | sakjur 9 hours ago | parent | next [-] | | Ubuntu's upgrade tools wait until the .1 release for LTSes, so your typical installation would wait at least half a year. | |
| ▲ | crcastle 14 hours ago | parent | prev [-] | | Not historically. | | |
| ▲ | rvnx 14 hours ago | parent [-] | | and they are right. A lot of junior sysadmins believe that newer = better, but the reality is you: a) may get irreversible upgrades (e.g. a new underlying database structure)
b) may get permanently worse performance / regressions (e.g. iOS 26)
c) added instability
d) new security issues (litellm)
e) time wasted migrating / debugging
f) may need a rewrite of consumers / users of APIs / syscalls
g) potential new IP or licensing issues
etc. The few reasons to upgrade something are: a) new features provide genuine comfort or a performance upgrade (or... some revert)
b) there is an extremely critical security issue
c) you do not care about stability because reverting is uneventful and production impact is nil (e.g. Claude Code)
But 99% of the time: if it ain't broke, don't fix it. https://en.wikipedia.org/wiki/2024_CrowdStrike-related_IT_ou... | | |
| ▲ | miki123211 11 hours ago | parent | next [-] | | On the other hand, I suspect LLMs will dramatically decrease the window between a vulnerability being discovered and that vulnerability being exploited in the wild, especially for open-source projects. Even if the vulnerability itself is discovered through means other than an LLM, it's trivial to ask a SOTA model to "monitor all new commits to project X and decide which ones are likely patching an exploitable vulnerability, and then write a PoC." That's a lot easier than finding the vulnerability itself. I won't be surprised if update windows (for open source networked services) shrink to ~10 minutes within a year or two. It's going to be a brutal world. | |
| ▲ | mr_toad 7 hours ago | parent | prev | next [-] | | Too often I see IT departments use this as an excuse to only upgrade when they absolutely have to, usually with little to no testing in advance, which leaves them constantly being back-footed by incompatibility issues. The idea of advanced testing of new versions of software (that they’ll be forced to use eventually) never seems to occur, or they spend so much time fighting fires they never get around to it. | |
| ▲ | gjvc 9 hours ago | parent | prev [-] | | All fair points. On the other hand, as a general rule, isn't it important to stay on currently-supported versions of the software that you run? YMMV, but in my experience, projects like PostgreSQL which have been reliable tend to continue to be so. |
|
|
|
| |
| ▲ | pmontra 12 hours ago | parent | prev [-] | | A customer of mine is running on Ubuntu 22.04 and the plan is to upgrade to 26.04 in Q1 2027. We'll have to add performance regression to the plan. | | |
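A before/after benchmark run is one way to fold regression testing into such an upgrade plan; below is a minimal sketch, assuming pgbench is installed and a scratch database named "bench" already exists (both are assumptions, and the scale/client numbers are illustrative):

```shell
# Guarded so it only runs where pgbench is actually available.
if command -v pgbench >/dev/null 2>&1; then
  pgbench -i -s 100 bench            # initialize at scale factor 100
  pgbench -c 32 -j 32 -T 60 bench    # 32 clients/threads for 60s; compare TPS
else
  echo "pgbench not installed; skipping"
fi
```

Running the same invocation on the old and new kernel (with identical huge page settings) is what would have caught the regression discussed upthread.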
|