Mastodawn

crcastle 2d ago

AWS engineer reports PostgreSQL perf halved by Linux 7.0, fix may not be easy

https://lore.kernel.org/lkml/yr3inlzesdb45n6i6lpbimwr7b25kqk...

https://www.phoronix.com/news/Linux-7.0-AWS-PostgreSQL-Drop

Re: [PATCH 0/1] sched: Restore PREEMPT_NONE as default - Andres Freund

lfittl

It's worth reading this follow-up LKML post by Andres Freund (who works on Postgres): https://lore.kernel.org/lkml/yr3inlzesdb45n6i6lpbimwr7b25kqk...

jeffbee 2d ago

Funny how "use hugepages" is right there on the table and 99% of users ignore it.

bombcar 2d ago

I’m absolutely flabbergasted by the performance left on the table; even by myself - just yesterday I learned Gentoo’s emerge can use git and be a billion times faster.

TacticalCoder 2d ago

AIUI in that thread they're saying "0.51x" the perf on a 96-core arm64 machine and they're also saying they cannot reproduce it on a 96-core amd64 machine.

So it's not going to affect everybody who runs PostgreSQL and upgrades to the latest kernel. The conditions seem to be: arm64, shitloads of cores, kernel 7.0, current version of PostgreSQL.

That is not going to be 100% of the installed PostgreSQL DBs out there in the wild when 7.0 lands in a few weeks.

master_crab 2d ago

For production Postgres, I would assume it has almost no effect?

If someone is running Postgres in a serious backend environment, I doubt they are using Ubuntu or even touching 7.x for months (or years). It’ll be some flavor of Debian or Red Hat still on 6.x (maybe even 5?). Those same users won’t touch 7.x until there have been months of testing by distros.

crcastle 2d ago

Ubuntu is used in many serious backend environments. Heroku runs tens of thousands (if not more) instances of Ubuntu on its fleet. Or at least it did through the teens and early 2020s.

https://devcenter.heroku.com/articles/stack

Stacks | Heroku Dev Center

A Heroku stack is a build and deployment environment, maintained by Heroku to simplify devops.

nine_k 2d ago

Do they upgrade to the new LTS the day it is released?

Not historically.

And they are right. Eager upgrades happen because a lot of junior sysadmins believe that newer = better.

But the reality is that an upgrade can bring:

  a) irreversible changes (e.g. a new underlying database structure)
  b) permanently worse performance / regressions (e.g. iOS 26)
  c) added instability
  d) new security issues (litellm)
  e) time wasted migrating / debugging
  f) a need to rewrite consumers / users of APIs / syscalls
  g) potential new IP or licensing issues

etc.

The few reasons to upgrade something are:

  a) new features provide genuine comfort or performance upgrade (or... some revert)
  b) there is an extremely critical security issue
  c) you do not care about stability because reverting is uneventful and production impact is nil (e.g. Claude Code)

But 99% of the time, if it ain't broke, don't fix it.

https://en.wikipedia.org/wiki/2024_CrowdStrike-related_IT_ou...

2024 CrowdStrike-related IT outages - Wikipedia

pmontra 1d ago

A customer of mine is running on Ubuntu 22.04 and the plan is to upgrade to 26.04 in Q1 2027. We'll have to add performance regression to the plan.

MBCook 2d ago

So perhaps this is a regression specifically in the arm64 code, or said differently maybe it’s a performance bug that has been there for a long time but covered up by the scheduler part that was removed?

zamalek 1d ago

It was later reproduced on the same machine without huge pages enabled. PICNIC?

anarazel 1d ago

Yes, I did reproduce it (to a much smaller degree, but it's just a 48c/96t machine). But it's an absurd workload in an insane configuration. Not using huge pages hurts way more than the regression due to PREEMPT_LAZY does.

With what we know so far, I expect that there are just about no real world workloads that aren't already completely falling over that will be affected.
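
For anyone wanting to check the huge pages point on their own box, here is a minimal sketch (not from the LKML thread). It assumes a Linux host and reads the standard `HugePages_Total`, `HugePages_Free` and `Hugepagesize` fields from `/proc/meminfo`; the function names `parse_hugepages` and `hugepages_reserved` are made up for illustration. It only tells you whether the kernel has an explicit huge page pool reserved, not whether PostgreSQL is actually using it (that also depends on its `huge_pages` setting).

```python
def parse_hugepages(meminfo_text):
    """Return huge-page counters from /proc/meminfo-style text as a dict of ints."""
    wanted = ("HugePages_Total", "HugePages_Free", "Hugepagesize")
    stats = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and key.strip() in wanted and fields:
            # The value is the first token; Hugepagesize is followed by a "kB" unit.
            stats[key.strip()] = int(fields[0])
    return stats


def hugepages_reserved(meminfo_text):
    """True if the text reports a non-zero pool of explicit huge pages."""
    return parse_hugepages(meminfo_text).get("HugePages_Total", 0) > 0
```

On a Linux machine you would pass in the contents of `/proc/meminfo`, e.g. `hugepages_reserved(open("/proc/meminfo").read())`. If that returns False, nothing has been reserved via `vm.nr_hugepages`, so a PostgreSQL left at the default `huge_pages = try` would most likely be falling back to regular pages.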

justinclift 2d ago

Note that it's not just a single post; there's further information in the rest of the thread. :)

aftbit 2d ago

> If this somehow does end up being a reproducible performance issue (I still suspect something more complicated is going on), I don't see how userspace could be expected to mitigate a substantial perf regression in 7.0 that can only be mitigated by a default-off non-trivial functionality also introduced in 7.0.