software is amazing... my otherwise-idle 128 GB machine just started swapping and then invoked the OOM killer to nuke a Debug + code coverage LLVM build
@regehr how much parallelism?
@whitequark @regehr build system which gracefully handles memory pressure any day now!
@dotstdy @regehr is this even a build system's job? I can see both answers being appropriate, but if "yes" then you'll have to deal with a lot of pain making a cross-platform one
@whitequark @dotstdy @regehr ninja has a concept of "pools" for exactly this sort of thing: limiting the number of concurrent linker invocations: https://ninja-build.org/manual.html#ref_pool
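For reference, the mechanism from the manual looks like this (the depth value and rule are illustrative, not from any real build):

```ninja
# A pool named link_pool allows at most 2 of its jobs to run at once,
# independent of the global -j parallelism.
pool link_pool
  depth = 2

rule link
  command = c++ $in -o $out
  pool = link_pool
```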


@tedmielczarek @whitequark @regehr yeah, but that's very much the "problem is the user's own" approach, since the value for every concurrency limit needs to change depending on the available hardware resources.
@dotstdy @tedmielczarek @regehr but we don't have a way to predict how much memory a linker invocation will consume, do we? in which case that's sort of what we're stuck with, since any solution must necessarily be reactive
@whitequark @tedmielczarek @regehr I think tuning it the other way (i.e. "this invocation requires 2 GB") and having the job server track a budget, delaying launches (without introducing deadlocks), would let you scale to arbitrary hardware so long as the annotations are somewhat accurate. But yeah, it's fraught with complexity. See also: some kind of back-off recovery for extreme pressure, since the work should ideally be possible to re-launch.
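A minimal sketch of the budget-tracking idea, assuming jobs declare a (rough) memory estimate up front. All names here are mine, not any real build system's API; the deadlock-avoidance trick is simply capping any single estimate at the total budget so an oversized job waits for an idle system instead of waiting forever.

```python
import threading

class MemoryBudget:
    """Admit jobs only while their declared memory estimates fit in a
    global budget; callers block until enough budget is released."""

    def __init__(self, total_mb):
        self.total = total_mb
        self.used = 0
        self.cond = threading.Condition()

    def acquire(self, estimate_mb):
        # Cap oversized estimates at the whole budget, so a job bigger
        # than the machine still runs (alone) rather than deadlocking.
        need = min(estimate_mb, self.total)
        with self.cond:
            self.cond.wait_for(lambda: self.used + need <= self.total)
            self.used += need
            return need  # caller passes this back to release()

    def release(self, granted_mb):
        with self.cond:
            self.used -= granted_mb
            self.cond.notify_all()
```

A scheduler thread would call `acquire()` before spawning each compiler or linker and `release()` when it exits; the scheme is only as good as the annotations, which is exactly the objection raised below.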
@dotstdy @tedmielczarek @regehr I don't think there's any way to make the annotations somewhat accurate

@dotstdy @tedmielczarek @regehr I think what would really help is if a build system had the knobs to suspend a process if it consumes too much memory. ld eats more than 2 GB? pause it, let other processes finish, then let it restart

I don't know if that's feasible
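The pause half, at least, is easy to sketch; this is a hypothetical build-runner fragment, not any existing tool. SIGSTOP/SIGCONT work on any POSIX process without its cooperation (they cannot be caught or blocked), but note that stopping a process only takes it off the CPU; its pages stay resident unless the kernel reclaims them under pressure.

```python
import os
import signal
import subprocess
import sys

# Stand-in for a linker that got too big (just sleeps).
job = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])

os.kill(job.pid, signal.SIGSTOP)   # pause: the scheduler stops running it
# ... let other jobs finish, wait for memory pressure to drop ...
os.kill(job.pid, signal.SIGCONT)   # resume exactly where it stopped

job.terminate()
job.wait()
```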

@whitequark @dotstdy @tedmielczarek it doesn't seem hard (deadlocks notwithstanding)
@regehr @dotstdy @tedmielczarek how would you do it? SIGSTOP?
@whitequark @dotstdy @tedmielczarek any convenient mechanism would be fine for pausing the process, and then I guess a new syscall to push the entire process out of RAM? I mean, most Unixes have historically been able to do that, but I don't know that Linux currently can...

@regehr @whitequark @dotstdy @tedmielczarek this sounds a lot like memory.high in cgroups, where processes that exceed chosen usage get put under reclaim pressure and receive less CPU.

I'm not familiar with how it's actually done, but it feels like an oomd equivalent managing a cgroup (containing subgroups for the parallel build tasks) could both monitor memory use and throttle or suspend tasks as needed
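The knob in question is small enough to show. A hand-rolled sketch of the cgroup v2 interface (the path and the 2G figure are illustrative; writing here needs root or a delegated hierarchy):

```shell
mkdir /sys/fs/cgroup/build
echo 2G > /sys/fs/cgroup/build/memory.high    # soft limit: reclaim + CPU throttling above this
echo "$$" > /sys/fs/cgroup/build/cgroup.procs # move this shell (and future children) in
```

Unlike memory.max, exceeding memory.high doesn't OOM-kill anything; the group is just pushed into reclaim and slowed down.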

@pmarheine @regehr @dotstdy @tedmielczarek I think cgroups may be the right tool to deal with this on Linux, but I have two concerns:

  • Linux is not the only operating system people run builds on
  • even on Linux, running 16 linkers in parallel can very quickly eat up a lot of memory in a way that merely adding reclaim pressure and reducing CPU quota will not solve
@whitequark @pmarheine @regehr @dotstdy @tedmielczarek You can definitely limit a cgroup's memory usage and force just that cgroup to OOM if it collectively runs out (e.g. your 16 parallel linkers in a single cgroup). On a systemd-based Linux this can be scripted, so you could run a build in an environment where it could only use a maximum of X amount of system RAM, although you'd have to decide on X in advance, and if other bits used an unexpected amount of RAM you could still blow up.
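A sketch of that scripting, using systemd's transient-scope mechanism (the limits and -j value are illustrative):

```shell
# Run the whole build in a transient cgroup capped at 32 GiB; the cgroup
# OOM-kills the build's processes if they collectively exceed MemoryMax,
# and throttles/reclaims above MemoryHigh.
systemd-run --user --scope -p MemoryMax=32G -p MemoryHigh=28G make -j16
```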
@cks @pmarheine @regehr @dotstdy @tedmielczarek but I don't want them to collectively OOM, I want them to serialize themselves

@whitequark @pmarheine @regehr @dotstdy @tedmielczarek It looks like there's no good support for this in systemd/cgroup right now. Systemd has an API for getting memory pressure information¹, but I don't know of any command line programs you can run to watch this and take action for subordinate processes. Cgroup(v2) can freeze an entire cgroup² by writing to cgroup.freeze, but again one would have to write tooling around this.

¹ https://systemd.io/MEMORY_PRESSURE/
² https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html
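The raw interface from ² is small enough that the missing tooling would mostly be deciding *when* to pull the lever; a hand-rolled sketch (path illustrative, requires a delegated hierarchy):

```shell
# Freeze every process in the subtree: they stop consuming CPU but stay
# in memory; cgroup.freeze reads back 1 once the group is fully frozen.
echo 1 > /sys/fs/cgroup/build/jobs/cgroup.freeze
# ... memory pressure subsides ...
echo 0 > /sys/fs/cgroup/build/jobs/cgroup.freeze   # thaw
```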

@whitequark @pmarheine @regehr @dotstdy @tedmielczarek I was hoping systemd would have a 'freeze this unit if there is memory pressure' or similar setting, but AFAIK there's nothing for that.

(You can make something not start if there's too much memory pressure, but that doesn't help for 'I started 16 linkers when the memory pressure was low and now they're all making the memory pressure high'.)

Oh well. Systemd doesn't do everything.

@cks @whitequark @pmarheine @regehr @dotstdy the classic GNU Make approach: "don't start any new jobs if the system load average is above this number": https://www.gnu.org/software/make/manual/html_node/Options-Summary.html#index-_002d_002dload_002daverage-1
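That flag in practice (the thresholds are whatever suits the machine):

```shell
# Start no new jobs while the 1-minute load average is 8.0 or higher,
# even if fewer than 16 jobs are currently running.
make -j16 --load-average=8
```

It's reactive and proxies memory via load, so it only mitigates the "16 linkers at once" problem rather than solving it, but it needs zero annotations.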