FACT: the first time you boot up a multi-socket system you must put on Dragostea Din Tei
nu mă, nu mă, nu mă iei (Romanian: "you don't, you don't, you don't take me")
[root@pano-1:~]# free -m
              total        used        free      shared  buff/cache   available
Mem:         451201        6189      446813          21         422      445012
Swap:             0           0           0

[root@pano-1:~]# nproc
256

ok here we go

i'm going to the /nix/store, does anybody need me to compile anything?
nice
this counter increments upsettingly fast
i bet if i run vivado on it, it's still going to take 20 minutes to run PwrOpt without producing a single console print
oh that was a Release build. a Debug build is even faster somehow. absurd

anyway the real purpose of doing that was so i could run a load test. compiling at full blast on all 256 cores, this thing heats up to a chilly... uh... barely 40°C... and while the fans are kind of loud, the noise is pleasant enough that i don't mind listening to it all day. sort of like a small air purifier on high

i'm going to call this a success and get a case for it then

literally as i typed that, the BMC decided to Angrily Beep at me because (i think) some sort of thermal threshold was exceeded, despite zero cores being above 60°C. motherfucker how do i turn this off

ohhhhhh

it is unhappy about the lack of airflow over the Vregs. i thought i might hit that problem a few days ago but completely blanked on checking how i could address it before doing the load test

i could not figure out how to turn it off via the web interface of the BMC. after i gave up and powered it off, i realized that it gave me (mild and probably temporary) tinnitus

man. this should have a hazard sign on it

i turned it on again and it continued screaming. man what the fuck
apparently the only way to clear the alert is to power cycle the entire system?? not even IPMI sel clear or bmc reset cold commands do it??? I will need hearing protection
I mean it's relatable, whomst of us did not decide to wake up and immediately start SCREAMING AT THE TOP OF THEIR LUNGS but cmon. Please do not blow my eardrums out or at least do it either slower or much quicker
do you think this is an enterprise-grade solution, or should i add some duct tape to it?
in case you're wondering, yes, this does cool the VRMs down to under 80°C, even at the full 48A@12V where it dissipates 576W on the CPUs alone
this is how i'm measuring power draw btw. the current clamp is surprisingly accurate
i think the manufacturer-intended way of doing this is some sort of special enterprise power supply connected to the motherboard via some sort of special enterprise cable. which makes sense, because the PSU already knows exactly how much power it's sending down. but it's lame that with a consumer unit i get no current sensing at all. unless i add it myself
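for anyone wanting to redo the clamp arithmetic: a minimal Python sketch of turning a clamp reading into watts (P = V × I). it assumes a DC-capable (Hall-effect) clamp placed around only the +12V conductors of the CPU power cables (clamping supply and return together makes their fields cancel and the meter reads roughly zero) and a rail sitting at nominal 12V. the function name and the NOMINAL_RAIL_V constant are made up for illustration; real rails sag a bit under load, so measuring the actual voltage gives a better number.

```python
# Minimal sketch: estimate DC power on a 12 V rail from a clamp-meter reading.
# Assumptions (hypothetical, not from any vendor tool): the clamp is DC-capable
# (Hall effect), it encloses only the +12 V conductors (not the ground returns,
# which would cancel the reading), and the rail sits at its nominal voltage.

NOMINAL_RAIL_V = 12.0  # hypothetical constant; measure the real rail if you can


def rail_power_watts(current_a: float, rail_v: float = NOMINAL_RAIL_V) -> float:
    """Power delivered over a DC rail, P = V * I."""
    if current_a < 0 or rail_v < 0:
        raise ValueError("current and voltage must be non-negative")
    return rail_v * current_a


if __name__ == "__main__":
    # the load-test figure from this thread: 48 A on the 12 V CPU rail
    print(f"{rail_power_watts(48.0):.0f} W")  # prints "576 W"
```

passing a measured rail voltage instead of the nominal one (e.g. rail_power_watts(48.0, rail_v=11.9)) is the easy way to account for droop under load.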
@whitequark Stokes' theorem doesn't lie.
@whitequark
This is the best measuring device I ever had.
@whitequark I have the same current clamp at work, it is indeed a very nice piece of hardware!

@whitequark I curse everyone who helped convince PC people that color-coded cables were "ketchup and mustard" and that that was bad somehow.

I have to look up pinouts and hope for the best when doing anything with them, but at least it's visually pleasing to someone I guess.

@whitequark add a few zip-ties and I think you'll pass inspection
@whitequark this is more secure than some other server fans i've seen. approved
@whitequark Reminds me of the time I hot-swapped the CPU fan in my desktop when the bearing turned into a noisemaker. I was too lazy to shut it off, so I just swapped it live...
@whitequark zipties through the motherboard mount points

@olasd @whitequark black zip ties! Not the cheap whitish stuff.

(That's how parts of the infra in my basement hold together)

@whitequark needs zip-ties for enterprise flair
@whitequark Just remember to use black duct tape. It looks more professional.

@whitequark Reminds me of when I installed a radiator meant for a manual car in my automatic car and had to jerry-rig a separate radiator for my automatic gearbox... which I then tied into the engine compartment with some random blue rope.

It lasted years and I sold it on like that 😅

@whitequark

Are the screws loose or is there still the "please remove before use" film between one of the coolers and the CPU?

Only then is it truly enterprise-quality hardware.

@whitequark maybe blu tack it to the RAM modules so it doesn’t wander off from RPM changes
@whitequark Polyimide tape over the beeper to attenuate future annoyances?
@whitequark Problem: datacenter is so loud no one can hear normal alert noises.
"Solution": SCREAMING! LOTS OF SCREAMING! SURELY WE WILL NOT REGRET SCREAMING!
@whitequark Catherine’s Cursed Computers c-Emporium

@whitequark

Cache coherency protocols hit a bunch of issues when you have more than 128 cores. It's less well known, but at 256 the wrong combination of messages can open a vortex to an eldritch dimension full of horrors. That can cause screaming.

@whitequark time to add a physical mute switch that disconnects the speaker?
@whitequark They make those damn things loud enough to be heard from outside the box when in a rack with 340 screaming 40mm fans.

@whitequark if you're case-shopping: is anyone making a Corsi-Rosenthal case yet?

(it seems like such an obvious thing now that we've been using CR boxes for years and the fans still look pretty much like they just came out of their original packaging)

@JamesWidman lmao yeah i had the thought
@JamesWidman this system is watercooled with just 3 fans so i don't think it'd make a good CR box
@whitequark @JamesWidman
I can definitely recommend this course of action. It worked hilariously well during the 2020 Australian bush fires.
Automotive HEPA and charcoal cabin filters come in a wild variety of shapes and sizes
@whitequark
That's a very short NINJA_STATUS! Mine is NINJA_STATUS=%p (%P) [%f:%s/%t] %o/s, %ws (%Ws), which gives some nicer progress estimates.
@whitequark cat i love you but please CW your porn /j
@whitequark holy shit that's a lot of cores and RAM
@privateger it's a donation; i'm going to build a community CI service on top of it
@whitequark "But will it run Crysis?"
@whitequark Time to install Gentoo again :-D
@whitequark one system closure with entirely too many rocm, please
@whitequark highway for powerpc
@izzy @whitequark classic powerpc problems "On large CPU-count Power systems (~2000 CPUs), SMT mode toggling via ppc64_cpu --smt={on,off} takes ~1 hour due to synchronize_rcu() calls during per-CPU hotplug operations."