Wow. nvidia-smi, an important monitoring tool for #Nvidia #GPUs, hangs indefinitely after ~66 days of uptime. That corresponds to when a count of "jiffies" (at 750 Hz) would overflow a 32-bit unsigned integer.
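A quick back-of-the-envelope check of that arithmetic, as a minimal sketch. The 750 Hz tick rate is the figure from this post, not something confirmed from the driver source, and the function names are made up for illustration.

```python
SECONDS_PER_DAY = 86_400


def seconds_until_wrap(tick_hz: float, counter_bits: int = 32) -> float:
    """Seconds until an unsigned counter of the given width wraps to zero."""
    return (2 ** counter_bits) / tick_hz


def split_days_hours(seconds: float) -> tuple[int, float]:
    """Split a duration into whole days plus fractional hours."""
    days, remainder = divmod(seconds, SECONDS_PER_DAY)
    return int(days), remainder / 3600


# Assumed tick rate from the post: 750 Hz.
days, hours = split_days_hours(seconds_until_wrap(750))
print(f"wrap after {days} days {hours:.1f} hours")  # wrap after 66 days 6.7 hours

# Working backwards from the reported 66 days 12 hours of uptime.
reported_seconds = 66 * SECONDS_PER_DAY + 12 * 3600
implied_hz = 2 ** 32 / reported_seconds
print(f"implied tick rate: {implied_hz:.1f} Hz")  # implied tick rate: 747.5 Hz
```

So a 32-bit counter at 750 Hz wraps a few hours short of the reported 66 days 12 hours, and the reported uptime implies a rate of roughly 747.5 Hz. Close enough to be suggestive, but not proof of the root cause.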

github.com/NVIDIA/open-gpu-kernel-modules/issues/971

#bug #overflow #ai
nvidia-smi hangs indefinitely after ~66 days 12 hours uptime with driver 570.133.20 OpenRM on B200 and kernel 6.6.0 · Issue #971 · NVIDIA/open-gpu-kernel-modules

NVIDIA Open GPU Kernel Modules Version [root@A11-R42-I61-42-5504045 ~]# cat /proc/driver/nvidia/params ResmanDebugLevel: 4294967295 RmLogonRC: 1 ModifyDeviceFiles: 1 DeviceFileUID: 0 DeviceFileGID:...
