Rust’s Standard Library on the GPU

https://www.vectorware.com/blog/rust-std-on-gpu/

Rust's standard library on the GPU

GPU code can now use Rust's standard library. We share the implementation approach and what this unlocks for GPU programming.

I think it is possible to run CPU code on GPU (including the whole OS), because GPU has registers, memory, arithmetic and branch instructions, and that should be enough. However, it will be able to use only several cores from many thousands because GPU cores are effectively wide SIMD cores, grouped into the clusters, and CPU-style code would use only single SIMD lane. Am I wrong?

GPUs having have thousands of cores is just a silly marketing newspeak.

They rebranded SIMD lanes "cores". For eaxmple Nvidia 5000 series GPUs have 50-170 SMs which are the equivalent of cpu cores there. So a more than desktops, less than bigger server CPUs. By this math each avx-512 cpu core has 16-64 "gpu cores".

Each SM can typically schedule 4 warps so it’s more like 400 “cores” each with 1024-bit SIMD instructions. If you look at it this way, they clearly outclass CPU architectures.

This level corresponds to SMT in CPUs I gather. So you can argue your 192 core EPYC server cpu has 384 "vCPUs" since execution resources per core are overprovisioned and when execution blocks waiting for eg memory another thread can run in its place. As Intel and AMD only do 2-way SMT this doesn't make the numbers go up as much.

The single GPU warp is both beefier and wimpier than the SMT thread: they're in-order barely superscalar, whereas on CPU side it's wide superscalar big-window OoO brainiac. But on the other hand the SM has wider SIMD execution resources and there's enough througput for several warps without blocking.

A major difference is how the execution resources are tuned to the expected workloads. CPU's run application code that likes big low latency caches and high single thread performance on branchy integer code, but it doesn't pay to put in execution resources for maximizing AVX-512 FP math instructions per cycle or increasing memory bandwidth indefinitely.

Right, but the CPU does not have a matrix multiply unit or high bandwidth memory.