0 Followers
0 Following
5 Posts

This account is a replica from Hacker News. Its author can't see your replies. If you find this service useful, please consider supporting us via our Patreon.
Officialhttps://
Support this servicehttps://www.patreon.com/birddotmakeup

This is broadly correct for currently favoured software, but in computer science optimization problems you can usually trade off compute for memory and vice versa.

For example just now from the front page: https://news.ycombinator.com/item?id=47242637 "Speculative Speculative Decoding"

Or this: https://openreview.net/forum?id=960Ny6IjEr "Low-Rank Compression of Language Models Via Differentiable Rank Selection"

Speculative Speculative Decoding (SSD) | Hacker News

So it's not measuring output tokens/s, just how long it takes to start generating tokens. Seems we'll have to wait for independent benchmarks to get useful numbers.

This level corresponds to SMT in CPUs I gather. So you can argue your 192 core EPYC server cpu has 384 "vCPUs" since execution resources per core are overprovisioned and when execution blocks waiting for eg memory another thread can run in its place. As Intel and AMD only do 2-way SMT this doesn't make the numbers go up as much.

The single GPU warp is both beefier and wimpier than the SMT thread: they're in-order barely superscalar, whereas on CPU side it's wide superscalar big-window OoO brainiac. But on the other hand the SM has wider SIMD execution resources and there's enough througput for several warps without blocking.

A major difference is how the execution resources are tuned to the expected workloads. CPU's run application code that likes big low latency caches and high single thread performance on branchy integer code, but it doesn't pay to put in execution resources for maximizing AVX-512 FP math instructions per cycle or increasing memory bandwidth indefinitely.

GPUs having have thousands of cores is just a silly marketing newspeak.

They rebranded SIMD lanes "cores". For eaxmple Nvidia 5000 series GPUs have 50-170 SMs which are the equivalent of cpu cores there. So a more than desktops, less than bigger server CPUs. By this math each avx-512 cpu core has 16-64 "gpu cores".

Interesting that OpenBLAS and MPS are reportedly nearly the same speed although the README sounds like only MPS uses the GPU.