TurboQuant model weight compression now graces #Llamacpp, but only if you speak fluent Metal! 🏋️‍♂️ Meanwhile, everyone else waits for TheTom to bless us with a #CUDA port, assuming he ever emerges from the GitHub labyrinth of Pull Request 45. How many engineers does it take to compress a llama? 🤔
https://github.com/TheTom/llama-cpp-turboquant/pull/45 #TurboQuant #Metal #PullRequest #HackerNews #ngated
feat: TQ4_1S weight compression (Metal only, needs CUDA port) by TheTom · Pull Request #45 · TheTom/llama-cpp-turboquant

Summary: TQ3_1S (3-bit, 4.0 BPW) and TQ4_1S (4-bit, 5.0 BPW) weight quantization using WHT rotation + Lloyd-Max centroids. V2.1 fused Metal kernel: zero threadgroup memory, cooperative SIMD rotation...
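
For anyone wondering what "WHT rotation + Lloyd-Max centroids" means in practice, here is a rough Python sketch of the general idea: rotate each block of weights with a Walsh-Hadamard transform so the values look roughly Gaussian, then snap each value to the nearest entry of a small codebook fitted with Lloyd-Max. This is not the PR's code. The block size of 32, the single RMS scale per block, the codebook trained on synthetic Gaussian samples, and every function name below are assumptions made for illustration only.

```python
import math
import random

def fwht(block):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    out = list(block)
    n = len(out)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]

def nearest(x, centroids):
    """Index of the centroid closest to x."""
    return min(range(len(centroids)), key=lambda k: abs(x - centroids[k]))

def lloyd_max(samples, levels, iters=25):
    """1-D Lloyd-Max: alternate nearest-centroid assignment and mean update."""
    srt = sorted(samples)
    # Start from evenly spaced quantiles of the training samples.
    centroids = [srt[(2 * k + 1) * len(srt) // (2 * levels)] for k in range(levels)]
    for _ in range(iters):
        buckets = [[] for _ in range(levels)]
        for x in samples:
            buckets[nearest(x, centroids)].append(x)
        # Keep the old centroid if a bucket happens to be empty.
        centroids = [sum(b) / len(b) if b else c for b, c in zip(buckets, centroids)]
    return sorted(centroids)

def quantize_block(block, centroids):
    """Rotate, normalize by the block RMS (assumed scheme), and pick codebook indices."""
    rotated = fwht(block)
    rms = math.sqrt(sum(v * v for v in rotated) / len(rotated)) or 1.0
    codes = [nearest(v / rms, centroids) for v in rotated]
    return rms, codes

def dequantize_block(rms, codes, centroids):
    """Look up centroids, rescale, and undo the rotation."""
    rotated = [centroids[c] * rms for c in codes]
    return fwht(rotated)  # the orthonormal WHT is its own inverse
```

To try it, fit a 16-entry (4-bit) codebook with `lloyd_max` on a few thousand `random.gauss(0, 1)` samples, then pass a 32-value block through `quantize_block` and `dequantize_block` and compare the result with the input. A real kernel would also pack the codes into bits and store the scale in reduced precision, which this sketch skips.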
