Running LLMs locally on consumer hardware is getting closer...
"TurboQuant proved it can quantize the key-value cache to just 3 bits without requiring training or fine-tuning and causing any compromise in model accuracy, all while achieving a faster runtime than the original LLMs"
https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/
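
For a rough sense of why a 3-bit KV cache matters for local inference, here is a back-of-the-envelope sketch of cache size at 16 bits versus 3 bits per element. The model shape (layers, KV heads, head dimension, context length) is hypothetical and picked only for illustration. It is not taken from the blog post, and the estimate ignores any per-block metadata a real quantizer may store.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bits):
    """Approximate KV cache size in bytes for a single sequence."""
    # Two tensors (keys and values) per layer, one vector per KV head per token.
    elements = 2 * layers * kv_heads * head_dim * seq_len
    return elements * bits / 8


# Hypothetical model shape (assumption, for illustration only).
shape = dict(layers=32, kv_heads=8, head_dim=128, seq_len=32_768)

fp16 = kv_cache_bytes(**shape, bits=16)
q3 = kv_cache_bytes(**shape, bits=3)

print(f"16-bit cache: {fp16 / 2**30:.2f} GiB")
print(f" 3-bit cache: {q3 / 2**30:.2f} GiB")
print(f"reduction:    {fp16 / q3:.2f}x")
```

Under these assumed dimensions the cache drops from about 4 GiB to about 0.75 GiB, which is memory freed up for weights or a longer context on a machine with limited RAM or VRAM.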
