I'd wanted to try renting a GPU for an open-weight model for a while, specifically on RunPod. With Gemma 4 released, I finally had a reason to. It works, though it was a bit clumsy. Here is a container for y'all to try Gemma 4 31B in serverless mode with llama.cpp and an Unsloth 8-bit quant.
It seems to be a charming, cheap, and privacy-preserving way to run LLMs. I might try the smaller models for even better efficiency once I've come up with a systematic way to evaluate them. https://github.com/burakemir/runpod-gemma4
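
If you want a feel for the serverless glue before cloning the repo, here's a minimal sketch of a RunPod handler that forwards a prompt to a llama.cpp server running in the same container. The port, request fields, and model-serving setup here are my assumptions for illustration, not copied from the repo.

```python
# Sketch of a RunPod serverless handler fronting a local llama.cpp server.
# Assumes the container entrypoint has already started llama-server on port 8080
# with the Gemma GGUF loaded; none of this is taken verbatim from the repo.
import requests
import runpod

LLAMA_SERVER = "http://127.0.0.1:8080"  # assumed port for the in-container llama-server

def handler(job):
    inp = job["input"]
    # llama.cpp's built-in server exposes a /completion endpoint that accepts
    # a prompt and a token budget (n_predict) and returns the text in "content".
    resp = requests.post(
        f"{LLAMA_SERVER}/completion",
        json={
            "prompt": inp.get("prompt", ""),
            "n_predict": inp.get("n_predict", 256),
        },
        timeout=600,  # generous timeout: cold starts plus 31B-class generation
    )
    resp.raise_for_status()
    return {"completion": resp.json().get("content", "")}

# Hand the handler to the RunPod serverless runtime.
runpod.serverless.start({"handler": handler})
```

Once deployed, you'd call it through RunPod's endpoint API, roughly a POST to the endpoint's /runsync route with a body like {"input": {"prompt": "..."}}.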




