I'd wanted to try renting a GPU for an open-weight model for a while, specifically on RunPod. With Gemma 4 released, I finally had a reason to try. It works, though it was a bit clumsy. Here is a container for y'all to try Gemma 4 31B in serverless with llama.cpp and the Unsloth 8-bit quant.

It seems to be a charming, cheap, and privacy-preserving way to run LLMs. I might try the smaller models for even better efficiency once I've thought of a systematic way to evaluate them. https://github.com/burakemir/runpod-gemma4
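For anyone who wants to poke at a worker like this, here is a minimal sketch of calling a RunPod serverless endpoint over its HTTP API. The endpoint ID, API key, and the exact input schema are placeholders / assumptions — check the repo's handler for the real fields the worker expects.

```python
# Minimal sketch of hitting RunPod's synchronous /runsync route.
# ENDPOINT_ID, API_KEY, and the {"prompt": ...} input schema are
# assumptions for illustration; the actual worker may differ.
import json
import urllib.request

def build_runsync_request(endpoint_id: str, api_key: str, prompt: str):
    """Build a urllib Request for RunPod's /runsync route."""
    url = f"https://api.runpod.ai/v2/{endpoint_id}/runsync"
    body = json.dumps({"input": {"prompt": prompt}}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# Uncomment to actually send the request:
# req = build_runsync_request("MY_ENDPOINT_ID", "MY_API_KEY", "Hello, Gemma!")
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```

The `/runsync` route blocks until the worker finishes; for long generations the async `/run` route plus polling is the usual alternative.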

@burakemir Very interesting! I've been dipping my toes into local models, but mostly using OpenRouter. This looks like a good entry point to try out RunPod :D