Google releases Gemma 4 open models
Google releases Gemma 4 open models
Comparison of Gemma 4 vs. Qwen 3.5 benchmarks, consolidated from their respective Hugging Face model cards:
| Model | MMLUP | GPQA | LCB | ELO | TAU2 | MMMLU | HLE-n | HLE-t |
|----------------|-------|-------|-------|------|-------|-------|-------|-------|
| G4 31B | 85.2% | 84.3% | 80.0% | 2150 | 76.9% | 88.4% | 19.5% | 26.5% |
| G4 26B A4B | 82.6% | 82.3% | 77.1% | 1718 | 68.2% | 86.3% | 8.7% | 17.2% |
| G4 E4B | 69.4% | 58.6% | 52.0% | 940 | 42.2% | 76.6% | - | - |
| G4 E2B | 60.0% | 43.4% | 44.0% | 633 | 24.5% | 67.4% | - | - |
| G3 27B no-T | 67.6% | 42.4% | 29.1% | 110 | 16.2% | 70.7% | - | - |
| GPT-5-mini | 83.7% | 82.8% | 80.5% | 2160 | 69.8% | 86.2% | 19.4% | 35.8% |
| GPT-OSS-120B | 80.8% | 80.1% | 82.7% | 2157 | -- | 78.2% | 14.9% | 19.0% |
| Q3-235B-A22B | 84.4% | 81.1% | 75.1% | 2146 | 58.5% | 83.4% | 18.2% | -- |
| Q3.5-122B-A10B | 86.7% | 86.6% | 78.9% | 2100 | 79.5% | 86.7% | 25.3% | 47.5% |
| Q3.5-27B | 86.1% | 85.5% | 80.7% | 1899 | 79.0% | 85.9% | 24.3% | 48.5% |
| Q3.5-35B-A3B | 85.3% | 84.2% | 74.6% | 2028 | 81.2% | 85.2% | 22.4% | 47.4% | MMLUP: MMLU-Pro
GPQA: GPQA Diamond
LCB: LiveCodeBench v6
ELO: Codeforces ELO
TAU2: TAU2-Bench
MMMLU: MMMLU
HLE-n: Humanity's Last Exam (no tools / CoT)
HLE-t: Humanity's Last Exam (with search / tool)
no-T: no think
Wild differences in ELO compared to tfa's graph: https://storage.googleapis.com/gdm-deepmind-com-prod-public/...
(Comparing Q3.5-27B to G4 26B A4B and G4 31B specifically)
I'd assume Q3.5-35B-A3B would performe worse than the Q3.5 deep 27B model, but the cards you pasted above, somehow show that for ELO and TAU2 it's the other way around...
Very impressed by unsloth's team releasing the GGUF so quickly, if that's like the qwen 3.5, I'll wait a few more days in case they make a major update.
Overall great news if it's at parity or slightly better than Qwen 3.5 open weights, hope to see both of these evolve in the sub-32GB-RAM space. Disappointed in Mistral/Ministral being so far behind these US & Chinese models
> Wild differences in ELO compared to tfa's graph
Because those are two different, completely independent Elos... the one you linked is for LMArena, not Codeforces.
> Very impressed by unsloth's team releasing the GGUF so quickly, if that's like the qwen 3.5, I'll wait a few more days in case they make a major update.
Same here. I can't wait until mlx-community releases MLX optimized versions of these models as well, but happily running the GGUFs in the meantime!
Edit: And looks like some of them are up!
You're conflating lmarena ELO scores.
Qwen actually has a higher ELO there. The top Pareto frontier open models are:
model |elo |price
qwen3.5-397b-a17b |1449 |$1.85
glm-4.7 |1443 | 1.41
deepseek-v3.2-exp-thinking |1425 | 0.38
deepseek-v3.2 |1424 | 0.35
mimo-v2-flash (non-thinking) |1393 | 0.24
gemma-3-27b-it |1365 | 0.14
gemma-3-12b-it |1341 | 0.11
gpt-oss-20b |1318 | 0.09
gemma-3n-e4b-it |1318 | 0.03
What Gemma seems to have done is dominate the extreme cheap end of the market. Which IMO is probably the most important and overlooked segment
I just tried with llama.cpp RTX4090 (24GB) GGUF unsloth quant UD_Q4_K_XL
You can probably run them all. G4 31B runs at ~5tok/s , G4 26B A4B runs at ~150 tok/s.
You can run Q3.5-35B-A3B at ~100 tok/s.
I tried G4 26B A4B as a drop-in replacement of Q3.5-35B-A3B for some custom agents and G4 doesn't respect the prompt rules at all. (I added <|think|> in the system prompt as described (but have not spend time checking if the reasoning was effectively on). I'll need to investigate further but it doesn't seem promising.
I also tried G4 26B A4B with images in the webui, and it works quite well.
I have not yet tried the smaller models with audio.
Reversing the X and Y axis, adding in a few other random models, and dropping all the small Qwens makes this worse than useless as a Qwen 3.5 comparison, it’s actively misleading. If you’re using AI, please don’t rush to copy paste output :/
EDIT: Lordy, the small models are a shadow of Qwen's smalls. See https://huggingface.co/Qwen/Qwen3.5-4B versus https://www.reddit.com/r/LocalLLaMA/comments/1salgre/gemma_4...
I transposed the table so that it's readable on mobile devices.
I should have mentioned that the Qwen 3.5 benchmarks were from the Qwen3.5-122B-A10B model card (which includes GPT-5-mini and GPT-OSS-120B); apologies for not including the smaller Qwen 3.5 models.