Google Built Gemma 4 12B Without Multimodal Encoders
https://firethering.com/google-built-gemma-4-12b-without-multimodal-encoders/

Every multimodal model you've used follows the same basic design. Text goes in one way, images go through a vision encoder first, audio goes through an audio encoder first, and then everything gets handed off to the language model in a form it can work with. The encoders are load-bearing, and you don't just remove them.

Google actually removed them.

Gemma 4 12B takes raw image patches and raw audio waveforms and projects them directly into the same embedding space as text tokens. There is no vision encoder and no audio encoder, just one decoder handling everything.
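To make the idea concrete, here is a minimal NumPy sketch of what "projecting raw patches and waveforms into the text embedding space" could look like. It is not Gemma's code. The article does not say how the projection is done, so a single linear projection per modality is an assumption, and every name and size here (`D_MODEL`, `PATCH`, `FRAME`, `build_sequence`, the random weights) is hypothetical.

```python
import numpy as np

# All sizes are hypothetical, chosen only to keep the example small.
D_MODEL = 64   # shared embedding width
VOCAB = 1000   # text vocabulary size
PATCH = 16     # image patch side length, in pixels
FRAME = 400    # audio samples per frame

rng = np.random.default_rng(0)

# Random stand-ins for trained weights: one table or matrix per modality.
# Assumption: a plain linear projection; the real model may do something else.
text_embed = rng.normal(size=(VOCAB, D_MODEL))
patch_proj = rng.normal(size=(PATCH * PATCH * 3, D_MODEL))
audio_proj = rng.normal(size=(FRAME, D_MODEL))


def embed_text(token_ids):
    """Look up one embedding row per token id."""
    return text_embed[np.asarray(token_ids)]


def embed_image(image):
    """Cut an (H, W, 3) image into non-overlapping patches and project each."""
    h, w, c = image.shape
    rows, cols = h // PATCH, w // PATCH
    image = image[: rows * PATCH, : cols * PATCH]  # drop any ragged border
    patches = image.reshape(rows, PATCH, cols, PATCH, c).transpose(0, 2, 1, 3, 4)
    flat = patches.reshape(rows * cols, PATCH * PATCH * c)
    return flat @ patch_proj


def embed_audio(waveform):
    """Cut a 1-D waveform into fixed-length frames and project each."""
    n_frames = len(waveform) // FRAME
    frames = waveform[: n_frames * FRAME].reshape(n_frames, FRAME)
    return frames @ audio_proj


def build_sequence(token_ids, image, waveform):
    """Concatenate all modalities into one sequence for a single decoder."""
    return np.concatenate(
        [embed_text(token_ids), embed_image(image), embed_audio(waveform)],
        axis=0,
    )
```

In this sketch, calling `build_sequence` with three token ids, a 32x48 RGB image, and 1,600 audio samples yields one array of `D_MODEL`-wide rows. The point is that the decoder would see a single sequence, with no separate encoder network sitting between the raw input and the embeddings.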