<li>How to enable efficient collaboration between different models?</li>
<li>How to secure systems from jailbreaks, PII leaks, and hallucinations?</li>
<li>How to collect valuable signals and build a self-learning system?</li>
</ol>

<p>With <strong>vLLM Semantic Router (vLLM-SR) v0.1</strong>, we’ve deployed a live MoM system on AMD <strong>MI300X/MI355X</strong> GPUs that demonstrates these capabilities in action—routing queries across 6 specialized models using 8 signal types and 11 decision rules, with a measurable performance boost.</p>

<p><strong>🎮 Try it live: <a href="https://twp.ai/13fZUKY">https://twp.ai/13fZUKY</a></strong></p>

<h2 id="table-of-contents">Table of Contents</h2>

<ul>

<li><a href="#mixture-of-models-vs-mixture-of-experts">Mixture-of-Models vs Mixture-of-Experts</a></li>
<li><a href="#the-mom-design-philosophy">The MoM Design Philosophy</a></li>
<li><a href="#live-demo-on-amd-gpus">Live Demo on AMD GPUs</a></li>
<li><a href="#signal-based-routing">Signal-Based Routing</a></li>
<li><a href="#deploy-your-own">Deploy Your Own</a></li>
</ul>

<hr />

<h2 id="mixture-of-models-vs-mixture-of-experts">Mixture-of-Models vs Mixture-of-Experts</h2>

<p>Before diving in, let’s clarify a common confusion: <strong>MoM is not MoE</strong>.</p>

<p><img src="/assets/figures/semantic-router/mom-1.png" alt="" /></p>

<h3 id="mixture-of-experts-moe-intra-model-routing">Mixture-of-Experts (MoE): Intra-Model Routing</h3>

<p>MoE is an <strong>architecture pattern inside a single model</strong>. Models like Mixtral, DeepSeek-V3, and Qwen3-MoE use sparse activation—for each token, only a subset of “expert” layers is activated, chosen by a learned gating function.</p>

<p><strong>Key characteristics:</strong></p>

<ul>
<li>Routing happens at the <strong>token level</strong>, inside the forward pass</li>
<li>Router is <strong>learned during training</strong>, not configurable</li>
<li>All experts share the same training objective</li>

<li>Reduces compute per token while maintaining capacity</li>
</ul>
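<p>To make the token-level idea concrete, here is an illustrative pure-Python sketch of sparse top-k gating. Real MoE layers use learned gate matrices and neural-network experts; the linear gate, toy experts, and function names below are invented for illustration only.</p>

```python
# Illustrative sketch of MoE token-level gating (not any real model's code):
# a gate scores every expert per token, and only the top-k experts run.
import math

def moe_forward(token, experts, gate_weights, k=2):
    """Route one token vector through the top-k experts of a sparse MoE layer."""
    # One gating score per expert (dot product of a gate row with the token).
    scores = [sum(w * x for w, x in zip(row, token)) for row in gate_weights]
    topk = sorted(range(len(experts)), key=lambda i: scores[i])[-k:]
    exps = [math.exp(scores[i]) for i in topk]
    norm = sum(exps)  # softmax normalization over the selected experts only
    out = [0.0] * len(token)
    for i, e in zip(topk, exps):
        # Only the chosen experts are evaluated -- this is the sparse activation.
        y = experts[i](token)
        out = [o + (e / norm) * yi for o, yi in zip(out, y)]
    return out

# Toy example: 3 experts over 2-dim tokens; expert i scales the token by i + 1.
experts = [lambda t, s=i + 1: [s * x for x in t] for i in range(3)]
gate_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out_vec = moe_forward([1.0, 1.0], experts, gate_weights, k=2)
print(out_vec)
```

<p>The point of the sketch: the gating decision happens per token, inside the layer, and its weights are fixed once training ends—there is nothing for an operator to configure.</p>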

<h3 id="mixture-of-models-mom-inter-model-orchestration">Mixture-of-Models (MoM): Inter-Model Orchestration</h3>

<p>MoM is a <strong>system architecture pattern</strong> that orchestrates multiple independent models. Each model can have different architectures, training data, capabilities, and even run on different hardware.</p>

<p><strong>Key characteristics:</strong></p>

<ul>

<li>Routing happens at the <strong>request level</strong>, before inference</li>
<li>Router is <strong>configurable at runtime</strong> via signals and rules</li>
<li>Models can have completely different specializations</li>
<li>Enables cost optimization, safety filtering, and capability matching</li>
</ul>
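<p>Request-level routing can be sketched as priority-ordered rules over extracted signals. The rule tuples and signal names below echo the demo described later in this post, but the schema is hypothetical—vLLM-SR’s actual configuration format may differ.</p>

```python
# Hypothetical sketch of request-level MoM routing (not vLLM-SR's real API):
# rules are plain data, evaluated by priority, so they can change at runtime.

RULES = [  # (priority, required signals, target model)
    (200, {"keyword": "jailbreak_attempt"}, "gpt-oss-20b"),
    (180, {"embedding": "deep_thinking", "language": "zh"}, "Qwen3-235B"),
    (150, {"domain": "math"}, "Qwen3-235B"),
    (140, {"embedding": "deep_thinking", "language": "en"}, "Kimi-K2-Thinking"),
]
DEFAULT_MODEL = "gpt-oss-120b"  # general-purpose fallback

def route(signals: dict) -> str:
    """Pick the highest-priority rule whose required signals all match."""
    for _, required, model in sorted(RULES, key=lambda r: -r[0]):
        if all(signals.get(k) == v for k, v in required.items()):
            return model
    return DEFAULT_MODEL

print(route({"domain": "math", "language": "en"}))              # Qwen3-235B
print(route({"embedding": "deep_thinking", "language": "en"}))  # Kimi-K2-Thinking
print(route({"language": "fr"}))                                # gpt-oss-120b
```

<p>Because the rules are data rather than learned weights, an operator can add a model or retune a priority without retraining anything—this is the configurability that distinguishes MoM from MoE.</p>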

<h3 id="why-this-distinction-matters">Why This Distinction Matters</h3>

<table>
<thead>
<tr>
<th>Aspect</th>
<th>MoE</th>
<th>MoM</th>
</tr>
</thead>

<tbody>
<tr>
<td><strong>Scope</strong></td>
<td>Single model architecture</td>
<td>Multi-model system design</td>
</tr>
<tr>
<td><strong>Routing granularity</strong></td>
<td>Per-token</td>
<td>Per-request</td>
</tr>
<tr>
<td><strong>Configurability</strong></td>
<td>Fixed after training</td>
<td>Runtime configurable</td>
</tr>
<tr>
<td><strong>Model diversity</strong></td>
<td>Same architecture</td>

<td>Any architecture</td>
</tr>
<tr>
<td><strong>Use case</strong></td>
<td>Efficient scaling</td>
<td>Capability orchestration</td>
</tr>
</tbody>
</table>

<p><strong>The insight</strong>: MoE and MoM are complementary. You can use MoE models (like Qwen3-30B-A3B) as components within a MoM system—getting the best of both worlds.</p>

<p><img src="/assets/figures/semantic-router/mom-0.png" alt="" /></p>

<hr />

<h2 id="the-mom-design-philosophy">The MoM Design Philosophy</h2>

<h3 id="why-not-just-use-one-big-model">Why Not Just Use One Big Model?</h3>

<p>The “one model to rule them all” approach has fundamental limitations:</p>

<ol>
<li><strong>Cost inefficiency</strong>: A 405B model processing “What’s 2+2?” wastes 99% of its capacity</li>
<li><strong>Capability mismatch</strong>: No single model excels at everything—math, code, creative writing, multilingual</li>

<li><strong>Latency variance</strong>: Simple queries don’t need 10-second reasoning chains</li>
<li><strong>No separation of concerns</strong>: Safety, caching, and routing logic baked into prompts</li>
</ol>

<h3 id="the-mom-solution-collective-intelligence">The MoM Solution: Collective Intelligence</h3>

<p>MoM treats AI deployment like building a <strong>team of specialists</strong> with a smart dispatcher:</p>

<p><img src="/assets/figures/semantic-router/mom-2.png" alt="" /></p>

<p><strong>Core Principles:</strong></p>

<ol>
<li><strong>Signal-Driven Decisions</strong>: Extract semantic signals (intent, domain, language, complexity) before routing</li>
<li><strong>Capability Matching</strong>: Route math to math-optimized models, code to code-optimized models</li>
<li><strong>Cost-Aware Scheduling</strong>: Simple queries → small/fast models; complex queries → large/reasoning models</li>

<li><strong>Safety as Infrastructure</strong>: Jailbreak detection, PII filtering, and fact-checking as first-class routing signals</li>
</ol>
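<p>Principle 3 can be boiled down to a tiering decision. vLLM-SR uses learned signal classifiers for this; the toy heuristic below, keyed off query length and a few trigger words, is purely illustrative, and the model names are placeholders.</p>

```python
# Illustrative cost-aware scheduling heuristic (vLLM-SR uses learned
# classifiers; this toy version keys off length and reasoning trigger words).

REASONING_HINTS = ("prove", "step by step", "derive", "analyze")

def pick_tier(query: str) -> str:
    """Send cheap queries to a small model, hard ones to a reasoning model."""
    hard = len(query.split()) > 40 or any(h in query.lower() for h in REASONING_HINTS)
    return "large-reasoning-model" if hard else "small-fast-model"

print(pick_tier("What's 2+2?"))                       # small-fast-model
print(pick_tier("Prove that sqrt(2) is irrational"))  # large-reasoning-model
```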

<hr />

<h2 id="live-demo-on-amd-gpus">Live Demo on AMD GPUs</h2>

<p>We’ve deployed a live demo system powered by <strong>AMD MI300X GPUs</strong> that showcases the full MoM architecture:</p>

<p><strong>🎮 <a href="https://twp.ai/18MEjWc">https://twp.ai/18MEjWc</a></strong></p>


<p><img src="/assets/figures/semantic-router/mom-4.png" alt="Live Demo on AMD GPUs" /></p>

<h3 id="the-demo-system-architecture">The Demo System Architecture</h3>

<p>The AMD demo system implements a complete MoM pipeline with <strong>6 specialized models</strong> and <strong>11 routing decisions</strong>:</p>

<p><strong>Models in the Pool:</strong></p>

<table>
<thead>
<tr>
<th>Model</th>
<th>Size</th>
<th>Specialization</th>
</tr>
</thead>
<tbody>
<tr>

<td><strong>Qwen3-235B</strong></td>
<td>235B</td>
<td>Complex reasoning (Chinese), Math, Creative</td>
</tr>
<tr>
<td><strong>DeepSeek-V3.2</strong></td>
<td>320B</td>
<td>Code generation and analysis</td>
</tr>
<tr>
<td><strong>Kimi-K2-Thinking</strong></td>
<td>200B</td>
<td>Deep reasoning (English)</td>
</tr>
<tr>
<td><strong>GLM-4.7</strong></td>
<td>47B</td>
<td>Physics and science</td>

</tr>
<tr>
<td><strong>gpt-oss-120b</strong></td>
<td>120B</td>
<td>General purpose, default fallback</td>
</tr>
<tr>
<td><strong>gpt-oss-20b</strong></td>
<td>20B</td>
<td>Fast QA, security responses</td>
</tr>
</tbody>
</table>
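<p>The model pool above can be written as data that a router loads at runtime. The schema and skill tags below are hypothetical (vLLM-SR’s real configuration format may differ); the sketch also shows one natural policy on top of such a pool—prefer the smallest model that covers a skill.</p>

```python
# The model pool above as runtime-loadable data (hypothetical schema).

MODEL_POOL = {
    "Qwen3-235B":       {"size_b": 235, "skills": {"reasoning-zh", "math", "creative"}},
    "DeepSeek-V3.2":    {"size_b": 320, "skills": {"code"}},
    "Kimi-K2-Thinking": {"size_b": 200, "skills": {"reasoning-en"}},
    "GLM-4.7":          {"size_b": 47,  "skills": {"physics", "science"}},
    "gpt-oss-120b":     {"size_b": 120, "skills": {"general"}},
    "gpt-oss-20b":      {"size_b": 20,  "skills": {"fast-qa", "security"}},
}

def cheapest_with(skill: str) -> str:
    """Among models offering a skill, prefer the smallest (cheapest) one."""
    matches = [(m["size_b"], name) for name, m in MODEL_POOL.items()
               if skill in m["skills"]]
    return min(matches)[1]

print(cheapest_with("math"))      # Qwen3-235B
print(cheapest_with("security"))  # gpt-oss-20b
```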

<p><strong>Routing Decision Matrix:</strong></p>

<table>
<thead>
<tr>
<th>Priority</th>
<th>Decision</th>
<th>Trigger Signals</th>
<th>Target Model</th>
<th>Reasoning Effort</th>

</tr>
</thead>
<tbody>
<tr>
<td>200</td>
<td><code class="language-plaintext highlighter-rouge">guardrails</code></td>
<td><code class="language-plaintext highlighter-rouge">keyword: jailbreak_attempt</code></td>
<td>gpt-oss-20b</td>
<td>off</td>
</tr>
<tr>
<td>180</td>
<td><code class="language-plaintext highlighter-rouge">complex_reasoning</code></td>
<td><code class="language-plaintext highlighter-rouge">embedding: deep_thinking</code> + <code class="language-plaintext highlighter-rouge">language: zh</code></td>
<td>Qwen3-235B</td>
<td>high</td>
</tr>
<tr>
<td>160</td>
<td><code class="language-plaintext highlighter-rouge">creative_ideas</code></td>
<td><code class="language-plaintext highlighter-rouge">keyword: creative</code> + <code class="language-plaintext highlighter-rouge">fact_check: no_check_needed</code></td>
<td>Qwen3-235B</td>
<td>high</td>
</tr>
<tr>
<td>150</td>
<td><code class="language-plaintext highlighter-rouge">math_problems</code></td>
<td><code class="language-plaintext highlighter-rouge">domain: math</code></td>
<td>Qwen3-235B</td>
<td>high</td>
</tr>
<tr>
<td>145</td>
<td><code class="language-plaintext highlighter-rouge">code_deep_thinking</code></td>
<td><code class="language-plaintext highlighter-rouge">domain: computer_science</code> + <code class="language-plaintext highlighter-rouge">embedding: deep_thinking</code></td>
<td>DeepSeek-V3.2</td>
<td>high</td>
</tr>
<tr>
<td>145</td>
<td><code class="language-plaintext highlighter-rouge">physics_problems</code></td>
<td><code class="language-plaintext highlighter-rouge">domain: physics</code></td>
<td>GLM-4.7</td>
<td>medium</td>
</tr>
<tr>
<td>140</td>
<td><code class="language-plaintext highlighter-rouge">deep_thinking</code></td>
<td><code class="language-plaintext highlighter-rouge">embedding: deep_thinking</code> + <code class="language-plaintext highlighter-rouge">language: en</code></td>
<td>Kimi-K2-Thinking</td>
<td>high</td>
</tr>
<tr>
<td>135</td>
<td><code class="language-plaintext highlighter-rouge">fast_coding</code></td>
<td><code class="language-plaintext highlighter-rouge">domain: computer_science</code> + <code class="language-plaintext highlighter-rouge">language: en</code></td>
<td>gpt-oss-120b</td>
<td>low</td>
</tr>
<tr>
<td>130</td>
<td><code class="language-plaintext highlighter-rouge">fast_qa_chinese</code></td>