Interlocking specialized models: routing and merging domain experts for compound AI systems.

The limits of a single head

The common approach is to take the largest available model, fine-tune it on the domain, and ship. That works well for open-ended tasks such as summarization, translation, and chat. It disappoints in narrow, high-stakes verticals.

A model fine-tuned jointly on, say, legal compliance and agricultural sensor data often ends up mediocre at both. The training signals interfere, a phenomenon the literature calls negative transfer. Legal reasoning rewards precise citation, cautious hedging, and adherence to jurisdictional rules; modeling sensor data across space and time rewards tolerance for noisy, drifting inputs. Forcing both into one parameter budget asks the optimizer to satisfy objectives that pull in different directions.

The public work here, from the Branch-Train-Merge line to the broader Mixture-of-Experts and model-merging literature, points the same way: specialization helps because it relieves this interference. The size of the effect depends on how distinct the domains are, how the experts are trained, and how the system is evaluated, so we avoid quoting a single headline number. The direction holds: when domains genuinely diverge, dedicating capacity to each recovers quality that a single shared head gives up.

Interlocking architecture: design principles

These systems are not ensembles. An ensemble runs every model on every query and averages the results. An interlocking system is sparse by design: for a given input, only the relevant expert (or a small set of them) is activated. The view we take at Auxerta is that the foundation model is a shared trunk, and the experts are heads that attach to it through stable interfaces. Three components carry the design.

1. The router

A lightweight classifier inspects each incoming query and decides which expert should handle it. This is architecturally close to the gating function in Mixture-of-Experts models such as the Switch Transformer and Mixtral, but it operates at the system level rather than inside a layer.

The router is usually a small transformer, or even a simple classifier over query embeddings. It has to be fast and accurate, and above all calibrated: it should recognize when a query falls outside every expert domain and escalate rather than guess. Misrouting is the most common failure in these systems, so a router that knows the limits of its own confidence is more useful than one that is marginally more accurate but overconfident at the boundary.
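A minimal sketch of the abstain-and-escalate behavior, assuming the router already produces one raw score per expert. The names `route`, `ESCALATE`, `min_confidence`, and `temperature` are hypothetical and not taken from any system described here. Temperature scaling is one common calibration technique, used here only for illustration, and the threshold would have to be tuned on held-out traffic.

```python
import math
from typing import Dict

ESCALATE = "escalate"  # hypothetical sentinel: no expert is trusted with this query


def softmax(scores: Dict[str, float], temperature: float = 1.0) -> Dict[str, float]:
    """Turn raw per-expert scores into probabilities that sum to 1."""
    if not scores:
        raise ValueError("scores must not be empty")
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {name: math.exp((s - peak) / temperature) for name, s in scores.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}


def route(
    scores: Dict[str, float],
    min_confidence: float = 0.7,  # assumed threshold, tune per deployment
    temperature: float = 1.0,     # assumed calibration knob (temperature scaling)
) -> str:
    """Return the chosen expert, or ESCALATE when the router is not confident."""
    probs = softmax(scores, temperature)
    best = max(probs, key=probs.get)
    return best if probs[best] >= min_confidence else ESCALATE
```

With these defaults, `route({"legal": 4.0, "telemetry": 0.5})` returns `"legal"`, while a near-tie such as `route({"legal": 1.0, "telemetry": 0.9})` returns `ESCALATE` instead of guessing.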

2. Domain expert heads

Each head adapts the shared trunk to one target domain. A useful property: the heads need not be identical in shape. A risk head and a telemetry head can differ in capacity, adapter rank, and the data they were tuned on, while sharing the trunk that already carries general competence. The heterogeneity is deliberate: each head is sized for its own domain rather than for the union of all of them.

This has direct cost implications. A handful of compact experts attached to a shared trunk can be cheaper to serve than one very large generalist sized for peak demand, because most queries activate only a small slice of the system. The economics depend on the routing distribution, but the shape of the trade-off is consistent: you pay for the capacity you actually use.
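The trade-off can be made concrete with a back-of-the-envelope calculation. The sketch below assumes each query pays for the router, the shared trunk, and exactly one head, and that costs are expressed in arbitrary relative units. The function name and every number in the usage note are hypothetical.

```python
from typing import Dict


def expected_cost_per_query(
    routing_share: Dict[str, float],  # fraction of traffic sent to each head
    head_cost: Dict[str, float],      # relative cost of running each head once
    trunk_cost: float,                # relative cost of the shared trunk pass
    router_cost: float = 0.0,         # relative cost of the routing decision
) -> float:
    """Expected cost of one query when exactly one head is activated."""
    if abs(sum(routing_share.values()) - 1.0) > 1e-9:
        raise ValueError("routing shares must sum to 1")
    head_term = sum(share * head_cost[name] for name, share in routing_share.items())
    return router_cost + trunk_cost + head_term
```

As an illustration with made-up figures: if 70% of traffic goes to a head costing 1.0 and 30% to a head costing 4.0, with a trunk cost of 2.0 and a router cost of 0.1, the expected cost is about 4.0 units per query. The comparison against a monolith then depends entirely on what the monolith costs per query, which is why the routing distribution matters.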

3. The merge layer

Some queries span domains. A fraud investigation might need both transaction-pattern analysis and regulatory-compliance reasoning. The merge layer handles these compound queries by orchestrating multiple experts and reconciling their outputs.

This is the hardest component to get right, and it is where most of the open research sits. Naive concatenation of expert outputs produces incoherent answers. Two families of approach show up in the public literature: parameter-space merging, where expert weights are fused into a single set of parameters (knowledge fusion, model merging), and output-space synthesis, where each expert emits a structured result (findings, evidence, a confidence estimate) and a lightweight coordinator composes them. The coordinator is itself specialized, but its job is composition, not domain expertise. Which family wins depends on whether the experts share a lineage and how much the domains overlap; we treat this as an empirical question per deployment rather than a settled one.
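To illustrate output-space synthesis only (not parameter-space merging), here is a minimal sketch of a coordinator that composes structured expert results. The `ExpertResult` and `Composite` types, the `compose` function, and the confidence threshold are all hypothetical. A real coordinator would also need to reconcile conflicting findings, which this sketch does not attempt.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ExpertResult:
    expert: str
    findings: List[str]
    evidence: List[str]
    confidence: float  # the expert's own estimate in [0, 1]


@dataclass
class Composite:
    sections: List[ExpertResult]  # kept results, most confident first
    low_confidence: List[str]     # experts whose output was set aside
    overall_confidence: float     # weakest kept section bounds the whole answer


def compose(results: List[ExpertResult], min_confidence: float = 0.5) -> Composite:
    """Order expert outputs by confidence and set aside the unreliable ones."""
    if not results:
        raise ValueError("at least one expert result is required")
    ordered = sorted(results, key=lambda r: r.confidence, reverse=True)
    kept = [r for r in ordered if r.confidence >= min_confidence]
    flagged = [r.expert for r in ordered if r.confidence < min_confidence]
    # Conservative choice: the composite is only as trustworthy as its weakest part.
    overall = min((r.confidence for r in kept), default=0.0)
    return Composite(sections=kept, low_confidence=flagged, overall_confidence=overall)
```

Taking the minimum confidence is one conservative design choice among several; a deployment might instead weight sections by how much of the query each one covers.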

Routing strategies in practice

Three patterns recur, each trading cost against robustness at the boundary.

Embedding-based classification

The simplest approach: encode the query with a general-purpose embedding model, then classify it against predefined domain clusters. It works well when domains are semantically distinct (medicine versus law) and struggles when they overlap (financial compliance versus financial trading), where the same vocabulary maps to different experts. Its appeal is latency: a single embedding pass and a cheap classifier.
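A minimal sketch of this pattern, assuming the query has already been embedded and each domain is represented by a single centroid vector. The function names and the toy three-dimensional vectors are hypothetical; a real system would use the output of an actual embedding model. The returned margin between the best and second-best domain is one simple way to surface the overlap problem described above.

```python
import math
from typing import Dict, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def classify(
    query_vec: Sequence[float],
    centroids: Dict[str, Sequence[float]],
) -> Tuple[str, float, float]:
    """Return (best domain, its similarity, margin over the runner-up)."""
    if not centroids:
        raise ValueError("at least one domain centroid is required")
    scored = sorted(
        ((cosine(query_vec, c), name) for name, c in centroids.items()),
        reverse=True,
    )
    best_sim, best_name = scored[0]
    runner_up = scored[1][0] if len(scored) > 1 else -1.0
    return best_name, best_sim, best_sim - runner_up
```

A small margin signals that the query sits between domains, which is a natural trigger for a slower second look rather than a confident single-expert route.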

Cascading routers

A two-stage design for ambiguous queries. A coarse first router makes a fast classification. If its confidence falls below a threshold, a second, more expensive stage (sometimes a small LLM examining intent) takes a closer look. This spends extra latency only on the uncertain tail, where naive routers are least reliable.
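The control flow of a cascade is short enough to sketch directly. The version below assumes each stage is a callable that returns a label and a confidence; `cascade_route`, the `Stage` alias, and the default threshold are hypothetical names and values chosen for illustration.

```python
from typing import Callable, Tuple

# A stage takes the query text and returns (label, confidence in [0, 1]).
Stage = Callable[[str], Tuple[str, float]]


def cascade_route(
    query: str,
    coarse: Stage,
    fine: Stage,
    threshold: float = 0.8,  # assumed cut-off, tune on held-out traffic
) -> Tuple[str, float, str]:
    """Use the cheap stage when it is confident, otherwise fall back to the costly one."""
    label, confidence = coarse(query)
    if confidence >= threshold:
        return label, confidence, "coarse"
    # Only the uncertain tail pays for the second stage.
    label, confidence = fine(query)
    return label, confidence, "fine"
```

Returning which stage answered makes it easy to measure what fraction of traffic reaches the expensive path, which is the number that determines whether the cascade pays for itself.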

Multi-expert activation

When a query is compound, the router activates several experts and hands their outputs to the merge layer. It is the most expensive pattern, so the router should be conservative about invoking it: each extra expert adds compute cost roughly linearly, and adds latency as well unless the experts run in parallel. The discipline is to reserve multi-activation for queries that a single expert demonstrably cannot cover, not for every query that looks slightly ambiguous.
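One way to encode that conservatism is to always activate the top expert and admit additional ones only when they clear a separate threshold, up to a hard cap. This is a sketch under assumptions: the router is taken to output a probability per expert, and `select_experts`, `secondary_threshold`, and `max_experts` are hypothetical names with illustrative defaults.

```python
from typing import Dict, List


def select_experts(
    probs: Dict[str, float],
    secondary_threshold: float = 0.3,  # assumed bar for any additional expert
    max_experts: int = 2,              # assumed hard cap on fan-out
) -> List[str]:
    """Pick the top expert, plus extras only when they are clearly relevant."""
    if not probs:
        raise ValueError("probs must not be empty")
    if max_experts < 1:
        raise ValueError("max_experts must be at least 1")
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    chosen = [ranked[0][0]]  # the best expert is always activated
    for name, p in ranked[1:]:
        if len(chosen) >= max_experts or p < secondary_threshold:
            break  # ranked is sorted, so nothing later can qualify
        chosen.append(name)
    return chosen
```

For example, `select_experts({"fraud": 0.55, "compliance": 0.40, "telemetry": 0.05})` activates `fraud` and `compliance`, while `select_experts({"fraud": 0.9, "compliance": 0.08, "telemetry": 0.02})` stays with `fraud` alone.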

Production considerations

A composed system introduces operational complexity that a single model avoids. For accuracy-critical applications the trade is usually worth it, provided teams plan for the operational consequences below, some of which are costs and some of which are benefits.

Versioning becomes multi-dimensional. Each head has its own version and training-data lineage; the trunk has its own; the router has its own. A "system version" is a specific combination of all of them, which demands disciplined release tooling.
Monitoring is per-expert. A regression in one head does not implicate the others. The blast radius is contained, but the price is per-domain dashboards and alerting.
Adding a domain is modular. You train a new head on the existing trunk, register it with the router, and deploy. The other heads are untouched, which is simpler than retraining a monolith to absorb a new capability.
Cost scales with usage, not capacity. Low-traffic domains can be served on smaller infrastructure or cold-started on demand, whereas a monolith must be provisioned for peak aggregate load at all times.
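The versioning point can be illustrated with a small manifest that pins every component and derives one identifier from the combination. This is a sketch, not a description of any particular release tool: `SystemManifest`, its fields, and the version strings in the usage note are hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class SystemManifest:
    trunk: str             # version of the shared trunk
    router: str            # version of the router
    heads: Dict[str, str]  # head name -> head version

    def system_version(self) -> str:
        """Deterministic short ID for this exact combination of components."""
        payload = json.dumps(
            {"trunk": self.trunk, "router": self.router, "heads": self.heads},
            sort_keys=True,  # key order must not affect the ID
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]

    def with_head(self, name: str, version: str) -> "SystemManifest":
        """Return a new manifest with one head added or upgraded."""
        heads = dict(self.heads)
        heads[name] = version
        return SystemManifest(trunk=self.trunk, router=self.router, heads=heads)
```

Starting from `SystemManifest("trunk-1", "router-3", {"risk": "1.2"})`, calling `with_head("telemetry", "0.1")` yields a new manifest with a different `system_version()` while leaving the original untouched, which mirrors the modular "add a domain" workflow described above.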

What this means for how we build

The interlocking view reframes the question Auxerta cares about most. Our work is on post-transformer architectures: sub-quadratic sequence models whose trunk holds representations that stay useful beyond next-token prediction. Composition is what that trunk is for. A trunk that has learned transferable structure makes good heads cheap to attach; a trunk that has merely memorized surface statistics makes every head an uphill fight.

So the bottleneck we optimize against is not model count or serving plumbing, which commoditize. It is the quality of the shared representation and the stability of the interfaces along which heads attach, route, and merge. Get the trunk right and the compound system on top of it becomes an exercise in composition rather than a series of expensive retrainings. That connects our pretraining work to the specialized models we ship through Neognathae: train one trunk well, then attach the experts that each domain needs.