How does multi-model routing interact with fallback?

Multi-model routing is the substrate; fallback is the retry pattern on top of it. See the llm-fallback-routing piece for the trigger and retry semantics. The multi-model deployment usually has fallback wired by default because the whole point of the pattern is provider resilience.

Do I need a BAA with every provider I route to?

Only for providers that will receive PHI-classified requests. The authorized-endpoint set constrains PHI routing to BAA-covered endpoints. Non-PHI requests can route to non-BAA endpoints. Deployers running mixed PHI and non-PHI traffic need at least one BAA-covered upstream and can supplement with non-BAA upstreams for the rest of the workload.

Which providers are hardest to normalize?

Bedrock is often the hardest because it exposes multiple model families (Anthropic, Cohere, Mistral, Meta) under one API surface, each with its own request and response shape. Deployers that treat Bedrock as a single upstream end up with per-model-family normalization inside a single provider. Azure OpenAI is usually the lightest to normalize because its API is close to native OpenAI.

What about Vertex, Together, Fireworks, Groq?

The invariant-and-variance model applies. Vertex has its own tool-call schema and streaming format. Together, Fireworks, and Groq offer OpenAI-compatible API surfaces for most models, which reduces normalization overhead. The deployer decides how many providers to onboard based on cost, availability, and normalization overhead.

Can I run the router without the gateway?

The router runs without the gateway in most reference implementations; the deployer then owns the compliance gap. See the llm-gateway-vs-llm-router piece for the split. Multi-model routing without a gateway in front produces the highest-variance failure mode because the six invariants have to be reimplemented per provider by the calling application.

How do I test the invariants hold across providers?

Contract tests. For each invariant, write a test that asserts the invariant against the router's response. Identity resolves to the same natural person regardless of upstream. Classification produces the same tag. Policy produces the same decision. Audit records have the same structure. Tool calls normalize to the same schema. Streaming events arrive in the same envelope. Run the tests against each provider o

LLM multi-model routing: the invariants that hold when you serve traffic from more than one provider

Serving LLM traffic from more than one provider decouples the deployer from a single vendor's pricing, availability, and policy decisions. The architecture is a router in front of two or more upstream endpoints (OpenAI, Anthropic, Bedrock, Azure OpenAI, Vertex, self-hosted vLLM). The router picks the endpoint per request. The invariants below hold regardless of which endpoint the router picks. The variances below require provider-specific handling.

I want to walk through the six invariants (what the deployer controls) and the three variances (what the providers control).

Invariant one: identity resolution

The caller identity resolves at the gateway, before the routing decision. The verified identity, role, and authorization context attach to the request as metadata. This invariant is deployer-controlled because the credential and the IdP live in the deployer's environment, not the provider's. Every provider sees the request as coming from the deployer's outbound identity; no provider sees the natural-person identity behind it.

The compliance implication is that the deployer owns the identity binding for the audit record. Provider logs record the deployer's API key; the deployer's audit record records the natural person. See the llm-inference-gateway piece for the resolution sequence.

Invariant two: data classification

The prompt is classified before it leaves the gateway. The classification result attaches to the request context and constrains which upstream endpoints the router can pick. A PHI classification limits the authorized set to BAA-covered endpoints; a PCI classification limits it to PCI-scoped endpoints; an internal-restricted classification limits it to self-hosted endpoints.

This invariant holds across providers because the classification runs on the prompt content, which is provider-agnostic. The invariant is what prevents multi-model routing from becoming a data-leakage pattern where the router silently sends regulated data to whichever endpoint is cheapest.
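The classification-to-endpoint constraint can be sketched as a simple filter. This is a minimal illustration, not a reference implementation: the endpoint names, the compliance flags, and the `authorized_endpoints` helper are all hypothetical, and the flags say nothing about any real provider's contractual coverage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    name: str
    baa_covered: bool = False
    pci_scoped: bool = False
    self_hosted: bool = False


# Hypothetical inventory; the flags are illustrative only.
ENDPOINTS = [
    Endpoint("provider-a", baa_covered=True),
    Endpoint("provider-b", pci_scoped=True),
    Endpoint("provider-c"),
    Endpoint("vllm-internal", baa_covered=True, pci_scoped=True, self_hosted=True),
]

# Each classification tag maps to the predicate an endpoint must satisfy.
REQUIREMENTS = {
    "phi": lambda e: e.baa_covered,
    "pci": lambda e: e.pci_scoped,
    "internal-restricted": lambda e: e.self_hosted,
    "public": lambda e: True,
}


def authorized_endpoints(classification: str, endpoints=ENDPOINTS) -> list[str]:
    """Return the endpoint names the router may choose from for this tag."""
    requirement = REQUIREMENTS.get(classification)
    if requirement is None:
        # Fail closed: an unknown classification authorizes nothing.
        return []
    return [e.name for e in endpoints if requirement(e)]
```

The filter never looks at price or latency, so cost-based selection can only happen inside the set that the classification already allows.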

Invariant three: policy evaluation

The policy engine evaluates the request against per-route, per-role rules. Policy rules reference identity, role, classification, and request metadata. Policy decisions are provider-agnostic; the same rule set applies whether the router picks OpenAI, Anthropic, or Bedrock. The permit decision carries the authorized-endpoint set to the router.
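One way to picture a provider-agnostic permit decision is a lookup keyed on route, role, and classification. The sketch below assumes a hypothetical rule table, policy version string, and `evaluate` function; a real policy engine would be far richer. The point is only that the request carries no provider field, so the decision cannot vary by upstream.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    route: str
    role: str
    classification: str


@dataclass(frozen=True)
class Decision:
    permit: bool
    policy_version: str
    authorized_endpoints: frozenset = frozenset()


POLICY_VERSION = "example-v1"  # hypothetical version label

# Hypothetical rules: (route, role, classification) -> authorized endpoint set.
RULES = {
    ("/chat", "clinician", "phi"): frozenset({"provider-a", "vllm-internal"}),
    ("/chat", "analyst", "public"): frozenset({"provider-a", "provider-c"}),
}


def evaluate(req: Request) -> Decision:
    """Deny by default; a permit carries the authorized-endpoint set."""
    allowed = RULES.get((req.route, req.role, req.classification))
    if not allowed:
        return Decision(permit=False, policy_version=POLICY_VERSION)
    return Decision(True, POLICY_VERSION, allowed)
```

The router then picks from `authorized_endpoints` and nothing else.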

Invariant four: per-decision audit record

Every request produces an audit record at the gateway. The record contains identity, role, policy version, classification, decision outcome, and timestamp. Provider-specific fields (chosen endpoint, model version, token count) attach as metadata but do not change the record's structure. The record is signed and committed before the response returns to the caller.
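A rough sketch of a fixed-structure, signed record follows. It assumes an HMAC over canonical JSON as a stand-in for whatever signature scheme the deployer actually uses, and a hard-coded demo key where a real deployment would use managed key material. The field names and helper functions are hypothetical.

```python
import hashlib
import hmac
import json

SIGNING_KEY = b"demo-key-do-not-use"  # assumption: real keys come from a KMS or HSM


def build_audit_record(identity, role, policy_version, classification,
                       outcome, timestamp, provider_metadata=None):
    """Core fields are fixed; provider-specific values live under one key."""
    return {
        "identity": identity,
        "role": role,
        "policy_version": policy_version,
        "classification": classification,
        "outcome": outcome,
        "timestamp": timestamp,
        "provider_metadata": provider_metadata or {},
    }


def sign(record: dict, key: bytes = SIGNING_KEY) -> str:
    # Canonical JSON (sorted keys, no extra whitespace) keeps the signature stable.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hmac.new(key, canonical.encode("utf-8"), hashlib.sha256).hexdigest()


def verify(record: dict, signature: str, key: bytes = SIGNING_KEY) -> bool:
    # Constant-time comparison avoids leaking signature prefixes.
    return hmac.compare_digest(sign(record, key), signature)
```

Because the provider details sit under `provider_metadata`, two records for different upstreams have identical top-level keys, and any edit to a signed record makes `verify` return False.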

Invariant five: idempotency for tool calls

Tool calls produced by the LLM carry idempotency keys derived from the request ID. Downstream tool servers reject duplicates. This invariant holds regardless of which provider generated the tool call, because the key is deployer-supplied and the tool server is deployer-controlled.
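As a minimal sketch of the idea, the key can be a hash over the request ID plus enough context to distinguish several tool calls within one request. Including the call index and tool name is an assumption made here for illustration; the `idempotency_key` function and the in-memory `ToolServer` are hypothetical, and a real tool server would keep seen keys in durable storage.

```python
import hashlib


def idempotency_key(request_id: str, call_index: int, tool_name: str) -> str:
    """Derive a deterministic key; no provider-supplied value is involved."""
    material = f"{request_id}:{call_index}:{tool_name}".encode("utf-8")
    return hashlib.sha256(material).hexdigest()


class ToolServer:
    """Toy tool server that rejects a key it has already executed."""

    def __init__(self):
        self._seen = set()  # assumption: in-memory only, for illustration

    def execute(self, key: str, action) -> str:
        if key in self._seen:
            return "duplicate-rejected"
        self._seen.add(key)
        action()
        return "executed"
```

A retry of the same request against a different provider produces the same key for the same call, so the side effect still runs only once.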

Invariant six: response normalization

The gateway normalizes upstream responses into a consistent shape before returning to the caller. Providers return responses in different formats; the caller sees the deployer's normalized format. Normalization covers the response structure, the streaming event shape, and the error-response envelope.

Variance one: token accounting

Providers count tokens differently. OpenAI, Anthropic, Cohere, and the LLaMA-family models each use a different tokenizer, so the same prompt can come to 8K tokens on one provider and 9K on another. Token counts in the audit record should be recorded per provider; a normalized token count derived from a common tokenizer produces misleading cost accounting.

The billing implication

Provider bills are the source of truth for cost. The audit record's token count is a decision-time estimate; the reconciliation to actual cost happens against the provider's invoice. Deployers running multi-model routing need cost-attribution infrastructure that reconciles both sides.
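The reconciliation step might look like the sketch below, which sums decision-time estimates per provider and compares them with invoiced totals. The record fields (`provider`, `estimated_tokens`) and the shape of the invoice input are assumptions for illustration; real invoices are denominated in currency and need a price table on top of this.

```python
from collections import defaultdict


def estimated_tokens_by_provider(audit_records):
    """Sum decision-time estimates, kept separate per provider."""
    totals = defaultdict(int)
    for rec in audit_records:
        totals[rec["provider"]] += rec["estimated_tokens"]
    return dict(totals)


def reconcile(audit_records, invoiced_tokens):
    """Compare estimates with invoiced totals; drift = invoiced - estimated."""
    estimates = estimated_tokens_by_provider(audit_records)
    report = {}
    # Cover providers that appear on only one side of the comparison.
    for provider in sorted(set(estimates) | set(invoiced_tokens)):
        est = estimates.get(provider, 0)
        inv = invoiced_tokens.get(provider, 0)
        report[provider] = {"estimated": est, "invoiced": inv, "drift": inv - est}
    return report
```

Keeping the totals per provider, rather than pooling them, is what lets a drift on one upstream show up without being masked by another.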

Variance two: streaming chunking

Providers stream responses at different granularities. OpenAI streams incremental delta chunks over server-sent events (SSE). Anthropic streams typed events such as message_start, content_block_start, content_block_delta, and message_stop. Bedrock streams provider-specific event envelopes. The gateway's response normalization has to translate all three into a consistent stream shape for the caller.
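A text-only sketch of the translation for two of these formats is shown below. It assumes the SSE payloads have already been parsed into dicts, uses simplified versions of the OpenAI chat-completion chunk and Anthropic event shapes (Anthropic's wire names use underscores, for example content_block_delta), and emits a hypothetical two-event envelope (`text`, `done`) that stands in for the deployer's own stream shape. Bedrock is omitted because its envelope differs per model family.

```python
def normalize_openai_chunk(chunk: dict) -> list:
    """Map one parsed OpenAI-style chunk to zero or more normalized events."""
    events = []
    for choice in chunk.get("choices", []):
        text = (choice.get("delta") or {}).get("content")
        if text:
            events.append({"type": "text", "text": text})
        if choice.get("finish_reason") is not None:
            events.append({"type": "done"})
    return events


def normalize_anthropic_event(event: dict) -> list:
    """Map one parsed Anthropic-style event to zero or more normalized events."""
    kind = event.get("type")
    if kind == "content_block_delta":
        delta = event.get("delta", {})
        if delta.get("type") == "text_delta":
            return [{"type": "text", "text": delta.get("text", "")}]
        return []
    if kind == "message_stop":
        return [{"type": "done"}]
    # message_start, content_block_start, etc. carry no text in this sketch.
    return []


HANDLERS = {"openai": normalize_openai_chunk, "anthropic": normalize_anthropic_event}


def normalize_stream(provider: str, raw_events):
    """Yield normalized events regardless of which upstream produced them."""
    handler = HANDLERS[provider]
    for raw in raw_events:
        yield from handler(raw)
```

Fed the same short completion from either upstream, `normalize_stream` yields the same sequence of `text` events followed by a single `done`.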

The caller compatibility question

Callers that consume the normalized stream do not need to handle the provider variance. Callers that bypass the gateway and hit the provider directly need to handle each provider's stream format. The multi-model deployment usually pushes callers through the gateway to avoid duplicating the normalization logic in every client.

Variance three: tool-call format

Providers describe tool calls differently. OpenAI uses tool_calls arrays in which each entry carries a function name and its arguments as a JSON-encoded string. Anthropic uses tool_use content blocks with a tool name and an already-structured input object. Bedrock uses provider-specific schemas per model family. The gateway normalizes tool calls into a single deployer-controlled schema before the tool call reaches the executor.
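The normalization for the first two formats can be sketched as a pair of small adapters. The input shapes below are simplified versions of an OpenAI assistant message and an Anthropic message, and the target schema (`id`, `name`, `arguments`) is a hypothetical deployer-controlled shape rather than anything either provider defines.

```python
import json


def from_openai(message: dict) -> list:
    """OpenAI-style: arguments arrive as a JSON-encoded string."""
    calls = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        calls.append({
            "id": call["id"],
            "name": fn["name"],
            "arguments": json.loads(fn["arguments"]),
        })
    return calls


def from_anthropic(message: dict) -> list:
    """Anthropic-style: tool_use blocks carry an already-parsed input object."""
    return [
        {"id": block["id"], "name": block["name"], "arguments": block["input"]}
        for block in message.get("content", [])
        if block.get("type") == "tool_use"
    ]
```

Once both adapters emit the same shape, schema validation and authorization checks can be written once against the normalized call instead of once per provider.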

The safety implication

Tool-call normalization is where the schema-validation and authorization checks happen. See the ai-response-tool-call-validation piece for the validation sequence. Without normalization, each provider's tool-call format needs its own validation code, and the deployer ends up with three parallel implementations that drift.

Beyond invariants and variances

Multi-model routing is a control-plane pattern that only pays off if the deployer owns the invariants. Deployers that let each provider define identity, classification, policy, audit, idempotency, and normalization are running parallel single-model deployments that happen to share a router. The multi-model benefit accrues to the deployer that treats the router as a substrate and the invariants as the product.

DeepInspect

This is the architecture DeepInspect was built to provide. DeepInspect is the identity-aware policy enforcement layer that produces the six invariants across every provider a deployer routes to. Every request resolves identity, classification, policy, and audit at the gateway. Every response passes through normalization for stream shape, tool-call schema, and error envelope. The router chooses the upstream endpoint from the authorized set the permit decision produced.

Every decision produces a per-decision audit record with identity, role, policy version, classification, decision outcome, timestamp, and chosen provider. The record is signed and tamper-evident. Multi-model deployments produce a single audit stream regardless of how many providers the router spreads traffic across.

Book a demo today.