LLM routing strategies: five patterns for production, and where the policy decision constrains each one

LLM routing strategies decide which model, provider, or endpoint handles a given request. Five patterns cover most production deployments: static routing, cost-optimized routing, quality-tiered routing, latency-budgeted routing, and fallback routing. Each pattern operates on request metadata after the policy decision at the gateway has authorized the request and produced the audit record. This piece walks through the five patterns, what each optimizes for, and the constraints the gateway places on all of them.

By Parminder Singh · Founder & CEO, DeepInspect Inc.
Platform & Architecture · llm-router · llm-routing · ai-architecture · ai-gateway · multi-model
Every production LLM deployment eventually needs a routing decision that goes beyond "send it to GPT-4." The reasons are usually some combination of cost, quality, latency, and availability. The five patterns below cover most deployments running more than one model in production. Each pattern operates on request metadata after the policy decision at the gateway has authorized the request and produced the per-decision audit record. That order matters: a router that runs before the policy decision can send a request to an endpoint the caller was never authorized to reach.

I want to walk through the five patterns, what each optimizes for, and the constraints the gateway places on all of them.

Pattern one: static routing

Static routing sends every request from a given caller, tenant, or workload to a fixed model. The routing table is a simple lookup: caller X to model Y.

What it optimizes for

Predictability. Cost forecasting. Compliance clarity when the workload has a fixed model requirement (for example, a HIPAA workload that must use a BAA-covered endpoint).

When it fits

Regulated workloads with a single approved model. Cost-controlled workloads with a strict budget per request. Any workload where multi-model variability would produce more operational overhead than benefit.

The gateway constraint

The gateway's permit decision authorizes the caller to reach a specific endpoint set. Static routing works fine when the routing table only contains authorized endpoints. It fails when the routing table encodes endpoints the caller cannot reach; the router should read the authorized-endpoint set from the permit decision rather than a static table.
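
A minimal sketch of that check, assuming the permit decision reaches the router as a small structured object. The `PermitDecision` shape, the `STATIC_TABLE` entries, and the endpoint names are hypothetical and only illustrate the idea: the static table proposes an endpoint, and the permit decision's authorized set has the final say.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PermitDecision:
    """Hypothetical shape of the gateway's permit decision."""
    request_id: str
    authorized_endpoints: frozenset
    policy_version: str


# Hypothetical static table: caller -> fixed endpoint.
STATIC_TABLE = {
    "billing-service": "provider-a/small",
    "clinical-notes": "provider-b/baa-large",
}


def route_static(caller: str, permit: PermitDecision) -> str:
    """Return the caller's fixed endpoint, but only if the permit allows it."""
    endpoint = STATIC_TABLE.get(caller)
    if endpoint is None:
        raise LookupError(f"no static route for caller {caller!r}")
    if endpoint not in permit.authorized_endpoints:
        # The table disagrees with the permit decision: refuse rather than route.
        raise PermissionError(
            f"static route {endpoint!r} is outside the authorized set "
            f"for request {permit.request_id}"
        )
    return endpoint
```

Raising on a mismatch, instead of silently picking another endpoint, keeps static routing predictable: a stale table entry surfaces as an error rather than as an unexpected model.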

Pattern two: cost-optimized routing

Cost-optimized routing scores the request against a cost function and picks the cheapest model that can handle it. Short prompts route to smaller models; long prompts route to larger models; simple prompts route to cheaper models even when large-model budget is available.

What it optimizes for

Dollar cost per request. Total spend under a monthly budget.

When it fits

High-volume workloads where the average request is well below the complexity ceiling. Consumer-facing applications where p95 quality matters more than p99.

The gateway constraint

The cost function usually references prompt length, model family, and tenant tier. It should not reference caller identity or role directly, because those attributes are the gateway's domain. If the cost function needs to reference tenant-level pricing, the tenant tier travels from the permit decision as authorized metadata, not from the caller-supplied request headers.
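
One way to picture such a cost function, as a sketch under stated assumptions: `ModelOption`, `TIER_PRICE_CEILING`, the prices, and the token limits are all invented for illustration. The tenant tier is passed in as a value taken from the permit decision, and no caller identity or role appears anywhere in the function.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelOption:
    endpoint: str
    usd_per_1k_tokens: float  # illustrative price, not a real one
    max_prompt_tokens: int    # largest prompt this model should be given


# Hypothetical per-tier price ceilings; the tier value comes from the permit decision.
TIER_PRICE_CEILING = {"free": 0.5, "enterprise": 10.0}


def cheapest_capable(prompt_tokens, tenant_tier, options, authorized):
    """Pick the cheapest authorized model that can handle the prompt."""
    ceiling = TIER_PRICE_CEILING[tenant_tier]
    candidates = [
        o for o in options
        if o.endpoint in authorized
        and prompt_tokens <= o.max_prompt_tokens
        and o.usd_per_1k_tokens <= ceiling
    ]
    if not candidates:
        return None  # nothing affordable and capable inside the authorized set
    return min(candidates, key=lambda o: o.usd_per_1k_tokens)
```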

Pattern three: quality-tiered routing

Quality-tiered routing classifies the request against a quality-difficulty taxonomy and routes to the smallest model that will produce an acceptable response. Difficult prompts (long context, ambiguous instructions, tool-use chains) route to larger models; simpler prompts route to smaller ones.

What it optimizes for

The quality-cost trade-off. Shrinking the tail of low-quality responses that damage user experience.

When it fits

Production applications where response quality is user-visible and inconsistent quality damages retention. Any workload where a lightweight eval set exists to measure quality regression per model tier.

The gateway constraint

The classifier that produces the difficulty score runs on prompt content. It should not double as the data-classification step; the two classifications answer different questions. The gateway's data classifier tags sensitivity for policy purposes. The router's difficulty classifier tags complexity for routing purposes. Keeping the two separate prevents the routing layer from becoming a policy decision point by accident.
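
A rough sketch of the router-side half of that separation. The `difficulty_score` heuristic, its 400-word threshold, and the tier names are assumptions standing in for a real difficulty classifier. Note what the function does not take as input: the gateway's sensitivity tag. Sensitivity has already shaped the authorized set, and the router only picks within it.

```python
def difficulty_score(prompt: str, uses_tools: bool) -> int:
    """Crude stand-in for a difficulty classifier: 0 = easy, 2 = hard."""
    score = 0
    if len(prompt.split()) > 400:  # assumption: long context counts as harder
        score += 1
    if uses_tools:                 # assumption: tool-use chains count as harder
        score += 1
    return score


# Hypothetical tiers ordered from smallest to largest model.
TIERS = ["tier-small", "tier-medium", "tier-large"]


def route_by_quality(prompt, uses_tools, authorized):
    """Return the smallest authorized tier at or above the needed tier."""
    needed = difficulty_score(prompt, uses_tools)
    for endpoint in TIERS[needed:]:
        if endpoint in authorized:
            return endpoint
    return None  # no authorized tier is large enough
```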

Pattern four: latency-budgeted routing

Latency-budgeted routing routes to the model that will meet the request's stated latency budget. Interactive requests with a 500 ms budget route to fast models; batch requests with a 30-second budget route to larger models.

What it optimizes for

p95 and p99 latency guarantees. User experience in interactive workloads.

When it fits

Applications with mixed interactive and batch traffic. Deployments where the SLA is defined in latency terms.

The gateway constraint

The latency budget travels from the request. The router should validate the budget against the caller's authorized latency class. A caller authorized only for batch workloads should have their interactive-budget requests rejected at the routing layer with a message pointing back to the gateway's policy decision.
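
The validation step might look like the following sketch. The `INTERACTIVE_CUTOFF_MS` threshold, the two class names, and the `size_rank` map are assumptions made for the example; the text above only says that a batch-only caller's interactive-budget request should be rejected with a pointer back to the policy decision.

```python
INTERACTIVE_CUTOFF_MS = 2_000  # assumption: budgets below this count as interactive


class LatencyClassError(PermissionError):
    """The requested budget falls outside the caller's authorized latency class."""


def route_by_latency(budget_ms, authorized_class, p95_ms, authorized, size_rank):
    """Pick the largest authorized model whose p95 latency fits the budget."""
    if budget_ms < INTERACTIVE_CUTOFF_MS and authorized_class != "interactive":
        raise LatencyClassError(
            "interactive budget requested by a batch-only caller; "
            "see the gateway's policy decision"
        )
    # Endpoints with no latency estimate are treated as too slow.
    fits = [e for e in authorized if p95_ms.get(e, float("inf")) <= budget_ms]
    if not fits:
        return None
    return max(fits, key=lambda e: size_rank.get(e, 0))
```

With a 500 ms budget only the fast model fits; with a 30-second budget both fit and the larger one wins, which matches the interactive and batch cases described above.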

Pattern five: fallback routing

Fallback routing defines a chain: primary model, secondary model, tertiary model. When the primary fails or returns an error, the router retries against the secondary.

What it optimizes for

Availability. Recovery time when a provider degrades or returns a rate-limit error.

When it fits

Production workloads where provider outages happen (all of them). Multi-provider deployments where the deployer has commercial relationships with more than one LLM vendor.

The gateway constraint

Every model in the fallback chain must sit inside the authorized-endpoint set the permit decision produced. A fallback that reaches an unauthorized endpoint violates the policy. The router should either skip the unauthorized fallback and continue down the chain, or fail the request and produce an operational log entry indicating that the fallback chain was exhausted within the authorized set. See the llm-fallback-routing piece for the full sequence.
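
The skip-and-continue option can be sketched as below. The `send` and `log` callables are hypothetical hooks supplied by the caller, and the broad `except Exception` is a simplification; a real router would likely distinguish rate limits, timeouts, and non-retryable errors.

```python
def call_with_fallback(chain, authorized, send, log):
    """Try each endpoint in order, skipping any outside the authorized set."""
    for endpoint in chain:
        if endpoint not in authorized:
            log(f"skipped unauthorized fallback {endpoint}")
            continue
        try:
            return endpoint, send(endpoint)
        except Exception as exc:  # simplification: catch narrower errors in practice
            log(f"{endpoint} failed: {exc}")
    message = "fallback chain exhausted within the authorized set"
    log(message)
    raise RuntimeError(message)
```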

What all five patterns share

Every routing pattern reads the permit decision's authorized-endpoint set, respects the classification tag, and honors the policy version. Every pattern produces an operational log entry that ties the routing choice back to the audit record produced at the gateway. When an auditor asks "which model served this specific request," the answer comes from correlating the audit record's request ID with the router's operational log.

The three fields that link the two layers

The gateway attaches three fields to the request before it leaves the policy decision point: the request ID, the authorized-endpoint set, and the policy version. The router records the same request ID against its chosen upstream endpoint. Correlating the two records answers the audit question without requiring either layer to know the other's internal state.
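
As an illustration of that correlation, here is a sketch that joins the two records on the request ID. The dictionary keys (`request_id`, `authorized_endpoints`, `policy_version`, `endpoint`) are assumed field names, not a documented schema.

```python
def served_by(request_id, audit_records, router_log):
    """Join the gateway audit record and the router log entry on request ID."""
    audit = next((a for a in audit_records if a["request_id"] == request_id), None)
    route = next((r for r in router_log if r["request_id"] == request_id), None)
    if audit is None or route is None:
        return None  # one side is missing, so the question cannot be answered
    return {
        "request_id": request_id,
        "policy_version": audit["policy_version"],
        "endpoint": route["endpoint"],
        "within_authorized_set": route["endpoint"] in audit["authorized_endpoints"],
    }
```

Neither record needs the other layer's internals: the gateway never learns how the router scored endpoints, and the router never learns how the policy was evaluated.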

Beyond the five patterns

Advanced deployments compose the patterns. Cost-optimized routing with a quality regression fallback. Latency-budgeted routing with a fallback chain. Static routing for regulated tenants and cost-optimized routing for others. The composition matters less than the invariant: routing runs after the gateway's policy decision and inside the authorized-endpoint set.

Bundled tools like Kong AI Gateway, LiteLLM Proxy, MLflow AI Gateway, and Databricks AI Gateway ship variations of these patterns. See the individual comparison pieces for the trade-offs.

DeepInspect

This is where DeepInspect's architecture matters for routing. DeepInspect is the identity-aware policy enforcement layer that produces the permit decision the router respects. Every request that reaches the router carries the authorized-endpoint set, the classification tag, and the policy version. The router chooses the endpoint within the authorized set. When the classification tag says PHI, the authorized set excludes non-BAA endpoints regardless of what the cost or latency function would prefer.

Every decision produces a per-decision audit record with identity, role, policy version, data classification, decision outcome, and timestamp. The router's operational log ties the routing choice back to the audit record through the request ID. Compliance questions about routing get answered from the audit record; operational questions get answered from the router log.

Frequently asked questions

Can I run a router without a gateway in front of it?

Technically yes. Practically, the router will route to endpoints based on caller-supplied metadata without verifying identity, without evaluating policy, and without producing audit records. Any regulated workload will fail an audit under that topology. Deployments that run a router alone effectively ship the compliance burden to the calling application, which usually does not have the identity context or the audit-write path to satisfy it.

How do I pick between quality-tiered and cost-optimized routing?

Start with the metric that matters most. If the retention risk from low-quality responses exceeds the cost savings from model downgrade, quality-tiered wins. If the cost curve dominates the retention curve, cost-optimized wins. Most deployments end up with a hybrid: cost-optimized for the request classes where quality is uniform, quality-tiered for the classes where quality varies.

Does latency-budgeted routing require synchronous latency measurement?

Yes, at some level. The router needs a rolling estimate of each upstream's current p95 latency to make an informed choice. Passive monitoring (recording upstream latency per completed request) is sufficient for most deployments. Active probing (synthetic requests) helps when the upstream latency shifts faster than the passive sample can track.
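
A minimal sketch of the passive approach, assuming a fixed-size window of recent samples per upstream. The `RollingP95` class and its default window of 200 are illustrative choices; the percentile uses the nearest-rank method.

```python
from collections import deque


class RollingP95:
    """Rolling p95 estimate over the most recent completed requests."""

    def __init__(self, window=200):
        # deque with maxlen drops the oldest sample once the window is full.
        self.samples = deque(maxlen=window)

    def record(self, latency_ms):
        self.samples.append(latency_ms)

    def p95(self):
        if not self.samples:
            return None  # no data yet for this upstream
        ordered = sorted(self.samples)
        # Nearest-rank: ceil(0.95 * n), computed with integers to avoid float error.
        rank = (95 * len(ordered) + 99) // 100
        return ordered[rank - 1]
```

A smaller window tracks shifts faster but makes the estimate noisier, which is the trade-off that active probing is meant to ease.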

How does fallback routing interact with idempotency?

The retry semantics of the fallback depend on whether the primary already produced side effects. For LLM requests that are pure inference (no tool calls, no state changes), retry is safe. For requests that trigger tool calls or write to a database, the router needs the idempotency key the caller supplied. See the ai-response-tool-call-validation piece for the tool-call side.

What about semantic routing (routing based on prompt embedding similarity)?

Semantic routing is a variation of quality-tiered routing where the difficulty classifier is an embedding-similarity function against a labeled reference set. The architectural placement is the same: it runs after the gateway's policy decision and inside the authorized-endpoint set. As with any difficulty classifier, the embedding classifier should not double as the data-classification step.

How do I audit routing decisions?

The audit record produced by the gateway does not need to record the routing choice, because the routing choice is an operational detail rather than a compliance one. The audit record needs to record the request ID, the classification, and the policy version. The router's operational log records the routing choice against the same request ID. Correlating the two records answers routing questions when an operations team needs them.