Is per-request routing too expensive to be practical?

No. The gateway evaluates the routing rule in tens of milliseconds. Model inference takes 500 ms to 5 seconds. The routing overhead is a small fraction of inference time.

Can the routing rule live in the application instead of a gateway?

The application can implement the rule but has limited visibility into identity context, policy context, and audit retention. Application-level routing also fragments the rule across services; a central rule in the gateway is enforceable consistently.

What is the right eval set size?

Five hundred samples is the floor for a workload with a meaningful long tail. Larger eval sets reduce variance, especially on rare task subtypes. The right answer depends on the diversity of the production traffic.
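A rough way to see why eval size matters for the long tail is to count how many samples of a rare subtype a 500-sample eval is expected to contain, and how noisy a success rate measured on that few samples is. The sketch below uses the standard binomial standard error; the 2% traffic share and the 90% success rate are hypothetical numbers chosen for illustration.

```python
import math

def expected_samples(eval_size, subtype_share):
    """Expected number of eval samples that fall in one task subtype."""
    return eval_size * subtype_share

def standard_error(success_rate, n):
    """Binomial standard error of a success rate observed over n samples."""
    return math.sqrt(success_rate * (1 - success_rate) / n)

# Hypothetical: a long-tail subtype that is 2% of production traffic.
n_tail = expected_samples(500, 0.02)

# Noise on a 90% success rate measured on the tail subtype alone,
# compared with the same rate measured on the full 500 samples.
se_tail = standard_error(0.9, n_tail)
se_all = standard_error(0.9, 500)

print(f"tail samples: {n_tail:.0f}")
print(f"standard error on tail: {se_tail:.3f}, on full set: {se_all:.3f}")
```

Under these assumptions the tail subtype gets about ten samples, and its measured success rate is several times noisier than the aggregate. If the tail subtypes matter to the workload, that is an argument for a larger or stratified eval set.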

How often should the routing rule be re-evaluated?

Quarterly at minimum. More often if model providers release new versions, if production traffic shape shifts materially, or if the policy posture changes. Frontier model releases at the major providers happen more often than quarterly; the eval loop should track them.

What about routing across providers, not just across models within a provider?

The gateway pattern supports cross-provider routing. The exit-strategy benefit under regulatory regimes like DORA is real: a workload that can route across providers reduces concentration risk and supports the regulatory mandate to maintain provider substitutability.

Model Routing for Cost: What to Actually Measure Before Switching a Workload from GPT-4 to Haiku

The "save cost by routing to a cheaper model" pattern is everywhere in engineering blogs. The pattern is real. The execution gets sloppy because most posts treat the decision as a token-price comparison and stop there. A workload that worked on GPT-4 and now runs on Haiku, or one that worked on Claude Opus and now routes to a smaller model, can save 80% of the inference bill or quietly produce a quality regression that takes three months to surface as a customer complaint.

I want to walk through the four layers a platform engineer should actually measure before flipping the routing rule. Each layer answers a question the cheap-model substitution risks getting wrong. The fourth layer is the one the cost-optimization posts almost always skip.

Layer 1: token cost

The cheapest layer to measure, and the only one that gets measured by default. Per million input tokens and per million output tokens, the smaller model is cheaper. The cost ratio is real.

The trap at this layer is the assumption that the input-output token mix on the new model matches the mix on the old. It often does not. A smaller model that needs more retry attempts, more reasoning tokens, or more follow-up turns from the user to complete the same task consumes more tokens per task, which erodes the advantage of the lower per-token price.

The right measurement is dollars per completed task at production-realistic distributions. A capture of the current workload's token sequences run against both models with the same task definition is the input. The output is the actual dollar delta, not the headline rate delta.
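A minimal sketch of the dollars-per-completed-task calculation follows. The `TaskRun` structure, the token counts, and the prices are all hypothetical; real inputs would come from a replay of captured production traffic against each model, with retries included in the token totals.

```python
from dataclasses import dataclass

@dataclass
class TaskRun:
    """One replayed task. Token counts include every retry attempt."""
    input_tokens: int
    output_tokens: int
    completed: bool

def dollars_per_completed_task(runs, input_price_per_m, output_price_per_m):
    """Total spend across all runs divided by the number of completed tasks."""
    total = sum(
        r.input_tokens * input_price_per_m / 1_000_000
        + r.output_tokens * output_price_per_m / 1_000_000
        for r in runs
    )
    completed = sum(1 for r in runs if r.completed)
    if completed == 0:
        raise ValueError("no completed tasks; cost per completed task is undefined")
    return total / completed

# Hypothetical replay results and prices (dollars per million tokens).
large_runs = [TaskRun(1000, 500, True)] * 10
small_runs = [TaskRun(2500, 1200, True)] * 8 + [TaskRun(2500, 1200, False)] * 2

large_cost = dollars_per_completed_task(large_runs, 10.0, 30.0)
small_cost = dollars_per_completed_task(small_runs, 1.0, 5.0)

print(f"large: ${large_cost:.6f} per completed task")
print(f"small: ${small_cost:.6f} per completed task")
```

In this made-up case the input price differs by a factor of ten, but the smaller model uses more tokens per task and fails two tasks in ten, so the cost per completed task differs by a factor closer to 2.4. Failed tasks still cost money, which is why they stay in the numerator.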

Layer 2: quality regression

The harder layer. Token costs are a number. Quality is a distribution.

The measurement requires an eval set: at least a few hundred labeled inputs (500 is a practical floor) that represent the production task distribution. The labels are the answers the team has already validated as correct for the current workload. Each candidate routing rule (current model, smaller model, mixed) runs against the eval set. The metric is task success rate, with each task subtype weighted by its production frequency.

Quality regression on smaller models follows patterns. The smaller model often handles common cases as well as or better than the larger one. The long-tail cases (rare formats, edge entities, ambiguous instructions, multi-step reasoning) degrade. The degradation is invisible in a small sample because long-tail cases appear only a handful of times in it, if at all. A 500-sample eval is the minimum that surfaces the patterns.

Quality regression in production also varies by user segment. The smaller model performs differently on prompts written by a power user than on prompts from a new user. The eval set should therefore reflect the production mix of user segments.

The right output of this layer is a "where is the new routing safe" map. Some task subtypes are safe to route. Others are not. The map drives the routing rule.
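One way to build such a map is to compare per-subtype success rates for the two models and mark a subtype as safe only when the smaller model stays within a tolerated regression. The sketch below assumes each eval sample has been scored for both models; the subtype names and the 2-point threshold are hypothetical.

```python
from collections import defaultdict

def safe_routing_map(results, max_regression=0.02):
    """Build a {subtype: is_safe} map from per-sample eval results.

    results: iterable of (subtype, large_ok, small_ok) tuples.
    A subtype is safe when the small model's success rate is no more
    than max_regression below the large model's.
    """
    counts = defaultdict(lambda: [0, 0, 0])  # [samples, large wins, small wins]
    for subtype, large_ok, small_ok in results:
        c = counts[subtype]
        c[0] += 1
        c[1] += int(large_ok)
        c[2] += int(small_ok)
    return {
        subtype: (large / n - small / n) <= max_regression
        for subtype, (n, large, small) in counts.items()
    }

# Hypothetical eval results: a common subtype and a long-tail subtype.
results = (
    [("faq", True, True)] * 50
    + [("multi_step", True, True)] * 6
    + [("multi_step", True, False)] * 4
)

routing_map = safe_routing_map(results)
print(routing_map)
```

Here the common subtype shows no regression and is marked safe, while the long-tail subtype loses 40 points and is not. With only ten samples in the tail subtype, the verdict is noisy, so a real map would also carry the sample count per subtype.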

Layer 3: latency impact

Smaller models are faster in raw inference. The latency story is more nuanced when the routing rule produces retries or fallbacks.

A routing rule that sends to a smaller model first and falls back to the larger model on low-confidence outputs has two paths. The fast path saves latency. The fallback path runs both inferences in sequence (small model, then large model), so it is always slower than the original large-model baseline.

The right measurement is the latency distribution at percentiles, not the mean. P50 may improve. P95 may degrade if the fallback path is common enough. P99 tail behavior matters for user-facing applications.
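The effect is easy to see with a toy distribution. The sketch below uses a nearest-rank percentile and fixed, hypothetical latencies (400 ms for the small model, 2000 ms for the large model, a 10% fallback rate); real measurements would come from a traffic replay or production traces.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

SMALL_MS = 400    # hypothetical small-model latency
LARGE_MS = 2000   # hypothetical large-model latency

# Baseline: every request goes to the large model.
baseline = [LARGE_MS] * 100

# Routed: 90 requests finish on the small model; 10 fall back and
# pay for the small model plus the large model in sequence.
routed = [SMALL_MS] * 90 + [SMALL_MS + LARGE_MS] * 10

for pct in (50, 95, 99):
    print(f"P{pct}: baseline {percentile(baseline, pct)} ms, "
          f"routed {percentile(routed, pct)} ms")
```

With these numbers P50 drops from 2000 ms to 400 ms, while P95 and P99 rise from 2000 ms to 2400 ms. The mean improves too, which is exactly why the mean hides the tail regression.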

Latency also interacts with budget. A retry-heavy routing rule that improves cost on the median request but blows up on the tail can produce overall savings or overall regressions depending on the request distribution.

Layer 4: governance risk

The layer most cost-optimization posts skip. Different models come with different data handling terms, different default training behavior, and different policy postures.

Three specific risks at this layer:

Data residency and processing terms

The current model's enterprise tier may have residency commitments (data stays in US, EU, or a specific region). The alternative model's commercial tier may not. A routing rule that sends a production workload from a residency-committed endpoint to a commercial endpoint moves data across the residency boundary. The compliance implications depend on the data classification of the workload.

Training data inclusion

The current model's enterprise tier may exclude customer prompts from training. The alternative model's commercial tier may not. Most providers honor opt-out on enterprise contracts; the same provider's commercial API may default to inclusion. In that case the routing decision changes the customer's prompt content from "not training material" to "training material."

Sensitivity tolerance

Smaller and cheaper models often have different safety tuning than their larger siblings. A prompt that the larger model refused or sanitized may pass through the smaller model unmodified. A prompt that the larger model handled appropriately may trigger the smaller model into a low-quality response that includes sensitive content the larger model would have redacted.

The right control at this layer is policy enforcement at the routing decision point, not at the application. The application asks for a task to be done. The policy layer decides which model is appropriate given the data classification, the identity context, and the residency requirements.

Where the routing rule lives

The routing rule has to live somewhere that has all four layers' inputs. The application has the task definition. The provider has the token cost. Neither has the eval data, the policy context, or the audit visibility to make the decision well.

The architectural choice is to route through a gateway that sits on the AI request path. The gateway has identity context, policy context, and the per-request visibility to record the routing decision. The gateway can also enforce constraints: a request tagged as containing financial NPI routes only to a residency-committed model; a request from a power-user identity routes through a higher-quality path; a request that meets cost-optimization criteria and has no sensitivity constraints routes to the cheaper model.

The example rule below is what a gateway-level routing policy looks like.
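The sketch here is a hypothetical illustration of such a rule, written as a plain Python function. It is not the syntax of any particular gateway product; the model names, tags, tiers, and the `SAFE_SUBTYPES` set are all assumptions made for the example. Each decision carries the fields an audit record would need: the model chosen, the reason, and the policy that applied.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Request:
    identity_tier: str     # hypothetical values: "power_user", "standard"
    data_tags: frozenset   # hypothetical tags, e.g. {"financial_npi"}
    task_subtype: str

@dataclass(frozen=True)
class Decision:
    model: str    # where the request is sent
    reason: str   # audit field: why it was routed there
    policy: str   # audit field: which policy applied

# Hypothetical: subtypes the quality eval cleared for the cheaper model.
SAFE_SUBTYPES = {"faq", "summarize"}

def route(req):
    """Apply governance constraints first, then quality, then cost."""
    if "financial_npi" in req.data_tags:
        return Decision("residency-committed-model",
                        "request tagged financial_npi", "residency")
    if req.identity_tier == "power_user":
        return Decision("large-model", "power-user identity", "quality")
    if req.task_subtype in SAFE_SUBTYPES and not req.data_tags:
        return Decision("small-model",
                        "subtype cleared for cheaper model, no sensitivity tags",
                        "cost")
    return Decision("large-model", "subtype not cleared for cheaper model", "default")

decision = route(Request("standard", frozenset(), "faq"))
print(decision)
```

The ordering is the design choice that matters: residency and sensitivity checks run before any cost branch, so a request can never reach the cheaper model by matching the cost criteria alone.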

The rule mixes cost optimization with the governance constraints. The audit fields make the routing decisions reviewable.

What to measure post-deployment

A routing rule that passed the pre-deployment evaluation can still produce surprises in production. The post-deployment measurement loop closes the gap.

Three metrics belong on the routing dashboard.

Effective cost per completed task. The dollar delta against the pre-routing baseline. A negative delta means the rule is producing the expected savings. A positive delta means retries or fallbacks are eating the savings.

Quality regression rate by task subtype. The eval is re-run weekly against fresh production samples. Subtypes that degrade by more than the agreed threshold trigger a routing-rule review.

Governance exceptions. The count of requests blocked or re-routed for residency, training-data inclusion, or sensitivity reasons. A nonzero rate confirms the policy layer is working; a sudden change is a signal that traffic shape has shifted.

Compliance posture

For workloads in regulated industries, the routing decision has audit weight. EU AI Act Article 12 requires high-risk AI systems to support automatic event logging; a per-decision record of the model used belongs in that log. NIST AI RMF MEASURE asks for tracked performance against benchmarks. ISO/IEC 42001 expects documented operational planning. A routing rule that flips silently between models without an audit trail falls short of all three.

The audit fields the gateway writes per request answer the regulatory questions. Which model handled this decision? Why was it routed there? What policy applied? The answers are in the per-decision record, not in the application's billing dashboard.

DeepInspect

This is exactly what DeepInspect does. DeepInspect sits inline between authenticated users or agents on one side and any HTTP-based LLM endpoint on the other. The routing rule described above is the literal policy primitive the product supports. The audit record per request captures the routing decision and the inputs to it. Because the product is model-agnostic, the rule can route across providers, not just across models within a single provider.

If you are evaluating a cost-optimization model-routing project and want to see the gateway-level routing pattern applied with the four-layer rigor, book a demo today.