How many fallback attempts should the router try?

Two or three total attempts (primary plus one or two fallbacks) usually cover provider outages and rate-limit spikes. More than three adds latency and complicates idempotency reasoning. Deployers running batch workloads sometimes accept longer chains because the latency cost is lower.

What happens if all fallback endpoints are outside the authorized set?

The router fails the request. The audit record shows the deny outcome with a reason field indicating that no authorized endpoint was reachable. The caller receives an error response. The fix is for the operations team to add authorized endpoints to the fallback chain; the router never bypasses the authorization.

Should fallback trigger on high latency?

Latency-triggered fallback works for interactive workloads with a strict SLA. It requires an active health signal on each upstream (for example, rolling p95 latency) and a threshold that triggers the fallback before the caller's own timeout fires. The signal comes from the router's operational monitoring. The authorized-endpoint constraint still applies.
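A rolling p95 signal can be sketched in a few lines. This is a minimal illustration, not a production health checker: the class name RollingP95, the should_fall_back helper, the 100-sample window, and the headroom fraction of 0.5 are all assumptions chosen for the example.

```python
from collections import deque
from typing import Optional
import math


class RollingP95:
    """Rolling p95 over the most recent `window` latency samples (hypothetical helper)."""

    def __init__(self, window: int = 100) -> None:
        # deque with maxlen drops the oldest sample automatically.
        self.samples = deque(maxlen=window)

    def record(self, latency_ms: float) -> None:
        self.samples.append(latency_ms)

    def p95(self) -> Optional[float]:
        if not self.samples:
            return None
        ordered = sorted(self.samples)
        # Nearest-rank percentile: smallest sample with at least 95% of samples at or below it.
        rank = math.ceil(0.95 * len(ordered))
        return ordered[rank - 1]


def should_fall_back(signal: RollingP95, caller_timeout_ms: float, headroom: float = 0.5) -> bool:
    """Fall back once p95 exceeds a fraction of the caller's timeout (assumed policy)."""
    p95 = signal.p95()
    return p95 is not None and p95 > caller_timeout_ms * headroom
```

Setting the threshold as a fraction of the caller's timeout leaves time for the secondary attempt to finish before the caller gives up. The right fraction depends on the secondary's own latency.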

How does streaming interact with fallback?

Streaming complicates fallback. A router that streams the primary response to the caller cannot cleanly retry against the secondary if the primary fails mid-stream. There are two common approaches: buffer the primary stream up to a threshold before sending anything to the caller, or send the primary stream immediately and issue a fresh secondary request if the primary breaks. The choice depends on how much perceived latency the caller can tolerate.

Does fallback bypass rate limits?

Fallback to a different provider sidesteps the primary provider's rate limit but does not bypass the deployer's own per-caller or per-tenant rate limits. The gateway's rate limit applies regardless of which upstream the router chooses. Deployers sometimes reason that fallback increases capacity; it does, at the provider layer, but not at the policy layer.

Can I fall back to a self-hosted model?

Yes, provided the self-hosted endpoint sits inside the authorized-endpoint set for the request's classification. Self-hosted models are common tertiary fallbacks for cost and availability. The BAA question does not apply because the deployer owns both sides of the connection, but data-residency rules and internal segmentation policies still apply.

LLM fallback routing: the retry chain that survives provider outages without leaking policy

An LLM fallback routing chain retries a failed request against a secondary model when the primary returns an error, times out, or trips a rate limit. The pattern shows up in nearly every production deployment running more than one provider, and it produces a specific class of compliance bug: a fallback that reaches an endpoint the caller was not authorized to reach. The bug is invisible in operational metrics because the request succeeded from the caller's perspective. It shows up in the audit record, at exactly the moment a regulator asks which model handled a specific request.

I want to walk through the four common triggers for fallback, the retry semantics per trigger, the authorized-endpoint constraint the gateway imposes, and the idempotency requirements for tool-calling workloads.

The four common triggers

Not every upstream error is a fallback candidate. Distinguishing the triggers determines whether retry is safe and which fallback endpoint the router selects.

Trigger one: provider outage

The primary endpoint returns 5xx, times out, or returns a connection error. The provider is genuinely down. Fallback to a different provider or region is the correct response. Retry semantics: safe. The primary produced no output, so the router can send the same request to the secondary without duplicate-side-effect risk.

Trigger two: rate limit

The primary returns 429 with a Retry-After header. The provider is up but the deployer has exhausted the current quota. Fallback to a secondary endpoint (or waiting for the retry window) is the correct response. Retry semantics: safe. The primary produced no output.

Trigger three: model degradation

The primary returns a response, but the response fails a downstream quality check (schema validation failure, content-policy trigger, output too short, output too long, or a hallucinated tool call). Fallback to a different model with different quality characteristics is the correct response. Retry semantics: complicated. The primary produced output that the caller might have already partially consumed if the response was streamed.

Trigger four: policy denial

The primary endpoint returns a content-policy refusal ("I can't help with that"). Fallback to a different model is not the correct response most of the time; it is an attempted policy circumvention. The router should treat policy denials as terminal, not as fallback candidates. This trigger is a common source of unintended compliance drift when the router treats "I can't help with that" the same as a 500 error.
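The four triggers can be sketched as a small classifier. This is an illustration under stated assumptions: the names Trigger, classify, and may_fall_back are hypothetical, and the refused and quality_ok flags stand in for whatever refusal detector and quality check the deployer actually runs.

```python
from enum import Enum
from typing import Optional


class Trigger(Enum):
    PROVIDER_OUTAGE = "provider_outage"
    RATE_LIMIT = "rate_limit"
    MODEL_DEGRADATION = "model_degradation"
    POLICY_DENIAL = "policy_denial"


# Policy denials are deliberately absent: they are terminal, not fallback candidates.
FALLBACK_ELIGIBLE = {Trigger.PROVIDER_OUTAGE, Trigger.RATE_LIMIT, Trigger.MODEL_DEGRADATION}


def classify(status: Optional[int], refused: bool = False, quality_ok: bool = True) -> Optional[Trigger]:
    """Map an upstream result to a trigger.

    status is the HTTP status code, or None for a timeout or connection error.
    refused and quality_ok are assumed outputs of downstream checks.
    """
    if status is None or 500 <= status <= 599:
        return Trigger.PROVIDER_OUTAGE
    if status == 429:
        return Trigger.RATE_LIMIT
    if refused:
        return Trigger.POLICY_DENIAL
    if not quality_ok:
        return Trigger.MODEL_DEGRADATION
    return None  # healthy response, nothing to do


def may_fall_back(trigger: Optional[Trigger]) -> bool:
    return trigger in FALLBACK_ELIGIBLE
```

Checking refused before quality_ok matters: a refusal would usually also fail a schema check, and classifying it as degradation would let the router retry around the policy.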

Retry semantics per trigger

The retry decision depends on whether the primary produced output before failing.

No-output triggers: retry is safe

Provider outage and rate limit produce no output. The router retries the same request against the secondary. The audit record captures both the primary failure and the secondary success under the same request ID.

Partial-output triggers: retry requires idempotency handling

Model degradation can happen after the primary has streamed some tokens to the caller. A router that retries after partial output has two options: buffer the primary output and discard it on failure (increasing perceived latency), or let the caller see a truncated primary response followed by a full secondary response. Neither behavior is uniformly correct. The right answer depends on the workload: interactive chat usually buffers, batch generation usually retries from scratch.
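The buffering option might look like the following generator. It is a sketch, assuming the primary and secondary are plain callables that return chunk iterables; the function name and the buffer_chunks parameter are hypothetical, and a real router would stream asynchronously and catch only fallback-eligible errors rather than every exception.

```python
from typing import Callable, Iterable, Iterator, List


def stream_with_buffered_fallback(
    primary: Callable[[], Iterable[str]],
    secondary: Callable[[], Iterable[str]],
    buffer_chunks: int = 8,
) -> Iterator[str]:
    """Hold back the first `buffer_chunks` chunks so an early primary failure stays hidden."""
    buffered: List[str] = []
    committed = False
    try:
        for chunk in primary():
            if committed:
                yield chunk
                continue
            buffered.append(chunk)
            if len(buffered) >= buffer_chunks:
                # Threshold reached: release the buffer and stream the rest directly.
                committed = True
                yield from buffered
        if not committed:
            # Short response that completed inside the buffer.
            yield from buffered
    except Exception:
        if committed:
            # The caller already saw primary output; no clean retry is possible here.
            raise
        # Nothing was sent yet, so discard the buffer and start the secondary fresh.
        yield from secondary()
```

A larger buffer hides more failures but delays the first token, which is the perceived-latency cost described above.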

Tool-call triggers: retry requires the caller's idempotency key

If the primary already produced a tool call that the caller executed, retry produces duplicate side effects. The classic example: the primary calls create_order, the caller executes the order, the response fails a schema check, the router retries against the secondary, the secondary calls create_order again. The customer sees two orders. The mitigation is caller-supplied idempotency keys attached to tool calls, so downstream tool servers reject the duplicate.

The authorized-endpoint constraint

The gateway's permit decision authorizes the request to reach a specific set of upstream endpoints. Every model in the fallback chain must sit inside that authorized set.

Why the constraint matters

The permit decision often depends on the data classification of the prompt. A PHI-classified prompt is authorized to reach BAA-covered endpoints and forbidden to reach non-BAA endpoints. A fallback chain that includes a non-BAA endpoint as the tertiary fallback will, under a primary and secondary failure, silently route PHI to a non-BAA endpoint. The compliance violation is real even if the caller never notices.

The router's implementation

The router reads the authorized-endpoint set from the permit decision at request time. For each candidate endpoint in the fallback chain, the router checks membership in the authorized set. Endpoints outside the set are skipped. If the fallback chain is exhausted within the authorized set, the router fails the request and produces an operational log entry indicating that no authorized fallback succeeded.
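The membership check described above can be sketched as a loop over the chain. The names route, send, and NoAuthorizedFallback are assumptions for illustration, and endpoints are represented as plain strings. Catching every exception is a simplification; a real router would fall back only on eligible triggers.

```python
from typing import Callable, Iterable, List, Set


class NoAuthorizedFallback(Exception):
    """Raised when no authorized endpoint in the chain produced a response."""


def route(
    chain: Iterable[str],
    authorized: Set[str],
    send: Callable[[str], str],
    log: List[str],
) -> str:
    """Try each endpoint in order, skipping any that the permit decision did not authorize."""
    for endpoint in chain:
        if endpoint not in authorized:
            log.append(f"skip {endpoint}: not in authorized set")
            continue
        try:
            return send(endpoint)
        except Exception as exc:  # simplification: treat any failure as fallback-eligible
            log.append(f"fail {endpoint}: {exc}")
    log.append("deny: no authorized fallback succeeded")
    raise NoAuthorizedFallback("no authorized fallback succeeded")
```

The authorization check runs before the send, so an unauthorized endpoint never receives the prompt even when it is the only healthy one.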

The failure mode this prevents

Without the constraint, the router routes based on health, cost, or latency signals that have no visibility into the policy. A rate-limited primary triggers a fallback to whichever endpoint the router thinks is healthiest. The healthy endpoint might be a third-party endpoint that has no BAA. The audit record shows a PHI prompt served by a non-BAA endpoint. The regulator asks why.

Idempotency for tool-calling workloads

Tool-calling workloads add a second dimension to fallback. The tool call sits inside the LLM response and executes against the deployer's own systems. Retry semantics for the LLM alone do not cover retry semantics for the tool call.

The idempotency key pattern

Every tool call the LLM produces carries an idempotency key derived from the request ID plus a per-tool nonce. The downstream tool server (order-management, ticketing, database) rejects duplicate keys. When the router retries after a partial LLM failure, the second LLM might produce the same tool call with the same key, and the tool server rejects it. The behavior is deterministic: at most one execution per idempotency key.
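On the tool-server side, duplicate rejection can be sketched as a set of seen keys. The class and exception names are hypothetical, and the in-memory set is an assumption for the example; a real tool server would need a durable store with an atomic insert so that the guarantee survives restarts and concurrent requests.

```python
from typing import Any, Callable, Dict


class DuplicateToolCall(Exception):
    """Raised when an idempotency key has already been used."""


class IdempotentToolServer:
    def __init__(self, handler: Callable[[Dict[str, Any]], Any]) -> None:
        self._handler = handler
        self._seen = set()  # in-memory only; illustration, not durable

    def execute(self, idempotency_key: str, args: Dict[str, Any]) -> Any:
        if idempotency_key in self._seen:
            raise DuplicateToolCall(idempotency_key)
        # Record the key before running the handler: at most one execution per key,
        # even if the handler itself fails partway through.
        self._seen.add(idempotency_key)
        return self._handler(args)
```

Recording the key before execution gives at-most-once semantics. A server that wants to allow retries after a failed handler would need to track completion state as well.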

The generation constraint

The LLM does not always generate the same idempotency key on retry. To make the key deterministic, the router or the gateway attaches the key to the request context before the LLM sees the prompt, and the tool-call generation template references the context key rather than generating a fresh one. This requires cooperation from the tool-call prompt template.
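One way to make the key deterministic is to derive it outside the model from values the router already knows. The sketch below is an assumption about how that could work: the function name is hypothetical, and it treats the tool name plus the call's ordinal position in the response as the per-tool nonce, which the text does not specify.

```python
import hashlib


def idempotency_key(request_id: str, tool_name: str, call_index: int) -> str:
    """Derive a stable key from the request ID and an assumed per-tool nonce.

    The same inputs always yield the same key, so a retry against a
    different model reproduces the key instead of inventing a new one.
    """
    material = f"{request_id}:{tool_name}:{call_index}".encode("utf-8")
    return hashlib.sha256(material).hexdigest()[:32]
```

Because the key depends only on the request ID, the tool name, and the position, it stays the same whichever model in the chain produces the call. It does not detect a retry that produces the same tool call with different arguments; whether that should count as a duplicate is a policy choice for the tool server.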

The tool-server constraint

Tool servers that do not implement idempotency-key rejection accept duplicate calls silently. The router's retry becomes the tool server's duplicate execution. Deployers running fallback with tool-calling workloads should audit their tool servers for idempotency support before turning on fallback.

Beyond the router

Fallback routing is one piece of a broader reliability story. Circuit breakers, request buffering, provider health scoring, and graceful degradation all interact with the fallback chain. The sanity checks are the same for each piece: does the fallback endpoint sit inside the authorized-endpoint set, do the retry semantics match the trigger, and are downstream tool servers idempotent?

DeepInspect

This is where DeepInspect's architecture matters for fallback. DeepInspect produces the permit decision that carries the authorized-endpoint set and the classification tag. Every fallback attempt the router makes reads that set. A rate-limited primary triggers a fallback within the set; a policy-denied primary terminates. When the fallback chain is exhausted within the authorized set, the router fails cleanly.

Every fallback attempt produces a per-decision audit record with the request ID, the authorized-endpoint set, the endpoint attempted, the trigger, and the outcome. When a regulator asks which model served a specific request, the audit record answers with the endpoint that succeeded and the fallback chain that was attempted. The evidence is admissible and does not depend on the router's operational log.
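The shape of such a record can be sketched as a small data class. This is illustrative only and is not DeepInspect's actual schema: the class name, field names, outcome strings, and the two query helpers are assumptions built from the fields listed above.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class FallbackAuditRecord:
    """One record per attempt; all attempts for a request share its request_id."""
    request_id: str
    authorized_endpoints: Tuple[str, ...]
    endpoint_attempted: str
    trigger: Optional[str]  # assumed: trigger observed on this attempt, None on success
    outcome: str            # assumed values: "success" or "failed"


def served_by(records: Iterable[FallbackAuditRecord], request_id: str) -> Optional[str]:
    """Return the endpoint that succeeded for this request, if any."""
    for record in records:
        if record.request_id == request_id and record.outcome == "success":
            return record.endpoint_attempted
    return None


def attempted_chain(records: Iterable[FallbackAuditRecord], request_id: str) -> List[str]:
    """Return every endpoint attempted for this request, in recorded order."""
    return [r.endpoint_attempted for r in records if r.request_id == request_id]
```

With records shaped like this, the regulator's question becomes a lookup by request ID rather than a reconstruction from operational logs.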
