Fail-Closed vs Fail-Open AI Gateway: The Decision That Survives the Post-Incident Review
A gateway whose policy plane is unreachable has to decide whether to forward the request anyway or refuse it. The decision is architectural and the cost of getting it wrong shows up as either a regulatory finding or a production outage. Fail-closed for policy decisions and fail-open for cached bundles is the pairing that survives the post-incident review. This walkthrough covers the failure modes, the per-route defaults, and the operational runbook for the brief window where the policy plane is unavailable.

A policy plane that resolves a decision in under a millisecond on the happy path resolves zero decisions when the registry is unreachable. The gateway has to decide what to do during the brief window of unavailability. Forwarding the request anyway is fail-open; refusing it is fail-closed. The choice is architectural and the cost of getting it wrong is either a production outage or a regulator's finding that the policy was advisory rather than enforced.
I want to walk through the failure modes, the per-route defaults that hold under load, and the operational runbook for the unavailable window.
What "fail-closed" actually means
Fail-closed means the gateway refuses the request when it cannot resolve the policy. The model receives no traffic; the application receives a structured error; the audit row records the refusal with the cause. The model layer remains unbothered; the user sees a degraded but bounded experience.
The fail-closed posture is what the regulator's question resolves against. EU AI Act Article 12 requires automatic recording of events; an "allow on registry timeout" rule means events of unknown policy status reach the model. The audit row that records "policy resolution timed out, allow forwarded" is the artifact that breaks the audit chain.
What "fail-open" actually means
Fail-open means the gateway forwards the request when it cannot resolve the policy. The user's chat completion arrives. The audit row records the forwarding with a timeout marker. The model layer remains available; the regulatory chain is broken for the duration of the unavailability.
Fail-open is sometimes the right answer for a specific route. The right pairing is per-route fail-open with explicit governance approval, narrow scope, and a budget that prevents the route from being the rule rather than the exception.
The cached-bundle bypass
The pairing that survives is fail-closed for new decisions and fail-open for cached bundles.
Each worker holds a local cache of the policy bundle it received at startup. The cache is the source of truth for decisions on that worker until the next pull. When the registry is unreachable, the worker continues to evaluate against the cached bundle. New deployments cannot proceed (the next pull fails), but in-flight requests resolve against the bundle the worker already has.
The cache TTL bounds how long the worker is willing to serve under registry outage. A TTL of 24 hours is typical; longer values increase the risk of stale rules; shorter values increase outage exposure.
Per-route fail-open exceptions
A small set of routes may justify fail-open. Examples include a routing rule that itself has no policy stake (such as a health-check endpoint) or a non-sensitive route where availability outweighs the policy enforcement value.
The allowed_reason field is what the auditor reads when asking why this route bypasses the policy plane. Every fail-open route has an explicit reason and the rule is reviewed in the same change-management process as any other policy change.
The decision that breaks both ways
The choice that breaks both ways is "default fail-open." Default fail-open during a registry outage means a routine network issue degrades into an enforcement bypass for every route. The audit log fills with "policy timeout, allowed" rows; the SOC sees no signal because the gateway is reporting success; the regulator sees a record where every decision during the outage window has no recorded policy version.
Default fail-closed during a registry outage degrades into a brief user outage that the SOC sees as 5xx responses. The user is bounded; the policy chain holds.
The default that holds is closed.
The operational runbook
The runbook for the unavailable window has four steps.
The first step is detection. The gateway's metrics expose registry-reach failures, cache-age distribution, and per-route fail-mode counters. An alert at the first registry-reach failure surfaces the problem before the cache TTL becomes a factor.
The second step is the cache-age budget. The worker continues to serve from cache until either the registry recovers or the TTL is exceeded. The operator's question is whether to extend the TTL during an active incident; the answer is no, because extending the TTL is a policy change and the change-management process for policy changes is the same as any other.
The third step is the fail-mode posture under the longer outage. Routes default to closed; the limited set of fail-open routes continues to serve. The user-facing degradation is bounded.
The fourth step is post-incident. The audit pipeline records the outage window. The post-incident review confirms that fail-closed routes refused requests as expected and that fail-open routes were within their declared scope. The review feeds back into the route-level fail-mode declarations.
What the audit record looks like during the outage
The audit row during the outage window carries the fail mode and the cause.
The auditor reads the row and sees that the gateway refused the request during a brief outage. The decision is consistent with the route's declared fail mode; the chain holds.
How this pairs with policy versioning
A policy version that has been pulled into the worker's cache continues to serve. A policy version that has not yet been pulled cannot be served, because the worker does not have the bundle. Fail-closed-on-new-versions is the property that prevents an unverified bundle from being implicitly trusted by a worker that happened to be online when the new bundle was pushed.
DeepInspect
DeepInspect runs fail-closed by default. Every route declares its fail mode; the default is closed; fail-open routes have explicit allowed_reason text and go through the same review as any other policy change. The worker cache serves through brief registry outages; the audit record carries the fail mode and the cause.
The gateway's enforcement overhead measures under 50 ms p95 on the happy path from internal DeepInspect testing. The closed-during-outage posture is what the regulator's traceability question resolves against. Book a technical deep dive at deepinspect.ai to walk through fail-mode declarations against your current route catalog.
Frequently asked questions
- What happens to user requests during a fail-closed outage?
User requests on closed routes receive a structured error. The application surfaces a degraded message such as "AI is temporarily unavailable, please retry shortly." The chat resumes when the gateway resolves the policy again.
- Can a single route flip from closed to open during an incident?
The supported path is a pre-declared mode change in the policy plane, reviewed and rolled out through the same change-management process. The unsupported path is a runtime override; runtime overrides break the audit chain and the gateway refuses them.
- What about regional registries?
Multi-region registries reduce the probability of total unavailability. Each region operates a registry replica; workers prefer their local replica and fail over to a peer region. The fail-closed default still applies; multi-region just narrows the window.
- How does this interact with rate limiting?
Rate limits are enforced from the cached bundle. A worker that has the bundle continues to enforce rate limits during a registry outage. A worker that has not yet received a new rate-limit version continues to enforce the old version until the next pull.
- What about emergency policy pushes?
Emergency policy versions are labeled and approved through the expedited path. The push reaches the registry; workers pull on their next interval; until the pull, the worker enforces the previous version. The audit record carries the version that actually applied.