What latency overhead should we expect from the gateway?

End-to-end enforcement overhead measures under 50 ms in production tests. LLM inference typically takes 500 ms to 5 seconds, so the overhead is at most about 10% of a fast call and about 1% of a slow one. The 50 ms figure comes from internal DeepInspect testing and should be validated against the deployer's own workload during a proof of concept.

How does the gateway behave during a model provider outage?

The gateway's role is to enforce policy and produce the audit record. When the upstream model API is down, the gateway returns the upstream error to the caller with the audit record reflecting the upstream failure. The gateway itself does not retry indefinitely; retry logic is the application's concern. Some deployers configure the gateway to fall back to a secondary model provider, with the policy re-evaluated against the secondary provider's endpoint.
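The fallback behavior described above can be sketched as follows. This is a minimal illustration, not the gateway's actual implementation: `call_with_fallback`, `UpstreamError`, `policy_allows`, and `call_model` are hypothetical names, and the audit log is reduced to a list of tuples.

```python
from typing import Callable, List, Optional, Tuple


class UpstreamError(Exception):
    """Hypothetical error type for a failed model provider call."""


def call_with_fallback(
    prompt: str,
    providers: List[str],
    policy_allows: Callable[[str, str], bool],
    call_model: Callable[[str, str], str],
    audit: List[Tuple[str, str]],
) -> Optional[str]:
    """Try providers in order, re-evaluating policy for each one.

    Returns the first successful response, or None when every provider
    was denied or failed (the caller then surfaces the upstream error).
    """
    for provider in providers:
        # Policy is evaluated per provider, not once for the whole request.
        if not policy_allows(prompt, provider):
            audit.append((provider, "denied_by_policy"))
            continue
        try:
            response = call_model(provider, prompt)
        except UpstreamError:
            audit.append((provider, "upstream_error"))
            continue  # one attempt per provider; no retry loop here
        audit.append((provider, "ok"))
        return response
    return None
```

Each provider gets exactly one attempt, which keeps retry policy in the application, and every outcome lands in the audit trail whether or not a response is returned.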

Can the gateway run alongside an existing API gateway like Kong or Apigee?

Yes. The traditional API gateway handles transport, rate limiting, and TLS termination at the network boundary. The AI gateway sits behind it and handles AI-specific policy at the AI request boundary. The two layers complement each other. The traditional gateway does not see prompt content. The AI gateway does.

How is the policy synchronised across instances?

The policy lives in a shared configuration store. Each gateway instance reads from the store at startup and listens for updates. Updates propagate within seconds. The store is the source of truth. A gateway instance with stale configuration is detected at the health check layer and pulled from rotation until it catches up.
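One way to picture the stale-configuration check, assuming the store exposes a monotonically increasing policy version and each instance reports the version it has loaded. Both assumptions, and the names `instances_in_rotation`, `gw-a`, and `gw-b`, are illustrative rather than taken from any product.

```python
from typing import Dict, List


def instances_in_rotation(loaded: Dict[str, int], store_version: int) -> List[str]:
    """Return the instances whose loaded policy version has caught up.

    `loaded` maps instance name to the policy version it reports;
    `store_version` is the current version in the shared store.
    """
    return sorted(name for name, version in loaded.items() if version >= store_version)


# gw-b is one version behind, so it is held out of rotation.
print(instances_in_rotation({"gw-a": 7, "gw-b": 6}, store_version=7))
```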

What is the impact of fail-closed on user experience?

A fail-closed event is visible to the user as a "this AI feature is temporarily unavailable" experience, with a retry path. The deployer chooses the user-facing copy. The trade-off is that a momentary gateway issue produces a user-facing error rather than an unenforced model call. For deployers in regulated sectors, the trade-off is correct. For deployers outside regulated sectors, the trade-off is a deliberate

AI Gateway High Availability: The Failure Modes That Matter and the Topology That Survives Them

An AI gateway sits inline between the user or agent and the LLM API. Every AI request passes through it. When the gateway fails, the AI traffic either stops (fail closed) or bypasses the gateway (fail open). Both choices have costs, and they are different costs. A deployment that picked the lower-friction choice during initial rollout often discovers, at the worst possible time, that the choice did not match the regulatory posture.

I want to walk through the failure modes that matter at the AI request boundary, the topology patterns that survive them, and the fail-closed vs fail-open trade-off as it actually plays out under regulatory pressure.

The failure modes that matter

A gateway is a stateless proxy on the data path. Five failure modes account for almost all incidents.

Single-instance crash

One instance of the gateway crashes. If the load balancer routes around the failed instance and the request is retried, the impact is a brief latency blip. If the deployment is single-instance, every request fails until the instance recovers.

Region or AZ outage

The cloud region or availability zone hosting the gateway goes down. Multi-AZ deployments within the same region survive AZ outages. Multi-region deployments survive region outages. Single-region single-AZ deployments do not.

Policy decision point latency tail

The policy decision point (PDP) takes longer than its budget. Inspection adds latency, and a pathological prompt (very long, very complex, or adversarially crafted to exhaust the inspector) can push the decision time into seconds. The model API call has its own latency, and the user expects a bounded total, so the PDP has to bound its own decision time.
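A minimal sketch of a bounded decision, assuming the inspector is an async callable and that a timeout is treated as a denial. `Verdict`, `decide_with_budget`, and the two sample inspectors are hypothetical names used only for illustration.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable


@dataclass(frozen=True)
class Verdict:
    allow: bool
    reason: str


async def decide_with_budget(
    inspect: Callable[[str], Awaitable[Verdict]],
    prompt: str,
    budget_s: float,
) -> Verdict:
    """Run the inspector under a hard deadline; deny if it overruns."""
    try:
        return await asyncio.wait_for(inspect(prompt), timeout=budget_s)
    except asyncio.TimeoutError:
        # No verdict within budget: treat as a denial (fail closed).
        return Verdict(allow=False, reason="pdp_timeout")


async def slow_inspector(prompt: str) -> Verdict:
    await asyncio.sleep(1.0)  # stands in for a pathological prompt
    return Verdict(True, "ok")


async def fast_inspector(prompt: str) -> Verdict:
    return Verdict(True, "ok")


print(asyncio.run(decide_with_budget(slow_inspector, "x", budget_s=0.05)))
```

`asyncio.wait_for` cancels the inspector task when the deadline passes, so a slow inspection does not keep consuming resources after the verdict has been forced.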

Upstream model API outage

The model provider (OpenAI, Anthropic, Amazon Bedrock) has an incident. The model API returns errors. The gateway has nothing to proxy. The gateway's own health is fine. The user experience is degraded because the underlying service is down.

Configuration mis-push

A policy update goes out with an error. Every request now matches a too-strict rule and gets blocked. The gateway is healthy. The decisions are wrong. The deployer's traffic stops.

Topology patterns that survive each mode

The topology depends on the deployer's posture and the volume of traffic.

Active-active across availability zones in one region

Two or more gateway instances in different AZs, behind a load balancer that health-checks each instance. Survives single-instance crashes and single-AZ outages. Does not survive a regional outage. Suitable for most enterprise deployments that pin to one region for data residency.

Active-active across regions

Gateway instances in two or more regions, each fronted by its own load balancer, with global DNS routing the request to the nearest healthy region. Survives single-instance, AZ, and regional outages. Adds the operational cost of running in multiple regions and the data-residency consideration that some prompts cannot cross borders.
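The routing decision, including the data-residency constraint, can be sketched as a filter followed by a nearest-region pick. This is an illustration under assumptions: the `Region` record, the jurisdiction labels, and the region names are hypothetical, and real global DNS routing is configured in the DNS provider rather than in application code.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class Region:
    name: str
    jurisdiction: str
    healthy: bool
    latency_ms: float


def pick_region(regions: Iterable[Region], allowed_jurisdictions: Set[str]) -> Optional[str]:
    """Pick the lowest-latency healthy region the prompt is allowed to reach."""
    candidates = [
        r for r in regions
        if r.healthy and r.jurisdiction in allowed_jurisdictions
    ]
    if not candidates:
        return None  # no compliant region is up; the request cannot be routed
    return min(candidates, key=lambda r: r.latency_ms).name


REGIONS = [
    Region("eu-west", "EU", healthy=False, latency_ms=20.0),
    Region("eu-central", "EU", healthy=True, latency_ms=35.0),
    Region("us-east", "US", healthy=True, latency_ms=25.0),
]

print(pick_region(REGIONS, {"EU"}))
```

With an EU-only residency constraint the nearest healthy US region is skipped even though it is faster, which is the cost the multi-region pattern carries.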

Sidecar-per-application

The gateway runs as a sidecar to each application that calls the LLM. The deployment scales with the application. Survives most centralized-gateway outages because the sidecar is co-resident with the application. Trades the operational simplicity of a centralized layer for per-application footprint.

Edge-deployed for global low latency

The gateway runs at the edge (Cloudflare Workers, Fastly Compute, regional POPs) and connects to the nearest model API endpoint. Survives most failure modes through global redundancy. The trade-off is that edge environments constrain what policy logic the gateway can execute.

The fail-closed vs fail-open trade-off

When the gateway cannot make a decision (instance unavailable, PDP timeout, configuration error), it has to choose one of two postures.

Fail closed

The request is rejected. The model API is not called. The user sees an error or a fallback experience. The deployer's policy is enforced even when the gateway is unhealthy, because the absence of a positive decision is treated as a negative decision.

Fail open

The request is allowed through. The model API is called. The user gets the response. The deployer's policy is bypassed because the absence of a positive decision is treated as a positive decision.
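In code, the two postures differ only in how a missing verdict is interpreted. The sketch below uses hypothetical names (`Posture`, `resolve`) and models "no decision" as `None`.

```python
from enum import Enum
from typing import Optional


class Posture(Enum):
    FAIL_CLOSED = "fail_closed"
    FAIL_OPEN = "fail_open"


def resolve(decision: Optional[bool], posture: Posture) -> bool:
    """Return True if the request may reach the model API.

    `decision` is None when the gateway could not reach a verdict
    (instance unavailable, PDP timeout, configuration error).
    """
    if decision is not None:
        return decision  # a real verdict always wins
    return posture is Posture.FAIL_OPEN


print(resolve(None, Posture.FAIL_CLOSED))  # False: request is rejected
print(resolve(None, Posture.FAIL_OPEN))    # True: request bypasses policy
```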

The trade-off looks symmetric on paper. In production it is not.

Why fail-closed is the default for regulated environments

For deployers under the EU AI Act, HIPAA, PCI DSS, SOX, or any sector-specific regime, the policy is the regulatory boundary. A failure to enforce the policy is a regulatory exposure. The audit record for a fail-open event reads "the policy was not evaluated, the request was permitted anyway." The regulator does not accept that explanation. A fail-closed event produces an audit record that reads "the policy could not be evaluated, the request was denied, the gateway recovered at this timestamp." That is a record the regulator accepts.

Why fail-open shows up anyway

Fail-open creeps in through availability pressure. The CEO demos the AI feature to the board. The gateway has a problem. The team flips the gateway to fail-open for the duration of the demo. The flip never gets reverted. Six weeks later the deployer's compliance posture has a hidden gap and no one knows.

How to keep fail-closed sustainable

Three properties make fail-closed sustainable. First, gateway availability targets that genuinely match the application's. Because the gateway is inline and fails closed, the AI path can be no more available than the gateway, so a 99.9% AI gateway is incompatible with a 99.95% application. Second, transparent degradation: the user is told the AI feature is unavailable, with a clear retry path, rather than a generic error. Third, an explicit, time-bounded fail-open override for incident response, logged at the gateway and reviewed within 24 hours.
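The availability mismatch is simple arithmetic. The sketch below assumes components fail independently and sit in series on the request path; the helper names are illustrative.

```python
def serial_availability(*components: float) -> float:
    """Availability of a request path whose components are all required."""
    result = 1.0
    for availability in components:
        result *= availability
    return result


def downtime_minutes_per_30_days(availability: float) -> float:
    """Allowed downtime in a 30-day window for a given availability."""
    return (1.0 - availability) * 30 * 24 * 60


print(round(downtime_minutes_per_30_days(0.999), 1))   # gateway budget
print(round(downtime_minutes_per_30_days(0.9995), 1))  # application target
print(round(serial_availability(0.999, 0.9995), 7))    # combined path
```

A 99.9% gateway is allowed about 43.2 minutes of downtime in 30 days, while a 99.95% application target allows only about 21.6. Placing the two in series gives roughly 99.85%, below either figure on its own.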

Operational practices that hold up

A high-availability AI gateway is not only a topology choice. It is an operational discipline.

Synthetic traffic in production

A continuous stream of synthetic requests through the gateway, executed against test policies that exercise the inspection path. The synthetic stream catches configuration regressions and PDP-latency regressions before real users encounter them.

Canary deployments for policy updates

Policy updates roll out to a small fraction of traffic first. The canary's block rate and latency are compared against the baseline before the rollout proceeds. A misconfigured policy that would block 100% of traffic gets caught at the canary stage.
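A canary gate of this kind might look like the following sketch. The thresholds (`max_block_rate_delta`, `max_latency_ratio`, `min_requests`) are placeholder values chosen for illustration, not recommendations, and `Stats` and `canary_passes` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stats:
    requests: int
    blocked: int
    p95_latency_ms: float

    @property
    def block_rate(self) -> float:
        return self.blocked / self.requests if self.requests else 0.0


def canary_passes(
    baseline: Stats,
    canary: Stats,
    max_block_rate_delta: float = 0.05,
    max_latency_ratio: float = 1.5,
    min_requests: int = 100,
) -> bool:
    """Decide whether a policy rollout may proceed past the canary stage."""
    if canary.requests < min_requests:
        return False  # not enough evidence to promote
    if canary.block_rate - baseline.block_rate > max_block_rate_delta:
        return False  # new policy blocks noticeably more traffic
    if canary.p95_latency_ms > baseline.p95_latency_ms * max_latency_ratio:
        return False  # new policy is noticeably slower to evaluate
    return True


baseline = Stats(requests=10_000, blocked=200, p95_latency_ms=40.0)
block_everything = Stats(requests=500, blocked=500, p95_latency_ms=38.0)
print(canary_passes(baseline, block_everything))
```

The block-everything policy from the example above has a block rate of 100% against a 2% baseline, so the gate stops the rollout at the canary.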

Health checks that exercise the decision path

The load balancer's health check has to exercise the policy decision path, not just the gateway process. An instance that responds to TCP but has a broken PDP is unhealthy from the user's perspective and has to be marked unhealthy by the load balancer.
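A health check that exercises the decision path can be sketched as two synthetic probes with known expected verdicts. The probe strings and the `evaluate` callable are hypothetical; in practice the probes would run against test policies reserved for this purpose.

```python
from typing import Callable

KNOWN_ALLOW = "synthetic probe: benign prompt"
KNOWN_DENY = "synthetic probe: policy-violating prompt"


def deep_health_check(evaluate: Callable[[str], bool]) -> bool:
    """Healthy only if the PDP allows the benign probe and denies the bad one."""
    try:
        return evaluate(KNOWN_ALLOW) is True and evaluate(KNOWN_DENY) is False
    except Exception:
        # The process is up but the decision path is broken: report unhealthy.
        return False


def working_pdp(prompt: str) -> bool:
    return "policy-violating" not in prompt


def broken_pdp(prompt: str) -> bool:
    raise RuntimeError("policy engine unavailable")


print(deep_health_check(working_pdp), deep_health_check(broken_pdp))
```

Checking both an allow and a deny probe also catches a PDP that has silently degraded into allowing or blocking everything.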

Time-bounded fail-open with audit

If fail-open is permitted at all, it is time-bounded and logged. Every fail-open event produces an audit record describing why and for how long. The deployer's compliance function reviews the records.
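A time-bounded override can be modeled as an expiry timestamp plus an audit entry written when the override is enabled. This is a sketch under assumptions: `FailOpenOverride` is a hypothetical class, and the in-memory list stands in for a durable audit store.

```python
import time
from typing import List, Optional


class FailOpenOverride:
    """Fail-open switch that expires on its own and records who enabled it."""

    def __init__(self) -> None:
        self.audit_log: List[dict] = []
        self._expires_at: Optional[float] = None

    def enable(self, reason: str, approved_by: str, ttl_s: float,
               now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        self._expires_at = now + ttl_s
        self.audit_log.append({
            "event": "fail_open_enabled",
            "reason": reason,
            "approved_by": approved_by,
            "enabled_at": now,
            "expires_at": self._expires_at,
        })

    def active(self, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        # After expiry the gateway is back to fail-closed with no manual revert.
        return self._expires_at is not None and now < self._expires_at


override = FailOpenOverride()
override.enable("incident response", "on-call lead", ttl_s=900)
print(override.active(), len(override.audit_log))
```

Because the override lapses by itself, a forgotten revert cannot leave the gateway in fail-open indefinitely.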

Multi-region testing once per quarter

A scheduled failover test pulls traffic to the secondary region, confirms the policies match, confirms the audit record stream is intact, and confirms the application sees the expected behavior. Quarterly cadence catches drift before a real incident exposes it.

DeepInspect

This is exactly what DeepInspect does. DeepInspect sits at the AI request boundary as a stateless proxy. The proxy is horizontally scalable, fronted by load balancers, deployable in active-active topology across availability zones and regions. The default posture is fail-closed: when the policy decision point cannot reach a verdict, the request is denied. Fail-open is available as an explicit, time-bounded override.

The proxy can stay stateless because the audit record commits to durable storage before the response returns. A crashed instance does not lose decisions in flight. A new instance picks up the load and evaluates policies from the shared configuration store. Policy updates roll out through canary deployments. Health checks exercise the policy decision path.

For the deployer, the architecture matches the compliance posture. The default keeps the regulatory boundary intact. The override is auditable. The topology choice is the deployer's, not the proxy's: single-region for data residency, multi-region for resilience, sidecar for per-application footprint.

If you are deploying AI in a regulated environment and your gateway's failure mode silently allows traffic the policy would have blocked, the regulator will find the gap during the postmortem. Book a demo today.