AI Gateway Deployment Patterns: Four Topologies and When Each One Fits
Where an AI gateway sits in the network topology determines what it can enforce and what it can record. Four deployment patterns dominate production: inline reverse proxy in front of the model, sidecar to the agent runtime, in-region replicas for low-latency multi-region, and dedicated tenant gateway per customer in a multi-tenant SaaS. This piece walks through the four, what each enforces, what each records, and the operational trade-offs.

The same gateway code can sit in four positions in a typical enterprise deployment, and each position produces a different security and operational profile. The right pattern depends on the regulated workload, the multi-region constraints, the tenancy model, and the split between agent and application traffic. Most deployments end up running more than one pattern in different parts of the topology, which is operationally fine as long as the records share a common schema and the policy is consistent across deployments.
I want to walk through the four patterns most production deployments converge on, what each enforces and records, and the conditions under which each is the right choice.
Pattern 1: Inline reverse proxy in front of the model
The gateway sits as a reverse proxy between the application or agent and the model provider. Every outbound LLM call routes through the gateway. The gateway terminates the TLS connection, evaluates the policy, signs the request to the upstream model provider with the corporate API key, and forwards the call. The response returns through the gateway, which records the decision before returning to the application.
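The request path described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not DeepInspect's implementation: `handle_call`, `Decision`, and the `evaluate` and `forward` callables are hypothetical names, the record store is a plain list, and TLS termination and upstream request signing are left out.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Decision:
    allowed: bool
    reason: str


def handle_call(
    identity: str,
    prompt: str,
    evaluate: Callable[[str, str], Decision],  # hypothetical policy check
    forward: Callable[[str], str],             # hypothetical upstream model call
    records: List[dict],                       # stand-in for the record store
) -> str:
    decision = evaluate(identity, prompt)
    if not decision.allowed:
        # Denied calls never reach the provider, but they are still recorded.
        records.append({"identity": identity, "outcome": "deny", "reason": decision.reason})
        raise PermissionError(decision.reason)

    response = forward(prompt)
    # Commit the record before the application sees the response.
    records.append({"identity": identity, "outcome": "allow", "reason": decision.reason})
    return response
```

The ordering is the point of the sketch: the policy check happens before the upstream call, and the record is written before the response is returned.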
What it enforces
Identity-bound policy at the request layer. Data classification on the prompt content. Per-decision audit records with the eight fields that regulatory regimes have converged on, listed under "Records consistency across patterns" below. The gateway is the only path to the model provider, provided network egress rules block direct access; the application cannot bypass it without violating network policy.
What it records
Per-decision records signed and tamper-evident, committed before the application receives the response. The records have full identity context if the application attaches a signed delegation token, partial context if the application attaches only a service identity.
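The article does not say how the records are signed or made tamper-evident. One common approach, assumed here purely for illustration, is an HMAC over each record chained to the previous record's signature, so that altering or removing an earlier record invalidates everything after it. The function names and the in-memory chain are hypothetical.

```python
import hashlib
import hmac
import json
from typing import List


def sign_record(record: dict, prev_sig: str, key: bytes) -> dict:
    # Canonical JSON plus the previous signature links each entry to its predecessor.
    payload = json.dumps(record, sort_keys=True).encode() + prev_sig.encode()
    sig = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return {"record": record, "prev": prev_sig, "sig": sig}


def verify_chain(chain: List[dict], key: bytes) -> bool:
    prev = ""
    for entry in chain:
        payload = json.dumps(entry["record"], sort_keys=True).encode() + prev.encode()
        expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
        if entry["prev"] != prev or not hmac.compare_digest(expected, entry["sig"]):
            return False
        prev = entry["sig"]
    return True
```

A production system might use asymmetric signatures instead, so that auditors can verify records without holding the signing key.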
When this pattern fits
Single-region deployments where latency between the application and the model is dominated by the model's own inference time. Workloads where the regulatory obligation is the central concern and the operational simplicity of a single gateway is more important than the absolute lowest latency.
Trade-offs
A network failure between the application and the gateway, or between the gateway and the model provider, makes the model unreachable for the application, so high availability on the gateway tier is necessary. The model provider sees the gateway's egress IP and the corporate API key as the source of every call, which means provider rate limits apply per gateway pool, not per application.
Pattern 2: Sidecar to the agent runtime
The gateway runs as a sidecar process or container alongside the agent runtime. Every LLM and tool call the agent makes routes through the sidecar over a local socket or loopback connection. The sidecar evaluates the policy and forwards to the upstream model or tool.
What it enforces
Identity-bound policy with workload identity from the runtime's own identity provider integration. Per-call evaluation against the sidecar's local policy cache, which refreshes from the central policy service.
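A local policy cache of this kind can be sketched as below. This is an assumption-laden illustration: `PolicyCache`, `fetch`, and `refresh_seconds` are hypothetical names, the refresh is lazy (checked on each read) rather than driven by a background timer, and the clock is injectable only to make the behavior easy to test.

```python
import time
from typing import Any, Callable, Optional


class PolicyCache:
    def __init__(
        self,
        fetch: Callable[[], Any],                  # hypothetical call to the central policy service
        refresh_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._refresh_seconds = refresh_seconds
        self._clock = clock
        self._policy: Any = None
        self._fetched_at: Optional[float] = None

    def get(self) -> Any:
        now = self._clock()
        stale = self._fetched_at is None or now - self._fetched_at >= self._refresh_seconds
        if stale:
            # Refetch only when the cached copy is older than the refresh interval.
            self._policy = self._fetch()
            self._fetched_at = now
        return self._policy
```

Between refreshes every call is evaluated against the cached copy, which is what keeps the per-call path local.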
What it records
Per-decision records produced by the sidecar and shipped to a central record store. The records are signed at the sidecar before transmission. The central store deduplicates and provides the audit query surface.
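Deduplication at the central store matters because a sidecar that retries a shipment may deliver the same record twice. A minimal sketch, assuming records are deduplicated by a hash of their canonical JSON form (the article does not specify the key, and `CentralRecordStore` is a hypothetical name):

```python
import hashlib
import json


class CentralRecordStore:
    def __init__(self) -> None:
        self._records: dict = {}

    def ingest(self, record: dict) -> bool:
        """Store the record; return False if an identical record was already ingested."""
        digest = hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()
        if digest in self._records:
            return False
        self._records[digest] = record
        return True

    def __len__(self) -> int:
        return len(self._records)
```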
When this pattern fits
Agent runtimes where the call volume is high and the latency of a remote gateway hop is a concern. Multi-region deployments where putting the policy enforcement next to the agent simplifies the data-residency story (the prompt content never leaves the region the agent runs in).
Trade-offs
The sidecar must be deployed and lifecycle-managed alongside every agent runtime. The policy cache must refresh frequently enough that policy changes take effect within the operational SLA. The records must reach the central store reliably, which usually means a local buffer with retry.
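The local buffer with retry can be sketched as follows. The names are hypothetical, the buffer is in memory only (a real sidecar would presumably persist it so that records survive a restart), and `send` stands in for whatever transport reaches the central store.

```python
from collections import deque
from typing import Callable, Deque


class RecordBuffer:
    def __init__(self, send: Callable[[dict], None]) -> None:
        self._send = send                    # hypothetical shipper to the central store
        self._pending: Deque[dict] = deque()

    def add(self, record: dict) -> None:
        self._pending.append(record)

    def flush(self) -> int:
        """Ship pending records in order; stop at the first failure and keep the rest."""
        shipped = 0
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                break  # leave the record queued for the next flush attempt
            self._pending.popleft()
            shipped += 1
        return shipped

    def __len__(self) -> int:
        return len(self._pending)
```

A record is removed from the queue only after `send` returns, so a failure mid-shipment results in a retry rather than a lost record.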
Pattern 3: In-region replicas for low-latency multi-region
The gateway runs as replicas in each region the application or agent runs in. Each replica handles traffic local to its region. Policy is centrally managed and replicated to each region. Records are written locally and shipped to a central store.
What it enforces
The same identity-bound policy as Pattern 1, applied in the region the call originates from. Cross-region policy consistency depends on the replication lag of the policy store.
What it records
Per-decision records produced in each region. By default they are shipped to a central store for cross-region audit query; where data-residency requirements apply, as in many EU deployments, they can instead remain in-region behind a per-region query surface.
When this pattern fits
Multi-region deployments where latency between application and gateway must remain under a low ceiling (typically under 10 ms). EU deployments with data-residency requirements where the prompt content cannot leave the region.
Trade-offs
Policy consistency across regions depends on the replication lag of the policy store, and the completeness of the cross-region audit view depends on the central record store's availability. Operational complexity scales with the number of regions, but the per-region operational cost is roughly fixed.
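Replication lag is easier to manage when it is observable. A small sketch, assuming policy versions are monotonically increasing integers and that each regional replica reports the version it currently enforces (both are assumptions, and the region names are placeholders):

```python
from typing import Dict, List


def lagging_regions(central_version: int, region_versions: Dict[str, int]) -> List[str]:
    """Return the regions whose enforced policy version is behind the central one."""
    return sorted(
        region for region, version in region_versions.items() if version < central_version
    )
```

For example, `lagging_regions(7, {"eu-west": 7, "us-east": 6})` returns `["us-east"]`.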
Pattern 4: Dedicated tenant gateway per customer in multi-tenant SaaS
The SaaS provider runs a dedicated gateway per customer tenant. Each tenant's traffic routes through that tenant's gateway with the tenant's policy and the tenant's records. The provider operates the gateway pool; the customer's data and records are isolated at the gateway tier.
What it enforces
Per-tenant policy with tenant-specific identity bindings. Cross-tenant isolation at the gateway layer, which is stronger than relying on application-level isolation.
What it records
Per-tenant records signed with tenant-specific keys. The records are accessible to the tenant for their own audit purposes and to the provider for operational monitoring.
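Signing with tenant-specific keys means a record signed for one tenant does not verify under another tenant's key. A minimal sketch, assuming HMAC-SHA256 and an in-memory key table; the tenant names and keys are hypothetical, and a real deployment would presumably hold the keys in a key management service.

```python
import hashlib
import hmac
import json

# Hypothetical demo keys, for illustration only.
TENANT_KEYS = {"tenant-a": b"demo-key-a", "tenant-b": b"demo-key-b"}


def sign_for_tenant(tenant: str, record: dict) -> str:
    key = TENANT_KEYS[tenant]  # unknown tenants raise KeyError; there is no shared fallback key
    body = json.dumps(record, sort_keys=True).encode()
    return hmac.new(key, body, hashlib.sha256).hexdigest()


def verify_for_tenant(tenant: str, record: dict, signature: str) -> bool:
    return hmac.compare_digest(sign_for_tenant(tenant, record), signature)
```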
When this pattern fits
Multi-tenant SaaS platforms where the underlying application embeds AI features and the customers have independent compliance requirements. Healthcare SaaS where the business associate agreement (BAA) with each healthcare customer imposes per-customer audit obligations. Financial SaaS where each customer faces independent regulatory regimes.
Trade-offs
Operational complexity scales with the customer count. The provider needs tooling to provision new gateways, manage their lifecycle, and aggregate records across the fleet without breaking tenant isolation. The per-customer cost is the main constraint at scale.
Choosing among the four
Most production deployments converge on Pattern 1 first, then add Pattern 3 for multi-region, then add Pattern 2 for high-volume agent runtimes, then add Pattern 4 if the deployment is multi-tenant SaaS. The patterns can coexist as long as the records share a common schema and the policy is consistent across deployments. The gateway code does not change across patterns; what changes is where the gateway runs and which traffic it sees.
Records consistency across patterns
The records produced by all four patterns share the same eight-field schema: natural-person identity, agent identity, role and scopes, data classification, policy version, model and route, decision outcome, tamper-evident timestamp. A central record store can ingest records from all four patterns and produce a unified audit query surface.
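The eight fields map naturally onto a single record type. The sketch below is one possible shape: the field names and types are assumptions derived from the list above, not a published schema, and the mechanism that makes the timestamp tamper-evident is not shown.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)  # frozen: a record is not modified after the decision
class DecisionRecord:
    natural_person_identity: str
    agent_identity: str
    role_and_scopes: Tuple[str, ...]
    data_classification: str
    policy_version: str
    model_and_route: str
    decision_outcome: str
    timestamp: str  # tamper-evident timestamp; the evidence mechanism is out of scope here
```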
Policy consistency across patterns
The policy in effect at each gateway should be identical for the same logical scope. The policy is managed centrally and distributed to each gateway. Policy changes propagate within the operational SLA (typically minutes). The records carry the policy version, so any decision can be re-evaluated against the policy that was in effect when it was made.
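Carrying the policy version in each record is what makes later re-evaluation possible. A toy sketch, assuming a policy is just a set of blocked data classifications and that historical versions are retained; `POLICY_HISTORY` and both function names are hypothetical.

```python
from typing import Dict

# Hypothetical policy history keyed by version; real policies are richer than this.
POLICY_HISTORY: Dict[str, dict] = {
    "v1": {"blocked": {"restricted"}},
    "v2": {"blocked": {"restricted", "confidential"}},
}


def outcome_under(policy: dict, data_classification: str) -> str:
    return "deny" if data_classification in policy["blocked"] else "allow"


def record_matches_policy(record: dict, history: Dict[str, dict]) -> bool:
    """Re-evaluate a record against the policy version it names."""
    policy = history[record["policy_version"]]
    return outcome_under(policy, record["data_classification"]) == record["decision_outcome"]
```

A record that allowed a confidential prompt under `v1` still checks out after `v2` starts blocking that classification, because it is judged against the version it names.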
DeepInspect
This is the architecture DeepInspect was built to provide. DeepInspect ships in all four deployment patterns with a common policy and a common records schema, so the records, the policy, and the audit query surface stay consistent whichever pattern is deployed. The deployment choice depends on the regulated workload, the latency requirements, the multi-region constraints, and the tenancy model.
For a single-region deployment, Pattern 1 ships first. For a multi-region deployment with data-residency requirements, Pattern 3 follows. For high-volume agent runtimes where latency to a remote gateway is a concern, Pattern 2 ships as a sidecar. For multi-tenant SaaS with independent customer compliance, Pattern 4 ships per tenant.
Frequently asked questions
- Can I run multiple patterns in the same deployment?
Yes. Most production deployments run more than one pattern in different parts of the topology. The customer-facing application might use Pattern 1, the internal agent runtime might use Pattern 2 as a sidecar, the EU region might use Pattern 3 for data-residency, and the SaaS product layered on top might use Pattern 4 for tenant isolation. The records and policy stay consistent because they share a common schema and a common policy distribution. The operational cost of running multiple patterns is dominated by the records aggregation, which the central record store handles.
- How does the sidecar pattern handle policy updates?
The sidecar runs a local policy cache that refreshes from the central policy service on a configurable interval (typically every 30 to 60 seconds). Between refreshes the sidecar enforces the cached version, so a policy change takes effect at a given sidecar within one refresh interval of becoming available from the central service. The records carry the policy version at the moment of decision, so any record can be checked against the exact policy that was active. Operational SLAs are typically expressed as the time from policy commit to enforcement at every sidecar, which is at most the refresh interval plus the propagation latency from the central service.
- What's the failure mode if the gateway is down?
The default is fail-closed: if the gateway cannot evaluate the policy, the upstream LLM call fails. The application must handle the failure (typically by surfacing an error to the user). Fail-open mode (forward the call without policy enforcement) is configurable and not recommended for regulated workloads because the records gap on the unenforced traffic creates compliance exposure. High availability on the gateway tier is the operational answer to the failure mode. Patterns 1 and 3 run gateway pools with health checks and load balancing. Pattern 2 runs the sidecar with the same lifecycle as the agent runtime, so the sidecar's availability matches the agent's.
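The difference between the two modes can be sketched in a few lines. This is illustrative only: `guarded_call`, `GatewayUnavailable`, and the `fail_open` flag are hypothetical names, and an unreachable policy evaluation is modeled as a `ConnectionError`.

```python
from typing import Callable


class GatewayUnavailable(RuntimeError):
    """Raised when policy cannot be evaluated and the gateway is fail-closed."""


def guarded_call(
    prompt: str,
    evaluate: Callable[[str], bool],  # hypothetical policy check; may raise ConnectionError
    forward: Callable[[str], str],    # hypothetical upstream model call
    fail_open: bool = False,
) -> str:
    try:
        allowed = evaluate(prompt)
    except ConnectionError as exc:
        if not fail_open:
            # Fail-closed: no policy decision means no upstream call.
            raise GatewayUnavailable("policy could not be evaluated") from exc
        allowed = True  # fail-open: the call proceeds unenforced, leaving a records gap
    if not allowed:
        raise PermissionError("denied by policy")
    return forward(prompt)
```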
- How do I handle data-residency for EU customers?
Pattern 3, in-region replicas, is the standard answer. The gateway runs in the EU region, the prompt content is evaluated in-region, the records are written in-region, and the upstream model call goes to an EU-region model endpoint. Cross-region operational traffic (policy updates from the central policy service, records shipping to the central record store) carries policy and decision metadata rather than prompt content, and is typically permitted under EU data-residency rules. The records can also be configured to remain entirely in-region with a per-region query surface that the auditor queries directly.
- Does the deployment pattern affect the compliance posture?
The deployment pattern affects the records granularity and the data-residency story. Patterns 1 and 3 produce identical records at different network positions. Pattern 2 produces identical records at the agent runtime. Pattern 4 produces per-tenant records with tenant-isolation properties. All four satisfy the same regulatory ask (per-decision records with the eight fields). The pattern choice is driven by latency, multi-region, and tenancy considerations. The compliance posture is consistent because the records are consistent.