Can the gateway be deployed as a sidecar to existing application services?

Yes. The gateway can run as a sidecar per application service, with the application calling a localhost port for the LLM endpoint and the sidecar handling TLS, identity, inspection, policy, and routing. The sidecar pattern keeps the gateway close to the application and simplifies the network topology. At scale, however, programs typically run a centralized gateway cluster behind a load balancer instead of per-service sidecars.

How does the gateway handle streaming responses?

LLM streaming responses (server-sent events on the Chat Completions API, streaming on Bedrock and Anthropic) flow through the gateway with response-side inspection running on the assembled response as the stream completes. The gateway can also enforce streaming-time policy (cut the stream if a forbidden category appears in the response), which adds a per-chunk inspection step.
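
The streaming-time cut-off can be sketched as a generator that relays chunks and stops when the text seen so far is flagged. This is a minimal illustration, not a description of any particular gateway's implementation: `inspect_stream` and `contains_marker` are hypothetical names, and the substring check stands in for a real classifier call.

```python
from typing import Callable, Iterable, Iterator


def inspect_stream(
    chunks: Iterable[str],
    is_forbidden: Callable[[str], bool],
) -> Iterator[str]:
    """Relay chunks to the caller, stopping once the text so far is flagged."""
    assembled = ""
    for chunk in chunks:
        assembled += chunk
        # Per-chunk inspection runs on the assembled text, so a match that
        # straddles a chunk boundary is still caught.
        if is_forbidden(assembled):
            return  # cut the stream; the triggering chunk is not forwarded
        yield chunk


def contains_marker(text: str) -> bool:
    # Hypothetical stand-in for a real classifier.
    return "SECRET" in text
```

Chunks forwarded before the match completes have already reached the caller, so a stricter design would buffer a few chunks before relaying them, at the cost of added time to first token.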

Does the gateway add latency that affects production performance?

End-to-end inspection overhead measures under 50 ms in internal testing. LLM inference itself takes 500 ms to 5 seconds depending on the model and the request size, so the overhead is at most about 10% of the fastest calls and about 1% of the slowest, which sits inside the round-trip variance. Programs running latency-sensitive applications typically calibrate the gateway's inspection thresholds during the first thirty days against real traffic.

Can the gateway route to self-hosted open-weight models?

Yes. The gateway is model-agnostic. Self-hosted models (vLLM, TGI, Ollama, internal SageMaker endpoints) appear to the gateway as just another LLM endpoint. The policy and record surface remain the same across hosted and self-hosted models.

How does the gateway interact with existing API gateways the program runs?

The AI gateway sits in front of LLM endpoints. The existing API gateway (Kong, Apigee, AWS API Gateway) typically sits in front of internal services. The two gateways can co-exist with the AI gateway handling LLM-specific concerns (prompt inspection, AI-specific policy, AI audit record) and the existing API gateway handling rate limits, API key management, and general routing.

AI Gateway Architecture: The Components That Sit Between an Enterprise Caller and an LLM Endpoint

An AI gateway is the HTTP proxy that sits between an enterprise caller and an LLM endpoint. The architecture has six core components, and each component is a placement decision that ties to a regulatory obligation, an operational property, or an integration with the existing enterprise stack. The components are TLS termination, identity binding, request inspection, policy evaluation, the model router, and the audit record emitter.

I want to walk through each component, what it does, where it sits relative to the others, and the operational consequences of getting the placement right or wrong.

Component one: TLS termination

The gateway terminates TLS at the inspection point. The TLS termination is what lets the gateway read the request body, which carries the prompt content. Without TLS termination at the gateway, the prompt content remains encrypted to the model provider's endpoint and the inspection cannot read it.

The TLS termination raises a certificate question. The client (the application or the user's browser) needs to trust the gateway's certificate for the LLM endpoint domain. Programs typically handle this with a corporate root certificate the device fleet already trusts, with the gateway issuing leaf certificates for each LLM endpoint the program routes through. The certificate management belongs in the PKI infrastructure the program already operates.

Component two: identity binding

The identity binding step authenticates the caller against the corporate identity provider. The gateway extracts the IdP token from the request (typically a JWT in the Authorization header), verifies the signature against the IdP's public key, and attaches the verified identity to the request context. The identity context flows downstream through the policy evaluation and the audit record emission.

For agentic traffic, the identity binding includes the delegation chain: the agent's service identity plus the originating user identity from the delegation token. The chain carries through the request context for the policy and the record.
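
As a rough sketch of the identity context, the structure below holds the verified caller and, for agentic traffic, the originating user. It assumes the delegation token carries the `act` (actor) claim defined in RFC 8693, which is one common convention; the text above does not say which token format the gateway expects. `IdentityContext` and `bind_identity` are hypothetical names, and signature verification is assumed to have happened before this step.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IdentityContext:
    subject: str                         # verified principal making the call
    issuer: str                          # IdP that signed the token
    on_behalf_of: Optional[str] = None   # originating user for agentic traffic

    def chain(self) -> list[str]:
        """Delegation chain, originating user first."""
        if self.on_behalf_of is None:
            return [self.subject]
        return [self.on_behalf_of, self.subject]


def bind_identity(claims: dict) -> IdentityContext:
    """Build the context from claims that have ALREADY been signature-verified."""
    actor = claims.get("act", {}).get("sub")
    if actor is not None:
        # Delegation token: "sub" is the user, "act.sub" is the agent acting for them.
        return IdentityContext(subject=actor, issuer=claims["iss"], on_behalf_of=claims["sub"])
    return IdentityContext(subject=claims["sub"], issuer=claims["iss"])
```

Keeping the chain on one immutable object lets the policy and the record emitter read the same values without re-parsing the token.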

The verified identity is the field that the record kept under EU AI Act Article 19 references. Placing the binding at the gateway is what lets the identity show up on every record without each application team wiring it through.

Component three: request inspection

The request inspection parses the request body, identifies the prompt fields per the LLM API contract (messages for OpenAI Chat Completions and Anthropic Messages, contents for Google Generative AI, inputText for Amazon Titan text models on AWS Bedrock Runtime), and runs the classifier against the prompt content.
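
A minimal sketch of the field lookup, assuming the request body has already been parsed from JSON into a dict. `extract_prompt_text` is a hypothetical helper; it covers only the field names listed above and ignores non-text parts such as images.

```python
def extract_prompt_text(body: dict) -> list[str]:
    """Collect prompt strings from a parsed request body, keyed on the field present."""
    texts: list[str] = []
    if "messages" in body:            # OpenAI Chat Completions / Anthropic Messages
        for message in body["messages"]:
            content = message.get("content")
            if isinstance(content, str):
                texts.append(content)
            elif isinstance(content, list):   # content given as a list of typed parts
                texts += [p["text"] for p in content if isinstance(p, dict) and "text" in p]
    elif "contents" in body:          # Google Generative AI
        for item in body["contents"]:
            texts += [p["text"] for p in item.get("parts", []) if "text" in p]
    elif "inputText" in body:         # the Bedrock field named above
        texts.append(body["inputText"])
    return texts
```

Dispatching on the field name keeps the example short; a production parser would more likely dispatch on the destination host and path, since different providers can share a field name.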

The classifier output: category labels (PII, PHI, source code, customer data, organization-defined categories), confidence scores per label, and the span (start offset, length) of each labeled section. The inspection produces a structured classification result the policy can match against.
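
The shape of that result might look like the sketch below. The `LabeledSpan` type and the email-only regex are illustrative assumptions; a real classifier covers many more categories and produces graded confidence scores rather than a fixed 1.0.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledSpan:
    label: str         # category, e.g. "PII"
    confidence: float  # score for this label
    start: int         # character offset into the prompt
    length: int        # number of characters covered


# Deliberately simple pattern, for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def classify(prompt: str) -> list[LabeledSpan]:
    """Return one labeled span per email address found in the prompt."""
    return [
        LabeledSpan("PII", 1.0, m.start(), m.end() - m.start())
        for m in EMAIL.finditer(prompt)
    ]
```

Carrying the span alongside the label is what makes a redact decision possible: the gateway can replace exactly the labeled characters and forward the rest.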

The inspection runs on the request body and, on the response side, on the model response body. The two inspections share the classifier and the policy surface but emit separate decisions for the request side and the response side.

Component four: policy evaluation

The policy evaluation receives the identity context and the classification result and returns a decision. The policy is a function over the (data, identity, model) triple. The decision is permit, redact, or block; some gateways add escalate for the human-review pattern.

The policy state is versioned. Each evaluation records the policy version that decided. The version is what lets the program reconstruct the decision basis when the policy changes between the original decision and the audit review.
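
To make the triple and the version stamp concrete, here is a small sketch. The rules, the role names, the `internal/` model prefix, and the version string are all invented for illustration; only the four decision values come from the text above.

```python
from dataclasses import dataclass

POLICY_VERSION = "v7"  # hypothetical version identifier


@dataclass(frozen=True)
class Decision:
    action: str          # "permit", "redact", "block", or "escalate"
    policy_version: str  # version of the rules that produced the action


def evaluate(labels: set[str], role: str, model: str) -> Decision:
    """Decide over the (data, identity, model) triple. Rules are illustrative only."""
    self_hosted = model.startswith("internal/")
    if "PHI" in labels and not self_hosted:
        action = "block"       # PHI does not leave for an external model
    elif "PHI" in labels and role != "clinician":
        action = "escalate"    # self-hosted model, but the caller needs human review
    elif "PII" in labels and not self_hosted:
        action = "redact"      # strip labeled spans before forwarding
    else:
        action = "permit"
    return Decision(action, POLICY_VERSION)
```

Returning the version with every decision means the audit record can name the exact rule set that applied, even after the policy has moved on.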

Component five: model router

The model router decides which LLM endpoint receives the forwarded request. The decision can route based on the model the caller requested, the policy tier the caller's identity holds, the model availability state, or the regional jurisdiction the data has to stay inside.

The router is the integration point for multi-provider deployments. The gateway can route OpenAI traffic to api.openai.com, Anthropic traffic to api.anthropic.com, Bedrock traffic to bedrock-runtime.us-east-1.amazonaws.com, and self-hosted traffic to an internal inference endpoint, all under the same policy and record surface.

The router is also the integration point for data residency. A policy can require certain data tenants' requests to route to in-region endpoints, with the routing decision recorded on the audit series.
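
A routing table with a residency check could be sketched as follows. The three public hostnames are the ones named above; the internal hostname, the `IN_REGION` rule, and the `route` function are hypothetical.

```python
from typing import Optional

ENDPOINTS = {
    "openai": "api.openai.com",
    "anthropic": "api.anthropic.com",
    "bedrock": "bedrock-runtime.us-east-1.amazonaws.com",
    "self-hosted": "inference.internal.example",  # made-up placeholder
}

# Hypothetical residency rule: a tenant pinned to a region may only use
# the providers listed for that region.
IN_REGION = {"eu": {"self-hosted"}}


def route(provider: str, tenant_region: Optional[str] = None) -> str:
    """Return the endpoint host, refusing providers outside the tenant's region."""
    allowed = IN_REGION.get(tenant_region)
    if allowed is not None and provider not in allowed:
        raise PermissionError(f"{provider} is not in-region for {tenant_region}")
    return ENDPOINTS[provider]
```

Raising on a residency violation, rather than silently rerouting, keeps the refusal visible so it can be written to the audit series.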

Component six: audit record emitter

The audit record emitter commits the per-decision record before the response returns to the caller. The record carries the timestamp, identity, classification, decision, policy version, request and response hashes, and an integrity signature chained from the previous record. The series is tamper-evident and queryable for audit sampling.
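
One way to chain records is to sign each one together with the signature of the record before it. The sketch below uses HMAC-SHA256 with a shared key as a simple stand-in; the text above does not specify the signature scheme, and a production series might use asymmetric signatures instead. The function names and the all-zero genesis value are assumptions.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # placeholder "previous signature" for the first record


def sign_record(record: dict, prev_sig: str, key: bytes) -> str:
    """HMAC over the previous signature plus the canonical JSON of this record."""
    payload = prev_sig + json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hmac.new(key, payload.encode("utf-8"), hashlib.sha256).hexdigest()


def append(series: list[dict], record: dict, key: bytes) -> None:
    """Append a record, chaining its signature from the last entry."""
    prev_sig = series[-1]["sig"] if series else GENESIS
    series.append({**record, "sig": sign_record(record, prev_sig, key)})


def verify(series: list[dict], key: bytes) -> bool:
    """Recompute every signature in order; any edit or reordering fails."""
    prev_sig = GENESIS
    for entry in series:
        record = {k: v for k, v in entry.items() if k != "sig"}
        if not hmac.compare_digest(entry["sig"], sign_record(record, prev_sig, key)):
            return False
        prev_sig = entry["sig"]
    return True
```

Because each signature depends on the one before it, changing or removing an earlier record invalidates every later one, which is what makes the series tamper-evident.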

The emitter writes to a record store (a tamper-evident log, often backed by a database with append-only constraints or a write-once storage tier). The records flow into the SIEM (Splunk, Elastic, Datadog) through standard log forwarding. The compliance team queries the SIEM or the record store directly for audit sampling.

How the components interact on a single request
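
The flow can be sketched as a single function that runs the six components in the order they were introduced. That ordering, the toy classifier and policy, the placeholder endpoint, and the choice to emit a record even when the request is blocked are all assumptions made for illustration; response-side inspection is left out to keep the example short.

```python
def handle_request(request: dict) -> dict:
    """Run one request through the six components and report what happened."""
    trace = []  # order in which components ran

    trace.append("tls_termination")        # 1. request body is now readable

    identity = request["claims"]["sub"]    # 2. assumes the token was already verified
    trace.append("identity_binding")

    prompt = request["prompt"]             # 3. toy classifier
    labels = set()
    if "@" in prompt:
        labels.add("PII")
    if "diagnosis" in prompt.lower():
        labels.add("PHI")
    trace.append("request_inspection")

    if "PHI" in labels:                    # 4. toy policy
        decision = "block"
    elif "PII" in labels:
        decision = "redact"
    else:
        decision = "permit"
    trace.append("policy_evaluation")

    endpoint = None
    if decision != "block":                # 5. blocked requests are never forwarded
        endpoint = "llm.internal.example"  # made-up placeholder
        trace.append("model_router")

    trace.append("audit_record")           # 6. emitted on every decision
    return {"identity": identity, "decision": decision,
            "endpoint": endpoint, "trace": trace}
```

Each step consumes only what the earlier steps produced, which is why the placement order matters: the policy cannot decide without the identity and the classification, and the record cannot be complete until the decision and the route are known.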

Integration with the existing enterprise stack

The gateway integrates with three existing systems. The corporate IdP supplies the identity. The SIEM receives the audit records. The PKI fleet trusts the gateway's certificates. The integration patterns are standard: OIDC or SAML for the IdP, syslog or HEC for the SIEM, corporate root certificate for the PKI.

For programs running the gateway in front of multiple LLM providers, the gateway is typically deployed as a sidecar or as a cluster behind a load balancer the application tier already uses. The latency overhead measures under 50 ms end-to-end in internal testing, which sits inside the LLM inference variance.

Regulatory framing

The gateway architecture maps to specific regulatory obligations. EU AI Act Article 12 and Article 19 on record-keeping land on the audit record emitter. Article 26 on deployer obligations lands on the policy evaluation and the integration with human oversight. HIPAA Security Rule 45 CFR 164.312(b) on audit controls lands on the record emitter and the identity binding. PCI DSS v4.0 requirement 10 on logging lands on the same record series.

The NIST AI RMF MEASURE function lands on the classification and decision metrics the gateway emits. The MANAGE function lands on the feedback loop between the records and policy refinement.

DeepInspect

DeepInspect ships the AI gateway architecture with all six components in one integrated proxy. It provides TLS termination at the edge, IdP integration for identity binding (Okta, Entra ID, Ping, or any OIDC provider), deterministic classification on PII, PHI, source code, customer data, and organization-defined categories, policy evaluation against the (data, identity, model) triple, multi-provider routing, and a tamper-evident audit record series the SIEM consumes through standard log forwarding.

For programs designing an AI gateway from the architecture up, the six-component model is the reference shape. The integrated deployment is what makes the gateway operational without per-team integration work.
