LLM Proxy: The Architectural Pattern, the Operational Modes, and the Audit Record Each Mode Produces
An LLM proxy is a process that sits on the HTTP path between calling identities and LLM provider endpoints. The proxy can operate in three modes: pass-through observability, policy enforcement, or vendor multiplexing. The choice of mode decides what the audit record contains and whether the record satisfies regulatory expectations. A pass-through proxy logs the call. A policy enforcement proxy commits identity, classification, and policy state. A multiplexing proxy unifies the API across vendors. Regulated deployments typically need the enforcement mode.

An LLM proxy is a process that sits on the HTTP path between calling identities and LLM provider endpoints. The proxy intercepts requests to api.openai.com, api.anthropic.com, generativelanguage.googleapis.com, and similar endpoints, performs work on the request, and forwards the request to the actual provider. The work the proxy performs differs by operational mode. The three production modes are pass-through observability, policy enforcement, and vendor multiplexing. The choice of mode decides what the audit record contains and whether the record satisfies the contemporaneous, identity-bound, classification-aware expectation EU AI Act Article 12 and DORA Article 19 reviewers apply.
I want to walk through the three modes, what each one logs, where each one fits in the production architecture, and the architectural pattern that satisfies regulatory record-keeping at the request boundary.
The three operational modes
The modes differ in what the proxy does between intercepting the request and forwarding it.
Mode 1: pass-through observability
A pass-through proxy intercepts the request, copies it to a telemetry sink, and forwards the original to the provider. The proxy may add latency tracking, cost attribution headers, prompt tokenization counts, and response metrics. The proxy does not block, modify, or reject any request. The audit record is the captured request and response with operational metadata.
The mode supports debugging, cost reporting, and prompt evaluation workflows. The mode does not satisfy policy enforcement obligations and does not produce records bound to enterprise identity unless the calling application supplies the identity context as an explicit header.
Mode 2: policy enforcement
A policy enforcement proxy intercepts the request, attaches the enterprise IdP identity to the request through header propagation or SSO-aware proxy mode, evaluates per-route and per-role policies against the prompt classification, applies a pass, redact, or block decision, commits the per-decision record, and forwards the request only when the policy permits.
The audit record contains the identity, the role, the classification, the model and version called, the policy version, the decision outcome, and a cryptographic signature. The record is the artifact the regulator accepts under Article 12, Article 19, DORA Article 19, and Fannie Mae LL-2026-04.
Mode 3: vendor multiplexing
A multiplexing proxy presents a single API surface to the calling application and routes requests to multiple LLM providers behind the scenes. The application calls the proxy as if it were OpenAI. The proxy may route the request to OpenAI, Anthropic, AWS Bedrock, Azure OpenAI, or any other configured provider based on policy (cost, capability, availability). The mode supports vendor failover and capability matching.
The audit record covers the actual provider selected and the routing rationale. The mode complements but does not replace policy enforcement; a multiplexing proxy without identity-bound policy is operating in pass-through mode for the purposes of the regulatory record.
What the audit record contains
The record differs by mode in three specific ways.
Identity context
Pass-through and multiplexing modes record the calling application or service credential. Policy enforcement mode records the verified enterprise IdP identity, the role, and the agent (where the call is from an AI agent). The Article 19 identity-of-natural-persons requirement is satisfied only by the enforcement mode.
Data classification
Pass-through mode records the prompt content but not the classification of that content. Multiplexing mode is silent on classification by default. Policy enforcement mode evaluates the prompt classification at request time, before the model receives the prompt, and commits the classification to the record alongside the prompt.
Policy state
Pass-through and multiplexing modes do not record policy state because no policy decision is made. Policy enforcement mode records the policy version in effect at the moment of the request and the decision outcome (pass, redact, block). The record reconstructs the policy posture at any past moment.
Where the proxy fits in the production architecture
Three deployment topologies dominate.
Application-side SDK with proxy URL configured
The proxy URL is configured in the application's LLM SDK initialization. The application sends API calls to the proxy URL. The proxy forwards to the actual provider. The deployment is the simplest: one configuration change in the application and one new service to run. The application can wire identity context into the SDK as a request header.
Network-edge gateway
The proxy sits at the network egress and intercepts all traffic to known LLM provider domains. The deployment requires TLS termination at the proxy for the provider domains, which means the corporate root CA must be installed on the client devices and the application TLS pinning must be compatible. The deployment is broader: every AI request from any application is in scope.
Sidecar or service mesh integration
The proxy runs as a sidecar to the application or as a service-mesh filter. The deployment is operationally aligned with the rest of the platform and supports identity propagation through service-mesh primitives. The deployment is more involved but produces strong identity context for service-to-service AI calls.
Compliance angle
The EU AI Act Article 12 record-keeping mandate, Article 19 log content requirements, Article 26 deployer obligations, and DORA Article 19 retention requirements all assume an enforcement-mode proxy or an equivalent architecture. Pass-through proxies satisfy the call-tracking obligations but not the per-decision policy obligations. The August 2, 2026 high-risk effective date is the binding deadline for regulated AI deployments operating in the EU.
DeepInspect
This is exactly what DeepInspect does. DeepInspect runs as a policy enforcement LLM proxy that sits at the AI request boundary as an external enforcement layer, operating as a stateless proxy between authenticated users or agents and any LLM endpoint. Every HTTP request is evaluated against per-route, per-role policies using identity context the calling application supplies. The per-decision audit record is committed by the proxy, independent of the application and independent of the LLM provider, before the model response returns.
The record contains a verified identity for the requester, the role and authorization context, the data classification applied to the prompt, the AI vendor and model actually called, the policy version that governed the decision, the decision outcome, and a cryptographic signature that prevents post-hoc modification. The proxy fails closed: under uncertainty, the default decision is block, not pass. The latency overhead measures under 50 ms in internal testing, which keeps the proxy outside the LLM inference budget.
Book a technical deep dive at deepinspect.ai.
Frequently asked questions
- What's the difference between an LLM proxy and a reverse proxy like NGINX?
A reverse proxy at the HTTP layer routes requests by URL pattern, terminates TLS, and may apply caching or rate limiting. An LLM proxy is a specialized reverse proxy that understands the LLM API surface: it parses the request body, evaluates the prompt classification, applies AI-specific policy, and commits the AI-specific audit record. The two can coexist: a reverse proxy can route to an LLM proxy for AI-specific handling.
- Is the proxy a single point of failure?
A policy enforcement proxy is on the critical path for AI requests. Operational design typically uses horizontal scaling, active-active deployment across availability zones, and short health checks to maintain availability. A failed-open configuration would let requests through during proxy outage, which trades the audit obligation for availability. A failed-closed configuration aligns with the regulatory posture but requires the operational investment in availability.
- How does the proxy handle streaming responses?
LLM provider APIs commonly support server-sent events for streaming responses. The proxy parses the stream as it arrives, can apply per-token classification or response-side redaction, and forwards the (possibly modified) stream to the application. The audit record covers the streaming session as a single decision with the final classification of the assembled response.
- Can a developer-tooling proxy like Helicone or LangSmith be the enterprise audit proxy?
Developer-tooling proxies are designed for the application developer audience and capture telemetry useful for debugging and prompt evaluation. They typically do not attach the enterprise IdP identity, do not evaluate per-route, per-role policies, and do not produce the cryptographically signed per-decision record an Article 12 review expects. They can complement an enforcement proxy but do not replace it for the regulatory record.
- What about proxies that the LLM provider runs?
OpenAI, Anthropic, AWS Bedrock, and Azure OpenAI provide their own admin and audit features. The features capture the provider-side view of the API call. The view does not include the enterprise IdP identity unless explicit identity propagation is configured, and the audit record is generated by the provider whose system is the subject of the audit. The regulator generally expects an enterprise-side record independent of the provider being audited.