Does the OpenAI gateway pattern work for Azure OpenAI deployments?

Yes, the same pattern applies. Azure OpenAI uses a different base URL and a different authentication scheme (a Microsoft Entra ID token, formerly Azure AD, or an API key on a deployment-specific URL), so the inspection layer forwards to the Azure endpoint instead of api.openai.com. The application points its OpenAI SDK at the inspection layer, the inspection layer forwards to the Azure deployment, and the policy and audit machinery work the same way. Most production deployments that choose Azure OpenAI for data residency run the inspection layer in the same Azure region as the deployment to keep network latency low.
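As a rough illustration of the different upstream addressing, the sketch below builds the deployment-specific URL and the API-key header for an Azure OpenAI chat completion. The resource name, deployment name, API version, and key are placeholder values, and the helper name azure_upstream is hypothetical; a gateway that uses Entra ID tokens would send a bearer token instead.

```python
from urllib.parse import urlencode


def azure_upstream(resource, deployment, api_version, api_key):
    """Return (url, headers) for a chat completion on one Azure OpenAI deployment.

    All four arguments are placeholders supplied by gateway configuration.
    """
    query = urlencode({"api-version": api_version})
    url = (
        f"https://{resource}.openai.azure.com/openai/deployments/"
        f"{deployment}/chat/completions?{query}"
    )
    # Azure API-key auth uses an `api-key` header rather than
    # `Authorization: Bearer ...` as on api.openai.com.
    headers = {"api-key": api_key, "Content-Type": "application/json"}
    return url, headers


url, headers = azure_upstream("contoso-eu", "gpt4o-prod", "2024-06-01", "placeholder-key")
```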

How does the gateway handle OpenAI rate limits and quota?

The inspection layer holds the OpenAI API key and is subject to OpenAI's rate limits at the organization level. It surfaces a 429 from OpenAI as a 429 to the calling application, with its own Retry-After header that respects OpenAI's backoff guidance. Operators who run heavy workloads provision multiple OpenAI organizations, and the inspection layer routes across them based on the route identifier or a load-balancing rule. The audit record stamps which OpenAI organization served the request so cost attribution and quota analysis are reproducible.
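A minimal sketch of both behaviors, under stated assumptions: the router below uses simple round-robin as the load-balancing rule, and the 429 mapping assumes the upstream headers have been lower-cased and may carry a retry-after value. The names OrgRouter and surface_rate_limit and the one-second fallback are hypothetical.

```python
import itertools


class OrgRouter:
    """Round-robin across several OpenAI organizations (one key per organization)."""

    def __init__(self, org_keys):
        self._keys = dict(org_keys)
        self._cycle = itertools.cycle(sorted(self._keys))

    def pick(self):
        # Return the organization id too, so the audit record can stamp it.
        org = next(self._cycle)
        return org, self._keys[org]


def surface_rate_limit(upstream_status, upstream_headers, default_retry_after="1"):
    """Map an upstream status to the (status, headers) the caller sees.

    Assumes header names were lower-cased when the upstream response was read.
    """
    if upstream_status != 429:
        return upstream_status, {}
    retry_after = upstream_headers.get("retry-after", default_retry_after)
    return 429, {"Retry-After": retry_after}
```

A real deployment would probably also mark an organization as cooling down after a 429 and skip it until the backoff window passes; that bookkeeping is left out here.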

What does the gateway do with OpenAI's streaming responses if the policy says block?

The outcome depends on which hook produces the block. When the request-time policy blocks, the inspection layer never makes the upstream call, so the model produces no tokens, and the application receives an HTTP error with a machine-readable code that it can handle (display a refusal message, suggest a different workflow, log the event). When the chunk-level filter blocks partway through a response, the inspection layer aborts the upstream stream and stops forwarding chunks to the application. In both cases the audit record captures the prompt fingerprint, the policy that produced the block, and the decision outcome.
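The structured error could look something like the sketch below. The 403 status, the field names, and the policy_blocked code are assumptions made for illustration, not a documented format; the point is only that the application branches on a stable code and not on message text.

```python
import json


def block_response(policy_id, reason, request_id):
    """Gateway side: build (status, headers, body) for a blocked request."""
    body = {
        "error": {
            "code": "policy_blocked",  # hypothetical machine-readable code
            "message": reason,
            "policy_id": policy_id,
            "request_id": request_id,
        }
    }
    return 403, {"Content-Type": "application/json"}, json.dumps(body)


def is_policy_block(status, body_text):
    """Application side: detect the block so it can show a refusal message."""
    if status != 403:
        return False
    try:
        return json.loads(body_text)["error"]["code"] == "policy_blocked"
    except (ValueError, KeyError, TypeError):
        return False
```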

How does the gateway handle OpenAI's o1, o3, and reasoning models with their longer inference time?

The reasoning models can run for tens of seconds and produce a different response shape (reasoning tokens, completion tokens, longer streaming windows). The inspection layer's request-time policy evaluation works the same way: identity, policy, data classification, model authorization, then forward to OpenAI. The streaming response handling has to tolerate the longer window without timing out, so production deployments configure the inspection layer's HTTP timeout for the reasoning models specifically. The raw reasoning tokens are not returned through the API (they appear only as a count in the usage data), so the chunk-level policy hooks run on the visible output stream.
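A per-model timeout can be as small as a lookup keyed on the model name. The timeout values and the prefix matching below are illustrative assumptions, not recommended settings.

```python
DEFAULT_TIMEOUT_S = 60.0      # illustrative value for standard chat models
REASONING_TIMEOUT_S = 600.0   # illustrative value for long-running reasoning models
REASONING_PREFIXES = ("o1", "o3")


def upstream_timeout(model):
    """Pick the upstream HTTP timeout (seconds) from the requested model name."""
    if model.startswith(REASONING_PREFIXES):
        return REASONING_TIMEOUT_S
    return DEFAULT_TIMEOUT_S
```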

Does the gateway change anything for OpenAI's fine-tuning and batch APIs?

The inspection layer covers fine-tuning and batch endpoints with the same pattern. The fine-tuning request submission and the batch job submission both go through the gateway, which evaluates whether the caller is permitted to fine-tune on the data referenced in the request and whether the batch payload is permitted under the policy. The audit record captures the fine-tuning job identifier and the batch job identifier, which lets the auditor trace every job back to the request that submitted it. The inspection layer does not need to be on the data path of the long-running job, because the job runs on OpenAI's side and the inspection layer has already recorded the submission decision.

OpenAI API Gateway Patterns: How To Front api.openai.com with Inline Enforcement

With an OpenAI API gateway, the application addresses the inspection layer instead of api.openai.com. The application replaces the base URL on its OpenAI client and ships every chat completion, embedding, image generation, and audio request through the gateway. The gateway authenticates the caller, attaches identity context to the request, evaluates a policy bundle, commits a per-decision audit record, and forwards the request to OpenAI. The response returns through the same path, with the gateway recording the response handling decision and any modifications. The architecture works for the synchronous JSON endpoints and the streaming Server-Sent Events endpoints, and it covers function calling and tool use, which carry additional policy implications.

I want to walk through the request rewriting pattern, the streaming response handling, the function-calling evaluation, and the audit record format that holds up under EU AI Act Article 12 review.

The base URL replacement pattern

The OpenAI client libraries support a base_url override on construction. The application instantiates the client with the inspection layer's URL instead of https://api.openai.com/v1. Every subsequent call reaches the inspection layer, which performs the policy evaluation and forwards the request to OpenAI on the gateway's own OpenAI credential. The application never holds the OpenAI key directly. The inspection layer holds the OpenAI key in a controlled secret store and rotates it on a schedule.

The override is one line of code in the Python, Node, Go, and Java SDKs. The applications that ship the inspection layer in production usually wrap the SDK construction in a thin internal package so the override is centralized. The package reads the gateway URL from configuration and the caller-identity token from the application's existing identity primitive (usually a service account JWT or an OIDC token from the application's identity provider). The inspection layer verifies the token, extracts the identity context, and evaluates the policy.
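The thin internal package could be sketched as follows. This version uses only the standard library and builds the HTTP request without sending it; with the official Python SDK the equivalent is passing base_url (and the caller's token as api_key) to the OpenAI constructor. The environment variable names, the default gateway URL, and the choice to carry the identity token as a bearer credential are assumptions.

```python
import json
import os
import urllib.request


def gateway_chat_request(messages, model, env=os.environ):
    """Build a chat-completions request aimed at the gateway, not api.openai.com.

    LLM_GATEWAY_URL and CALLER_IDENTITY_TOKEN are hypothetical variable names.
    """
    base_url = env.get("LLM_GATEWAY_URL", "https://llm-gateway.internal.example/v1")
    token = env["CALLER_IDENTITY_TOKEN"]  # service-account JWT or OIDC token
    payload = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/chat/completions",
        data=payload,
        headers={
            # The caller presents its own identity; the gateway holds the OpenAI key.
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Centralizing construction this way means a gateway URL change is a configuration change, not a code change in every service.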

The pattern works for the official OpenAI SDKs, for the LangChain and LlamaIndex wrappers that take a base_url, and for any direct HTTP client that calls the OpenAI REST API. The pattern fails for SDKs that hard-code the OpenAI base URL, which is rare; the workaround is a per-application HTTP proxy that intercepts at the network layer instead of at the SDK layer.

Streaming response handling

The OpenAI Chat Completions endpoint supports a streaming mode that returns Server-Sent Events. Streaming is opt-in on the API (the request sets stream to true), but it is the usual choice for interactive production deployments because it reduces perceived latency. The inspection layer has to handle the stream without buffering the entire response, because buffering would defeat the latency benefit and would force the inspection layer to hold full responses in memory.

The implementation runs the stream through the inspection layer as a pass-through with two evaluation hooks. The first hook fires on the request body: identity, policy, data classification, and model authorization all evaluate before the request leaves the gateway. The second hook fires on each chunk of the response stream: a chunk-level filter that can rewrite or block the chunk if the policy outputs a filter rule. Most deployments evaluate the first hook on the full request and the second hook on the streamed completion, with the second hook running a fast regex or classifier over the streamed tokens.
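The second hook can be sketched as a generator that passes chunks through one at a time. Everything here is illustrative: the function names are hypothetical, the chunks are plain text rather than parsed SSE events, and the regex stands in for whatever fast filter or classifier the policy selects.

```python
import re


class StreamBlocked(Exception):
    """Raised when the chunk-level filter rejects a chunk."""


def inspect_stream(chunks, chunk_filter):
    """Yield each chunk after filtering, without buffering the whole response.

    chunk_filter returns the (possibly rewritten) chunk, or None to block.
    """
    for chunk in chunks:
        filtered = chunk_filter(chunk)
        if filtered is None:
            # Stop forwarding; the caller aborts the upstream connection.
            raise StreamBlocked("chunk rejected by policy")
        yield filtered


SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact_ssn(chunk):
    """Example rewrite rule: mask US SSN-shaped strings inside a single chunk."""
    return SSN_PATTERN.sub("[REDACTED]", chunk)
```

One limitation of per-chunk matching is that a pattern split across two chunks slips through; a filter that needs to catch those would hold back a small sliding window of recent text before forwarding.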

The audit record commits on stream close (success) or on stream abort (block, error). The record carries the request body fingerprint, the streamed response fingerprint, the policy version, the identity, the decision outcome, and the integrity signature. The auditor reading the record reconstructs what happened without having to replay the stream.
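Because the response is never buffered, the response fingerprint has to be computed incrementally. A minimal sketch, assuming SHA-256 as the fingerprint and a hypothetical StreamAudit helper; the integrity signature is left out of this fragment.

```python
import hashlib


class StreamAudit:
    """Accumulate a response fingerprint chunk by chunk, then commit once."""

    def __init__(self, request_body, policy_version, identity):
        self._response_hash = hashlib.sha256()
        self._record = {
            "request_fingerprint": hashlib.sha256(request_body).hexdigest(),
            "policy_version": policy_version,
            "identity": identity,
        }

    def observe(self, chunk):
        # Called for every forwarded chunk; only the running hash is kept.
        self._response_hash.update(chunk)

    def commit(self, outcome):
        """Call on stream close ("pass") or on abort ("block", "error")."""
        record = dict(self._record)
        record["response_fingerprint"] = self._response_hash.hexdigest()
        record["outcome"] = outcome
        return record
```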

Function calling and tool use

OpenAI's function calling and tool use produce a structured response from the model that the application is supposed to execute, with the result fed back into a follow-up model call. This loop is what produces "agentic" behavior in modern LLM applications. The inspection layer applies policy at three points in the loop.

The first is at the initial request, where the inspection layer evaluates whether this caller is permitted to invoke this set of tools through this model. The list of tools is a structured field on the request and the policy can match against it directly.
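A sketch of that first check, assuming the Chat Completions request shape in which each entry of tools has type "function" and a function.name. The allowlist and the helper names are hypothetical.

```python
def requested_tools(request_body):
    """Collect the function names declared in the request's `tools` field."""
    return {
        tool["function"]["name"]
        for tool in request_body.get("tools", [])
        if tool.get("type") == "function"
    }


def tools_permitted(request_body, allowed):
    """Return (ok, denied): ok is False if any declared tool is off the allowlist."""
    denied = requested_tools(request_body) - set(allowed)
    return (not denied, denied)
```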

The second is on the function-call response from the model. The model returns a structured object that describes which function to call and with what arguments. The inspection layer evaluates whether the proposed call is permitted. A request to call a delete_records function with arguments that reference a regulated record type can be blocked at the inspection layer before the application executes it. The audit record stamps the proposed call, the policy decision, and the outcome.
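The delete_records case might be evaluated as in the sketch below. It assumes the tool-call shape returned by Chat Completions, where function.arguments is a JSON-encoded string; the record_type argument and the set of regulated types are invented for the example.

```python
import json

# Hypothetical policy data: record types that must never be deleted via a tool call.
REGULATED_RECORD_TYPES = {"patient", "loan_application"}


def evaluate_tool_call(tool_call):
    """Return "block" or "pass" for one proposed tool call from the model."""
    function = tool_call["function"]
    try:
        arguments = json.loads(function["arguments"])
    except ValueError:
        # Arguments that cannot be parsed cannot be evaluated, so fail closed.
        return "block"
    if (
        function["name"] == "delete_records"
        and arguments.get("record_type") in REGULATED_RECORD_TYPES
    ):
        return "block"
    return "pass"
```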

The third is on the follow-up request that carries the function-call result back to the model. The inspection layer evaluates the result against the policy. A function-call result that contains PII or other regulated content can be redacted before it reaches the model, which closes the leak channel that the function-call loop opens by default.
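A minimal sketch of redacting tool results on the follow-up request. It assumes tool results arrive as messages with role "tool" and string content, and it uses a simple email regex as a stand-in for a real PII detector.

```python
import re

EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_tool_results(messages):
    """Return a copy of the message list with emails masked in tool results only."""
    cleaned = []
    for message in messages:
        if message.get("role") == "tool" and isinstance(message.get("content"), str):
            # Copy instead of mutating, so the original stays available for hashing.
            message = {**message, "content": EMAIL_PATTERN.sub("[REDACTED_EMAIL]", message["content"])}
        cleaned.append(message)
    return cleaned
```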

The audit record for an agentic loop is a series of per-decision records, one per model call, with a correlation identifier that ties the records together. The auditor can reconstruct the full loop from the record series.
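Reconstructing a loop from the record series is then a grouping step. The field names correlation_id and timestamp are assumptions about the record layout, and the timestamps are assumed to sort chronologically (ISO 8601 strings in UTC do).

```python
from collections import defaultdict


def reconstruct_loops(records):
    """Group per-decision records by correlation id, ordered by timestamp."""
    loops = defaultdict(list)
    for record in records:
        loops[record["correlation_id"]].append(record)
    for series in loops.values():
        series.sort(key=lambda record: record["timestamp"])
    return dict(loops)
```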

Identity attribution at the OpenAI request

OpenAI's API logs the call against the organization that owns the API key. A direct integration that uses a single organization-bound key produces logs that attribute every call to the organization, with no identity for the natural person whose action triggered the request. The inspection layer closes the gap.

The inspection layer extracts the identity context from the inbound request (the application's JWT, the OIDC token, the service account credential) and attaches it to the audit record. The OpenAI request itself goes out on the inspection layer's API key, which is fine for OpenAI; the inspection layer's audit record carries the identity that OpenAI's logs do not. When the regulator asks "who called this model with this prompt at this moment," the inspection layer answers from the audit record. OpenAI's own logs are not the answer, because OpenAI's logs cannot answer that question for any direct integration.
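To make the extraction step concrete, the sketch below verifies an HS256-signed JWT with the standard library and pulls out an identity context. This is a simplification: production identity providers typically issue asymmetrically signed tokens verified against published keys, and expiry and audience checks are omitted here. The tenant and roles claim names are hypothetical.

```python
import base64
import hashlib
import hmac
import json


def _b64url_decode(segment):
    # JWT segments drop base64 padding; restore it before decoding.
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def verify_hs256(token, secret):
    """Check the HMAC-SHA256 signature and return the claims, or raise ValueError."""
    header_b64, payload_b64, signature_b64 = token.split(".")
    header = json.loads(_b64url_decode(header_b64))
    if header.get("alg") != "HS256":
        raise ValueError("unexpected signing algorithm")
    signed = f"{header_b64}.{payload_b64}".encode("ascii")
    expected = hmac.new(secret, signed, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
        raise ValueError("signature mismatch")
    return json.loads(_b64url_decode(payload_b64))


def identity_context(claims):
    """Map token claims to the identity fields stamped on the audit record."""
    return {
        "user": claims["sub"],
        "tenant": claims.get("tenant"),
        "roles": claims.get("roles", []),
    }
```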

Audit record format

A per-decision audit record from an OpenAI gateway carries the following fields in production deployments: the request identifier and the correlation identifier for agentic loops; the identity context (user, tenant, role, group); the route identifier and the policy version; the model and version called (for example gpt-4-turbo-2024-04-09 or gpt-4o-2024-08-06); the request hash and the response hash; the data classification outcome (PII detected and redacted, source code detected and blocked, financial-decision content detected and routed to a different policy); the decision outcome (pass, block, modified); the token counts (prompt tokens, completion tokens); and the timestamp and the integrity signature.
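The field list above could be modeled along these lines. The attribute names are one possible mapping, and the HMAC over a canonical JSON serialization is a stand-in for whatever integrity signature scheme a deployment actually uses (an asymmetric signature lets auditors verify without holding the signing key).

```python
import hashlib
import hmac
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AuditRecord:
    request_id: str
    correlation_id: str
    identity: dict            # user, tenant, role, group
    route_id: str
    policy_version: str
    model: str                # e.g. "gpt-4o-2024-08-06"
    request_hash: str
    response_hash: str
    classification: str       # e.g. "pii_redacted"
    decision: str             # "pass", "block", or "modified"
    prompt_tokens: int
    completion_tokens: int
    timestamp: str


def sign(record, key):
    """HMAC-SHA256 over a canonical (sorted-key, compact) JSON form of the record."""
    canonical = json.dumps(asdict(record), sort_keys=True, separators=(",", ":"))
    return hmac.new(key, canonical.encode("utf-8"), hashlib.sha256).hexdigest()


def verify(record, signature, key):
    """Recompute the signature and compare in constant time."""
    return hmac.compare_digest(sign(record, key), signature)
```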

The format is designed to address the EU AI Act Article 12 requirement for traceability, the Article 26 deployer requirement for record-keeping, the Fannie Mae LL-2026-04 lender record requirement, and the NIST AI agent identity and authorization Pillar 3 action lineage requirement.

DeepInspect

This is the gateway pattern DeepInspect runs in production. DeepInspect sits inline in front of api.openai.com (and Anthropic, Bedrock, Azure OpenAI, Vertex, and self-hosted endpoints) and handles every request, including streaming and function calling. The base URL override on the OpenAI SDK is the only application-side change. The inspection layer verifies the caller's identity from the existing identity primitive, evaluates the policy bundle for the route, commits the per-decision audit record, and forwards the request to OpenAI on a controlled credential.

The audit record carries identity, policy version, model and version, request and response fingerprints, decision outcome, and a cryptographic integrity signature. The same record format covers every OpenAI request type (chat completions, embeddings, images, audio, fine-tuning calls). The OpenAI organization-bound logs and the inspection layer's per-decision records compose a defensible record series for any EU AI Act or Fannie Mae review.

If you are running a direct integration to api.openai.com and the security review is asking for identity attribution at the model API call, let's talk.