Does the gateway change how I call the OpenAI SDK?

The gateway changes the baseURL in the SDK configuration. Every other line of application code remains unchanged. The SDK calls land on the gateway URL instead of https://api.openai.com/v1. The gateway proxies to OpenAI on the application's behalf.

What happens if the gateway is down?

Behavior on gateway failure is policy-driven. The default is fail-closed: the gateway returns an error and the application surfaces the error to the user. The fail-closed posture matches the EU AI Act Article 12 expectation that records exist for every decision. Fail-open postures are configurable for non-regulated traffic.

Can the gateway inspect streaming responses?

Streaming responses are inspected as they arrive. The gateway can buffer chunks long enough to run classification on each segment and either pass the chunk through or block it. The buffering adds milliseconds of latency on the streamed token level. The user-visible latency for the first token remains close to the unbuffered case.

How does the gateway handle OpenAI's tool calling and function calling?

Tool calls are part of the request body and the response. The gateway inspects the tool definitions in the request and the tool invocations in the response. Policy can permit or block specific tool invocations based on the tool name and the arguments. The decision record captures the tool calls alongside the prompt and the response content.

Does using a gateway affect my OpenAI rate limits?

Rate limits remain at the OpenAI team-key level. The gateway shares the team's rate budget across all callers it serves. The gateway can implement its own per-user, per-application, or per-route rate limits in addition, which gives finer-grained control than the OpenAI key-level limit. The cost reporting at the OpenAI level remains tied to the team key.

The OpenAI API Gateway: Where the Inspection Point Sits Between Your Workforce and api.openai.com

An OpenAI API gateway is the inspection point HTTP traffic to api.openai.com passes through before it reaches the model. The gateway attaches identity context, runs prompt-level classification, evaluates the policy in effect at the moment of decision, and produces a per-decision audit record. The architecture sits between authenticated users or agents and the OpenAI endpoints they call: chat completions, responses, embeddings, audio, the assistants API, and the batch API. The deployer routes the application's baseURL from https://api.openai.com/v1 to the gateway URL. Every other line of application code stays the same.

I want to walk through what the gateway intercepts, how the OpenAI API surfaces map to the inspection points, and the trade-offs across SaaS-hosted, VPC-isolated, and sidecar deployments.

What the gateway intercepts

The gateway intercepts the HTTP request body, the headers, and the response stream. Each of those carries data the inspection point evaluates.

The request body contains the user message, the system prompt, the tool definitions (for function calling), and any uploaded file references. Prompt-level classification runs against the user-message content and the system-prompt content. The classifier surfaces categories the policy reads, e.g. pii.email, pii.npi, phi, source-code, secrets, customer.identifier.

The headers carry the identity context the application supplies. The gateway expects an identity-bearing header set by the application's identity middleware (typically a signed JWT in Authorization or a custom X-Identity-Context header). The identity context is attached to the request record and used as input to the policy decision.

The response stream is the model's output. The gateway runs response-side classification against it before forwarding to the caller. The same category set the prompt classifier surfaces is available to the response classifier. The policy can permit, redact, or block on the response side independently of the request-side decision.

How OpenAI's API surfaces map to inspection points

OpenAI's API has been refactored several times. The 2026 surface includes three categories of endpoints the gateway inspects.

The first category is the stateless completion endpoints: /v1/chat/completions, /v1/responses, /v1/embeddings, /v1/moderations. Each request includes the full prompt context, so the gateway inspects the request body in one pass.

The second category is the audio and vision endpoints: /v1/audio/transcriptions, /v1/audio/translations, /v1/audio/speech, /v1/images/generations. Audio and image bytes arrive as multipart uploads or base64 payloads. The gateway extracts the bytes, optionally runs a transcription step for audio, and runs the same classification pipeline against the extracted content.

The third category is the stateful endpoints: the assistants API, the threads API, the responses API with stored conversation state, and the batch API. State complicates the inspection point because the model's behavior depends on context the server holds. The gateway treats the server-held context as an upstream dependency: each request still flows through the inspection point, and the per-decision record links to the conversation thread the request belongs to.

How identity context attaches at the gateway

The OpenAI API uses static API keys for authentication. The keys identify the application or the team that holds the key. The keys do not identify the natural person or agent on whose behalf the application is calling the model.

The gateway closes that gap by treating the OpenAI API key as a downstream credential the gateway holds, not a credential the application passes through. The application authenticates to the gateway with the corporate identity context (SSO session, agent identity, role). The gateway authenticates to OpenAI with the team's API key. The decision record carries the corporate identity. The OpenAI billing reflects the team-level key.

The split satisfies NIST AI agent identity Pillar 1 (verified identity travels with the request) and Pillar 2 (delegated authority is evaluated per request). The model-side billing remains team-scoped, which keeps the OpenAI account structure unchanged.

What policy looks like at the OpenAI gateway

A representative policy artifact at the gateway:

The policy is read at request time. Every decision attaches the policy version to the record. The change-management process the deployer already runs for security policy applies to AI policy the same way.

Deployment trade-offs

Three deployment patterns cover the typical OpenAI gateway shapes.

SaaS-hosted proxy. The gateway runs as a managed service. The deployer points application traffic at the gateway's URL. The integration cost is one configuration change. The deployer accepts that prompt content traverses the gateway provider's infrastructure under that provider's contractual data-handling terms.

VPC-isolated proxy. The gateway runs as a container deployment in the deployer's VPC. The deployer holds custody of the prompt content end-to-end. Integration cost is higher: image deployment, network ingress, secret rotation for the OpenAI key. The pattern fits regulated environments where data residency or vendor-data-handling clauses prohibit transit through a third-party SaaS.

Sidecar proxy. The gateway runs alongside each application container. The latency is the lowest of the three patterns because the sidecar avoids the network round-trip to a centralized proxy. The operational complexity is the highest because each application's deployment pipeline changes. The pattern fits engineering-heavy environments where the application teams own the deployment topology.

The deployer typically starts with the SaaS-hosted shape for evaluation, moves regulated traffic to the VPC-isolated shape, and reserves the sidecar shape for latency-sensitive applications.

Performance characteristics in production

The inspection step adds work between the application and the model. The work runs against the prompt content, the classifier inference (often a small model running on commodity hardware), the policy evaluation (microsecond-scale), and the audit record write.

End-to-end enforcement overhead measures under 50 ms in production tests on internal DeepInspect benchmarks. The OpenAI model inference itself runs in the 500 ms to 5 second range depending on the model and the prompt length. The inspection overhead is invisible against the model's response time.

The dominant cost of the inspection point at scale is the audit record write path, not the inspection itself. The write path should target an append-only store with synchronous acknowledgement before the response returns to the caller, so the integrity property survives a crash of the gateway process.

DeepInspect

This is the OpenAI API gateway DeepInspect was built to provide. DeepInspect sits inline between authenticated users or agents and any OpenAI endpoint. Every chat completion, response, embedding, audio, image, batch, or assistants call passes through one inspection point. Identity is attached at the request layer. Prompt-level classification runs against the request body. Policy is evaluated per request. The per-decision audit record gets committed before the response returns to the application.

The architecture is model-agnostic, so the same inspection point sits in front of Anthropic, Azure OpenAI, Bedrock, Vertex, and self-hosted endpoints with the same policy and audit semantics. OpenAI is one entry point among several. The policy version, the audit record format, and the inspection mechanism remain consistent across them.

If you are evaluating an OpenAI API gateway for a regulated deployment, book a demo today.