What is an acceptable p95 latency for an AI gateway?

Under 50 ms for the gateway's own overhead is the number most architecture teams defend against a 500 ms to 5 second LLM inference baseline. At 50 ms the gateway adds at most 10% to the fastest inference call and 1% to the slowest, which keeps the gateway, rather than the model provider, from becoming the main contributor to tail latency.

How does classification latency scale with prompt size?

Regex classification scales linearly with prompt size but stays in single-digit milliseconds for prompts under 8,000 characters. Embedding classification adds a fixed overhead for the embedding computation and a small scan cost. Model-based classification runs its own inference and scales the same way LLM inference does, which is why it fails as an inline step.

What storage backend fits an inline audit-record write?

A write-ahead log with a background flush to durable storage fits the p95 budget. The append to the log runs sub-millisecond on the hot path. The durability guarantee moves to the background job that flushes to durable storage on a schedule. A hash-chained log preserves the tamper-evident property across the background flush.

How does the burst profile change the measurement?

The burst profile exposes queue buildup at each step. Identity resolution's role store queue, the classifier's embedding cache, the policy engine's rule compilation, and the audit backend's fsync queue all show up in the tail when the arrival rate exceeds the service rate. The burst profile is where measuring p99 pays off, since queueing delay tends to reach the tail before it moves the median.

How does the cold-start profile matter for the benchmark?

Container restart events reset the JWKS cache, the role store cache, the classifier's embedding cache, and the policy engine's compiled rule cache. The cold-start profile measures how the gateway performs on the first few thousand requests after restart. The number matters at deployment time, at auto-scaling events, and after infrastructure incidents.

How do the numbers compare to Kong or an API gateway?

An API gateway that only performs routing and authorization runs in single-digit milliseconds. An AI gateway adds classification and an audit-record write on top of that, and those are the extra steps this benchmark measures. The AI gateway is a specialized product, so its latency against a routing-only gateway is not an apples-to-apples comparison and not the number the architecture team defends. The relevant comparison is against the LLM inference baseline the gateway sits in front of.

AI Gateway Latency Benchmark 2026: How to Measure the p95 Overhead of Every Enforcement Step

Google Mandiant's M-Trends 2026 report, based on 500,000+ hours of frontline incident response, found that the median time between initial access and handoff to a secondary threat group collapsed from over 8 hours in 2022 to 22 seconds in 2025. Inline enforcement of AI traffic has to run at a speed that keeps up with that baseline. Architecture teams argue latency budgets at review and then ship gateways that get measured only in synthetic tests. I want to walk through a benchmark methodology that separates every enforcement step, so a team can defend a p95 budget against the 500 ms to 5 second baseline of LLM inference.

The five enforcement steps to measure separately

An AI gateway that sits between authenticated users or agents and an LLM API touches five discrete steps on every request: the connection accept, the identity resolution, the prompt classification, the policy evaluation, and the audit-record write. Each step has a different failure mode and a different optimization target. A benchmark that reports a single "gateway overhead" number obscures the steps that dominate the tail.

Connection accept covers the TLS handshake with the caller and the connection reuse with the model provider. Identity resolution covers the parse of the authorization token and the lookup of the caller's role. Classification covers the inspection of the prompt for data patterns the policy cares about. Policy evaluation covers the decision function that returns pass, block, or transform. Audit-record write covers the append to the tamper-evident log.

The measurement rig

The measurement rig has to run on the same infrastructure the gateway uses in production. Synthetic tests on a developer laptop measure the gateway's code path but not the network topology or the classifier's cold-start behavior.

The rig runs a workload generator that produces requests at the target rate, a gateway under test, and a mock LLM endpoint that returns a synthetic response with a configurable delay. The mock endpoint isolates the gateway's overhead from the model provider's variance. The generator captures the timestamp before the request enters the gateway and after the response returns to the caller. The gateway captures the per-step timestamps and writes them to a side-channel.
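The per-step capture described above can be sketched as follows. This is a minimal illustration rather than any particular gateway's instrumentation: `StepTimer`, `STEPS`, and `handle_request` are hypothetical names, and the step bodies are placeholders where the real work would go.

```python
import time
from contextlib import contextmanager

# Step names follow the five enforcement steps in the article.
STEPS = (
    "connection_accept",
    "identity_resolution",
    "classification",
    "policy_evaluation",
    "audit_write",
)


class StepTimer:
    """Collects per-step durations for a single request (hypothetical helper)."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock          # injectable so tests can use a fake clock
        self.durations_ms = {}

    @contextmanager
    def step(self, name):
        start = self._clock()
        try:
            yield
        finally:
            elapsed_ms = (self._clock() - start) * 1000.0
            self.durations_ms[name] = self.durations_ms.get(name, 0.0) + elapsed_ms


def handle_request(timer):
    """Runs the five steps in order and returns the per-step timings."""
    for name in STEPS:
        with timer.step(name):
            pass  # placeholder: the real step would run here
    # In a real rig this dict would be written to the side-channel.
    return dict(timer.durations_ms)
```

Using a monotonic clock such as `time.perf_counter` avoids wall-clock adjustments leaking into the measurements.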

The rig runs at three load profiles: the steady-state profile at the production p95 request rate, the burst profile at 5x the steady-state rate for 30 seconds, and the cold-start profile after a full container restart.
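As a sketch, the three profiles can be written down as data so the workload generator and the report share one definition. The 5x multiplier and the 30-second burst come from the text. The steady-state and cold-start durations, the `LoadProfile` type, and the single-server backlog estimate are assumptions added for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoadProfile:
    name: str
    rate_rps: float              # target arrival rate, requests per second
    duration_s: float
    restart_first: bool = False  # restart the gateway container before the run


def build_profiles(steady_rps, steady_duration_s=300.0, cold_duration_s=60.0):
    """steady_duration_s and cold_duration_s are assumed values, not from the article."""
    return [
        LoadProfile("steady_state", steady_rps, steady_duration_s),
        LoadProfile("burst", 5 * steady_rps, 30.0),
        LoadProfile("cold_start", steady_rps, cold_duration_s, restart_first=True),
    ]


def backlog_at_end(profile, service_rps):
    """Requests still queued when the profile ends, assuming one fixed-rate server."""
    return max(0.0, (profile.rate_rps - service_rps) * profile.duration_s)
```

For example, with a hypothetical 200 rps steady state and a step that can serve 600 rps, the burst profile leaves `backlog_at_end(burst, 600)` at 12,000 queued requests, which is the kind of buildup that lands in the tail.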

The identity resolution step

Identity resolution reads the authorization token, verifies its signature against the identity provider's public keys, and looks up the caller's role from the role store.

The signature verification is a CPU-bound step. The cost depends on the algorithm (RS256 is heavier than HS256) and the presence of a JWKS cache. A cold cache adds an HTTP round trip to the identity provider. A warm cache adds an in-memory hash map lookup.
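The cold versus warm distinction can be illustrated with a small time-to-live cache. This is a sketch under assumptions: `JwksCache` is a hypothetical class, `fetch_keys` stands in for the HTTP call to the identity provider, and the one-hour TTL is an arbitrary default.

```python
import time


class JwksCache:
    """Caches signing keys by key ID; refetches on expiry or on an unknown key ID."""

    def __init__(self, fetch_keys, ttl_s=3600.0, clock=time.monotonic):
        self._fetch_keys = fetch_keys        # callable returning {kid: public_key}
        self._ttl_s = ttl_s
        self._clock = clock
        self._keys = {}
        self._expires_at = float("-inf")     # forces a fetch on first use (cold cache)
        self.fetch_count = 0                 # exposed so a benchmark can count round trips

    def get_key(self, kid):
        now = self._clock()
        if now >= self._expires_at or kid not in self._keys:
            # Cold path: one round trip to the identity provider.
            # A production cache would also rate-limit refetches for unknown key IDs.
            self._keys = dict(self._fetch_keys())
            self._expires_at = now + self._ttl_s
            self.fetch_count += 1
        # Warm path: a dictionary lookup.
        return self._keys.get(kid)
```

Counting fetches alongside the latency samples makes it easy to confirm which requests in a run actually paid the cold-cache cost.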

The role lookup adds a database or cache read. The role store returns the caller's group memberships, the roles the groups map to, and the policies the roles carry. A role store that resolves in-process against a snapshot adds sub-millisecond overhead. A role store that reads from Redis adds the network round trip. A role store that reads from Postgres adds the network round trip plus query planning and execution.

For a benchmark to be useful, the identity resolution step has to be measured with a cold JWKS cache and again with a warm cache, and with each role store option, so the architecture team can pick the option that matches its p95 budget.

The classification step

Classification runs the prompt through a set of pattern matchers, embedding models, or rule engines. The step's overhead scales with the prompt size and the classifier's complexity.

A regex classifier that matches on 200 patterns runs in single-digit milliseconds on a 2,000-character prompt. An embedding classifier that computes a vector and queries a nearest-neighbor index adds tens of milliseconds. A model-based classifier that runs its own inference adds hundreds of milliseconds and defeats the purpose of an inline gateway.
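A regex tier like the one described can be sketched and timed in a few lines. The patterns below are illustrative assumptions (two realistic ones plus 198 synthetic keyword patterns to reach 200), and the timings will vary with hardware, so no expected numbers are shown.

```python
import re
import time


def build_classifier(patterns):
    """Compiles every pattern once, then returns a function that labels a prompt."""
    compiled = [(label, re.compile(rx)) for label, rx in patterns]

    def classify(prompt):
        return [label for label, rx in compiled if rx.search(prompt)]

    return classify


# Hypothetical pattern set: 2 data patterns + 198 synthetic keywords = 200.
PATTERNS = [
    ("ssn", r"\b\d{3}-\d{2}-\d{4}\b"),
    ("email", r"[\w.+-]+@[\w-]+\.[\w.-]+"),
]
PATTERNS += [(f"keyword_{i}", rf"\bsecret{i}\b") for i in range(198)]

classify = build_classifier(PATTERNS)


def time_classifier(prompt, runs=50):
    """Returns one latency sample in milliseconds per run."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        classify(prompt)
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples
```

Calling `time_classifier("x" * 2000)` gives a sample set for a 2,000-character prompt. Repeating it at each prompt size the workload exercises shows how the regex tier scales.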

The benchmark reports the classification step's p50, p95, and p99 latencies at each classifier tier and at the prompt sizes the workload exercises. The tail is the number the architecture team defends. A p95 that looks acceptable can hide a p99 that misses the budget by an order of magnitude.
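A small worked example shows how a p95 can hide the tail. The sketch uses the nearest-rank percentile definition, and the sample values are made up for illustration.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("percentile of an empty sample set is undefined")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Illustrative data: 96 fast requests at 4 ms and 4 slow ones at 120 ms.
samples_ms = [4.0] * 96 + [120.0] * 4

p50 = percentile(samples_ms, 50)   # 4.0
p95 = percentile(samples_ms, 95)   # 4.0, still looks healthy
p99 = percentile(samples_ms, 99)   # 120.0, well past a 50 ms budget
```

Here only 4% of requests are slow, so p95 reports 4 ms while p99 reports 120 ms. Reporting all three percentiles per classifier tier is what surfaces this.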

The policy evaluation step

Policy evaluation runs the decision function against the resolved identity, the classified prompt, and the current policy state. The step's overhead depends on the policy language and the policy store.

A policy engine that compiles the policy to a bytecode and evaluates in-process against a snapshot runs in sub-millisecond time. A policy engine that queries a remote policy decision point adds the network round trip. A policy engine that recompiles the policy on every request adds the compilation cost.

The benchmark measures the policy evaluation step under three policy sizes: 10 rules, 100 rules, and 1,000 rules. Enterprise deployments accumulate rules across identity types, data classifications, and regulatory obligations. A policy engine that scales linearly with rule count runs out of budget before the deployment reaches production.
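One way to keep evaluation cost roughly independent of rule count is to index the rules once at load time and evaluate against that snapshot. The sketch below uses a plain lookup table in place of real bytecode, and the `Rule` schema keyed on role and data label is a hypothetical simplification, not a real policy language.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    role: str       # caller role the rule applies to
    label: str      # data classification label that triggers the rule
    decision: str   # "block" or "transform"


def compile_policy(rules):
    """Builds the lookup table once, at policy load time, and returns an evaluator."""
    table = {}
    for rule in rules:
        # First rule for a (role, label) pair wins in this sketch.
        table.setdefault((rule.role, rule.label), rule.decision)

    def evaluate(role, labels):
        decisions = {
            table[(role, label)] for label in labels if (role, label) in table
        }
        if "block" in decisions:
            return "block"
        if "transform" in decisions:
            return "transform"
        return "pass"

    return evaluate
```

With this shape, evaluation cost depends on the number of labels found in the prompt rather than on the number of rules, so running it at 10, 100, and 1,000 rules should show a flat curve. An engine that scans every rule per request would show the linear growth the benchmark is meant to catch.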

The audit-record write step

The audit-record write appends the per-decision record to the tamper-evident log. The step's overhead depends on the storage backend and the write pattern.

A synchronous append to durable storage (fsync) adds tens of milliseconds. An asynchronous append to a write-ahead log with a background flush adds sub-millisecond time on the hot path but pushes the durability guarantee to the background job. A batched append adds queuing latency but reduces the per-request cost.
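The asynchronous pattern can be sketched with a buffered file and a background thread. `AsyncAuditLog`, the 50 ms flush interval, and the `demo` helper are assumptions for illustration. Records appended since the last flush can be lost on a crash, which is the durability trade-off described above.

```python
import os
import tempfile
import threading


class AsyncAuditLog:
    """Append-only log: buffered writes on the hot path, fsync in the background."""

    def __init__(self, path, flush_interval_s=0.05):
        self._file = open(path, "ab")
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._interval_s = flush_interval_s
        self._thread = threading.Thread(target=self._flush_loop, daemon=True)
        self._thread.start()

    def append(self, record: bytes) -> None:
        # Hot path: copy into the file buffer only, no fsync here.
        with self._lock:
            self._file.write(record + b"\n")

    def _flush(self) -> None:
        with self._lock:
            self._file.flush()           # hand buffered bytes to the OS
        os.fsync(self._file.fileno())    # durability point, outside the lock

    def _flush_loop(self) -> None:
        # Event.wait returns False on timeout, True once close() sets the event.
        while not self._stop.wait(self._interval_s):
            self._flush()

    def close(self) -> None:
        self._stop.set()
        self._thread.join()
        self._flush()                    # final flush so nothing buffered is lost
        self._file.close()


def demo():
    """Writes three records to a temporary log and reads them back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "audit.log")
        log = AsyncAuditLog(path)
        for i in range(3):
            log.append(f"decision-{i}".encode())
        log.close()
        with open(path, "rb") as fh:
            return fh.read().splitlines()
```

Running the fsync outside the lock keeps a slow disk from blocking callers of `append`, which is the point of moving durability off the hot path.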

The tamper-evident property adds a cryptographic operation to each write. A hash-chained log recomputes the tail hash on each append. A signed log signs the record with the log's private key. Both operations run in microseconds on modern hardware, so the cryptography is not the tail. The tail is the storage backend's fsync behavior under load.
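A hash chain of this kind is short enough to sketch in full. The record layout, the all-zero genesis value, and the choice of SHA-256 are assumptions for illustration, not a description of any specific product's log format.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for an empty chain


def _digest(prev_hash, body):
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_record(chain, payload):
    """Appends a record whose hash covers the previous record's hash."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    body = json.dumps(payload, sort_keys=True)   # stable serialization
    entry = {"prev": prev_hash, "body": body, "hash": _digest(prev_hash, body)}
    chain.append(entry)
    return entry["hash"]


def verify_chain(chain):
    """Recomputes every hash from the start; any edited record breaks the chain."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["body"]):
            return False
        prev_hash = entry["hash"]
    return True
```

Each append costs one SHA-256 over a small record, which is consistent with the point above that the cryptography is not where the tail comes from.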

The benchmark reports the audit-record write under sustained load, so the storage backend's queue depth and fsync latency show up in the tail.

The p95 budget worth defending

An inline gateway serving LLM traffic whose inference takes 500 ms to 5 seconds has a working budget of 50 ms end-to-end. Staying at or under 10% of the fastest inference call is the working threshold most architecture teams accept. Above that, the gateway rather than the LLM provider becomes the main contributor to tail latency, and the operators start bypass conversations.

A 50 ms budget breaks down roughly to 5 ms for connection, 5 ms for identity resolution with a warm cache, 10 ms for regex classification on typical prompts, 5 ms for policy evaluation on a 100-rule policy, and 25 ms of headroom for the audit-record write and the network path to the model provider.
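The breakdown above can be kept as data and checked against measured numbers. The allocations are the ones in the text. The step keys, the helper names, and the measured values in the example are hypothetical.

```python
BUDGET_MS = 50.0

# Per-step allocation from the article, in milliseconds.
ALLOCATION_MS = {
    "connection_accept": 5.0,
    "identity_resolution_warm": 5.0,
    "regex_classification": 10.0,
    "policy_evaluation_100_rules": 5.0,
    "audit_write_and_network_headroom": 25.0,
}


def overhead_share(budget_ms, inference_ms):
    """Gateway budget as a fraction of one inference call."""
    return budget_ms / inference_ms


def over_allocation(measured_p95_ms):
    """Returns the steps whose measured p95 exceeds their allocation, with the excess."""
    return {
        step: measured_p95_ms[step] - limit
        for step, limit in ALLOCATION_MS.items()
        if measured_p95_ms.get(step, 0.0) > limit
    }


# Hypothetical measurements: classification runs 4 ms over its allocation.
example = over_allocation({"connection_accept": 3.0, "regex_classification": 14.0})
```

The allocations sum to the 50 ms budget, and `overhead_share` gives 10% against a 500 ms call and 1% against a 5 second call. One caution: per-step p95 values do not add up to the end-to-end p95, so the end-to-end distribution still has to be measured directly.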

The benchmark reports the actual distribution, and the team defends its budget against measured numbers rather than intuition. In DeepInspect's production tests, the end-to-end enforcement overhead measures under 50 ms, which is the number the architecture team agreed to defend against the LLM inference baseline.

DeepInspect

This is exactly what DeepInspect measures at its own p95. DeepInspect sits inline between users or agents and the LLM APIs they call. For every request and response, it evaluates identity, data classification, model authorization, and organizational policy, and makes a pass or block decision before the traffic reaches the model. The end-to-end enforcement overhead measures under 50 ms in production tests, against a 500 ms to 5 second LLM inference baseline.

The gateway publishes its per-step timings so an architecture team can defend the budget with actual numbers rather than architecture-review estimates. The classifier is regex plus tenant-scoped embeddings. The policy engine compiles rules to bytecode and evaluates in-process. The audit-record write appends to a hash-chained log with a background flush.

Book a technical deep dive at deepinspect.ai.