Why does p95 matter more than p50 for an AI gateway?

The p50 is the experience of the median request. The p95 is the latency that the slowest 5% of requests exceed, which translates to one in 20 user actions. For a high-throughput application, the p95 shapes the user-perceived latency more than the p50 does. The p99 captures the deeper tail (the slowest one in 100 requests) and is the threshold most operations teams set SLOs against.

How does upstream LLM provider latency interact with gateway latency?

Upstream LLM provider latency dominates the total request time. A typical LLM call takes 500ms to 5 seconds depending on the prompt size and the model. The gateway's contribution sits on top of that base latency. For policy enforcement to be unnoticed by the user, the gateway overhead has to stay well below the variance of the LLM provider itself.

What about throughput per dollar?

Throughput per dollar varies with the deployment model (per-node licensing, per-request pricing, infrastructure cost). The benchmark should report throughput per node along with the resource consumption (CPU, memory, network). Customers can then map the numbers to their own cost model.
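As a rough sketch of that mapping, the arithmetic below converts a measured per-node throughput into requests per dollar. The function name and the example inputs (500 requests per second, $0.40 per node-hour) are illustrative assumptions, not measured values, and the calculation covers infrastructure cost only.

```python
def requests_per_dollar(rps_per_node: float, node_cost_per_hour: float) -> float:
    """Requests one node can serve per dollar of hourly node cost."""
    if node_cost_per_hour <= 0:
        raise ValueError("node_cost_per_hour must be positive")
    return rps_per_node * 3600 / node_cost_per_hour


# Hypothetical inputs: 500 req/s sustained on a node costing $0.40 per hour.
print(f"{requests_per_dollar(500, 0.40):,.0f} requests per dollar")
```

Per-node licensing or per-request pricing would enter as additional cost terms in the denominator.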

How do you benchmark fail-closed behavior?

Fail-closed behavior is the gateway's response when policy evaluation cannot complete (classifier timeout, identity lookup failure, policy store unavailable). The benchmark induces these conditions and measures the gateway's response: deny with audit, deny with retry, or fail open. The right behavior for regulated environments is deny with audit.
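One way a test harness might label what it observes under an induced fault is sketched below. The `Observation` fields and the outcome labels are hypothetical; a real harness would fill them from the gateway's response, its audit log, and a mock upstream provider. The extra `deny_unaudited` label is an assumption added here for denials that leave no audit trail.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Observation:
    """What the harness saw for one request sent while a fault was induced."""
    fault: str                # e.g. "classifier_timeout" (hypothetical label)
    forwarded_upstream: bool  # did the prompt reach the LLM provider anyway?
    audit_logged: bool        # did an audit record appear for the decision?
    retry_hinted: bool        # did the denial invite the client to retry?


def classify(obs: Observation) -> str:
    """Map one observation to an outcome label."""
    if obs.forwarded_upstream:
        return "fail_open"
    if obs.audit_logged:
        return "deny_with_audit"
    if obs.retry_hinted:
        return "deny_with_retry"
    return "deny_unaudited"


def summarize(observations):
    """Count outcomes per induced fault."""
    return Counter((o.fault, classify(o)) for o in observations)


# Hypothetical observations from two induced fault conditions.
seen = [
    Observation("classifier_timeout", False, True, False),
    Observation("classifier_timeout", False, True, False),
    Observation("policy_store_down", True, False, False),
]
for (fault, outcome), count in sorted(summarize(seen).items()):
    print(fault, outcome, count)
```

Any `fail_open` count above zero is the result a regulated environment would treat as a failed benchmark.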

What changes for agentic AI workloads?

Agentic workloads issue many LLM calls per user action, often in burst patterns. The benchmark for agentic workloads should measure latency under burst concurrency, not just sustained load. The throughput ceiling under bursts is often lower than the sustained ceiling because the queue depth grows faster than the system can drain it.
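The queue-depth effect can be illustrated with a toy discrete-time model. This is a minimal sketch under simplifying assumptions (a fixed drain capacity per tick, no timeouts, no load shedding), and the traffic figures are made up for illustration.

```python
def max_queue_depth(arrivals_per_tick, capacity_per_tick):
    """Peak backlog when up to capacity_per_tick requests are drained each tick."""
    depth = peak = 0
    for arrivals in arrivals_per_tick:
        depth = max(0, depth + arrivals - capacity_per_tick)
        peak = max(peak, depth)
    return peak


sustained = [80] * 10            # 80 requests every tick
bursty = [400, 0, 0, 0, 0] * 2   # same average (80 per tick), delivered in bursts

print(max_queue_depth(sustained, 100))  # 0: capacity exceeds arrivals on every tick
print(max_queue_depth(bursty, 100))     # 300: each burst leaves a backlog to drain
```

Both traffic shapes average 80 requests per tick against a capacity of 100, yet only the bursty shape builds a queue, and every queued request pays that wait as added latency.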

AI Gateway Performance Benchmark: What to Measure and How

AI gateway performance benchmarks get cited in vendor comparisons and procurement documents. Most of the numbers travel without methodology. The headline figure that a vendor cites on a marketing page is rarely the figure that matters in production. The figures that matter are p95 and p99 latency under realistic concurrency, tail behavior when policy evaluation gets expensive, throughput ceiling per node, and behavior under upstream provider degradation.

I want to walk through the benchmark methodology that produces production-actionable numbers, the comparison points worth tracking, and the common mistakes that produce benchmarks the operations team should not trust.

What to measure

A production-actionable benchmark covers four properties of an AI gateway under realistic load: latency distribution, throughput, behavior under load, and fail-mode behavior.

Latency distribution

p50, p95, and p99 latency under varying concurrency. The single-number headline (the median or p50) hides the tail. The p95 and p99 are what shape user-facing experience: a slow request is rare for any single call but frequent across a session of many calls.

Measurement is a histogram of request-completion times under sustained load, broken down by the policy complexity of the request. Simple identity-only policy is one bucket. Identity-plus-classification is a second bucket. Identity-plus-classification-plus-cross-reference-data-call is a third.
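A minimal sketch of the per-bucket breakdown follows. It uses the nearest-rank percentile definition, which is one common convention among several, and the bucket names and latency samples are invented for illustration.

```python
import math
from collections import defaultdict


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of data at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def tail_by_bucket(records):
    """records: iterable of (policy_bucket, latency_ms) pairs."""
    buckets = defaultdict(list)
    for bucket, latency_ms in records:
        buckets[bucket].append(latency_ms)
    return {
        bucket: {f"p{p}": percentile(values, p) for p in (50, 95, 99)}
        for bucket, values in buckets.items()
    }


# Hypothetical samples: 18 fast and 2 slow requests per bucket.
records = (
    [("identity_only", 4)] * 18 + [("identity_only", 25)] * 2
    + [("identity_plus_classification", 30)] * 18
    + [("identity_plus_classification", 120)] * 2
)
for bucket, stats in tail_by_bucket(records).items():
    print(bucket, stats)
```

With these samples the medians are 4 ms and 30 ms, while the p95 values are 25 ms and 120 ms, which is the gap a single headline number hides.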

Throughput ceiling

Requests per second per node under sustained load. The throughput ceiling is the highest load at which p95 latency stays within an acceptable threshold. Most gateways show a knee in the latency curve: linear under light load, increasingly non-linear as the queue depth grows, and a hockey stick once CPU or memory pressure hits the cap.
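Given a load sweep, the ceiling can be read off as the highest tested load before p95 first crosses the threshold. The sketch below assumes the sweep is a list of `(requests_per_second, p95_ms)` pairs; the numbers are hypothetical.

```python
def throughput_ceiling(sweep, p95_limit_ms):
    """Highest tested load whose p95 stays within the limit.

    Stops at the first violation, so a lucky pass past the knee is ignored.
    Returns None if even the lowest load exceeds the limit.
    """
    ceiling = None
    for rps, p95_ms in sorted(sweep):
        if p95_ms > p95_limit_ms:
            break
        ceiling = rps
    return ceiling


# Hypothetical sweep: (requests per second, measured p95 in ms).
sweep = [(100, 12), (200, 14), (400, 22), (800, 95), (1000, 310)]
print(throughput_ceiling(sweep, p95_limit_ms=100))  # 800
```

The resolution of the answer is limited by the step size of the sweep, so a finer sweep around the knee gives a tighter ceiling.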

Behavior under load

What happens at and past the throughput ceiling. The gateway can shed load via 429 responses, queue requests until they time out, or fail closed by denying requests. The right answer depends on the deployment posture. For regulated environments, fail-closed denial with a specific status code is usually preferred over a silent timeout.

Fail-mode behavior

What happens when the upstream LLM provider degrades. Provider 5xx responses, slow responses, rate-limit responses, and outright outages. The gateway should isolate the upstream failure from the policy decision path and surface a useful error rather than masking it.

Methodology

The benchmark methodology determines the trustworthiness of the numbers. Several methodology choices have outsized effect on the results.

Realistic prompt size distribution

LLM prompts vary in size from a few hundred tokens to tens of thousands. A benchmark that runs all prompts at 100 tokens produces low-latency numbers that do not match production. The benchmark should sample from a realistic distribution of prompt sizes, typically log-normal with a long tail.
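A minimal sketch of such a sampler, using the standard library's `random.lognormvariate`, is shown below. The median of 800 tokens, the sigma of 1.0, and the clamping bounds are placeholder assumptions; in practice they would be fitted to the prompt sizes seen in production logs.

```python
import math
import random


def sample_prompt_tokens(n, median_tokens=800, sigma=1.0,
                         min_tokens=50, max_tokens=32000, seed=0):
    """Draw n prompt sizes (in tokens) from a clamped log-normal distribution."""
    rng = random.Random(seed)      # seeded so benchmark runs are repeatable
    mu = math.log(median_tokens)   # the median of a log-normal is exp(mu)
    sizes = []
    for _ in range(n):
        tokens = round(rng.lognormvariate(mu, sigma))
        sizes.append(min(max(tokens, min_tokens), max_tokens))
    return sizes


sizes = sample_prompt_tokens(5000)
ordered = sorted(sizes)
print("median:", ordered[len(ordered) // 2])
print("mean:", round(sum(sizes) / len(sizes)))
print("max:", ordered[-1])
```

The mean lands well above the median because of the long tail, which is exactly the part of the distribution a fixed 100-token benchmark never exercises.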

Realistic policy complexity

Identity-only policy evaluation is cheap. Classification across multiple data classes plus cross-reference data calls is expensive. The benchmark should report numbers per policy complexity bucket. Reporting a single "policy evaluation latency" without specifying the complexity hides the production-relevant numbers.

Concurrency that matches production

The benchmark should run at concurrency levels that match the target production environment. A benchmark at 10 requests per second produces numbers that do not predict behavior at 1,000.
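Request rate and concurrency are related but not the same thing. Little's law gives the average number of requests in flight as arrival rate times mean latency. The sketch below applies it with an assumed 2-second mean end-to-end latency, a figure chosen only for illustration.

```python
def in_flight_requests(arrival_rate_rps: float, mean_latency_s: float) -> float:
    """Little's law: average concurrency = arrival rate x mean time in system."""
    return arrival_rate_rps * mean_latency_s


# Assumed mean end-to-end latency of 2 seconds per LLM call.
for rps in (10, 100, 1000):
    print(rps, "req/s ->", in_flight_requests(rps, 2.0), "requests in flight")
```

At the same request rate, a slower upstream model means more open connections held by the gateway, so the benchmark should state both the rate and the latency it was run against.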

Cold start vs warm

Several gateway architectures cache policy decisions, identity lookups, or classification results. The first-request latency on a cold cache differs from the warm-cache latency. Both numbers matter. The cold-cache number predicts behavior after a deployment or a cache eviction event.
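The toy model below shows how a harness might report the two numbers separately. `SimulatedGateway` and its 40 ms miss and 3 ms hit costs are entirely hypothetical; a real harness would flush the actual caches (or restart the gateway) and time real requests.

```python
class SimulatedGateway:
    """Toy model: a policy decision costs miss_ms on a cache miss, hit_ms on a hit."""

    def __init__(self, miss_ms=40.0, hit_ms=3.0):
        self.miss_ms = miss_ms
        self.hit_ms = hit_ms
        self._cache = set()

    def flush(self):
        """Stand-in for a deployment or cache eviction event."""
        self._cache.clear()

    def handle(self, identity):
        """Return the simulated policy latency in ms for this identity."""
        if identity in self._cache:
            return self.hit_ms
        self._cache.add(identity)
        return self.miss_ms


def cold_and_warm(gateway, identities):
    """Measure one pass on an empty cache, then a second pass on the filled cache."""
    gateway.flush()
    cold = [gateway.handle(i) for i in identities]  # misses for each distinct identity
    warm = [gateway.handle(i) for i in identities]  # same identities, now cached
    return cold, warm


identities = [f"user-{i}" for i in range(100)]
cold, warm = cold_and_warm(SimulatedGateway(), identities)
print("cold max:", max(cold), "ms; warm max:", max(warm), "ms")
```

Keeping the two passes in separate result sets prevents the cold samples from being averaged away inside a much larger warm population.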

Wire-time isolation

The benchmark should isolate gateway processing time from upstream LLM provider time. The gateway's contribution to the request latency is the metric that matters for comparing gateways. Upstream provider variation should be excluded from the gateway measurement.
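One simple way to do the isolation is to timestamp each request at four points on the gateway host and subtract the upstream span from the total. The field names below are hypothetical, and the sketch assumes all four timestamps come from the same monotonic clock.

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    """Timestamps in ms, all taken from one clock on the gateway host."""
    received: float       # request arrives at the gateway
    upstream_sent: float  # gateway forwards the request to the LLM provider
    upstream_done: float  # provider response fully received
    responded: float      # gateway finishes replying to the client


def gateway_overhead_ms(t: RequestTiming) -> float:
    """Total time in the gateway minus the time spent waiting on the provider."""
    total = t.responded - t.received
    upstream = t.upstream_done - t.upstream_sent
    return total - upstream


# Hypothetical request: 1820 ms end to end, 1800 ms of it spent upstream.
timing = RequestTiming(received=0.0, upstream_sent=12.0,
                       upstream_done=1812.0, responded=1820.0)
print(gateway_overhead_ms(timing))  # 20.0
```

An alternative with the same goal is to benchmark against a mock provider with a fixed response time, which removes upstream variance instead of subtracting it.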

Comparison points worth tracking

The numbers that produce useful gateway comparisons:

| Metric | What it tells you |
|---|---|
| p50 latency at 10 req/s, simple policy | Best-case overhead |
| p95 latency at 100 req/s, classification policy | Production-relevant tail |
| p99 latency at 500 req/s, full policy | Worst-case tail under load |
| Throughput ceiling at p95 < 100ms | Capacity per node |
| Behavior at 2x throughput ceiling | Fail-mode under overload |
| Cold-cache p95 vs warm-cache p95 | Caching architecture impact |
| Latency added for cross-provider routing | Multi-provider penalty |

The single most-cited number in marketing copy is the "sub-X ms overhead" figure, which is almost always the warm-cache simple-policy p50. The production-relevant number is the cold-cache complex-policy p95 at expected concurrency.

Common mistakes

Several benchmarking patterns produce numbers that do not predict production behavior.

Mistake 1: Benchmarking against a single fixed prompt

A fixed prompt size produces a single latency distribution. The realistic distribution of prompt sizes produces a different latency distribution because larger prompts spend more time in classification and policy evaluation.

Mistake 2: Ignoring policy complexity

A gateway running identity-only policy has different latency than the same gateway running multi-class classification with cross-reference lookups. The benchmark needs to vary policy complexity.

Mistake 3: Running on developer hardware

A benchmark on a laptop produces numbers that do not translate to production servers. The hardware differences between a developer machine and a production server (core count, often 8-16 on the server; ECC memory; storage type, SSD vs NVMe) show up in the tail latency.

Mistake 4: Ignoring upstream variance

LLM provider latency varies by region, by time of day, and by request size. A benchmark that does not isolate gateway time from upstream time produces noisy results that depend on which provider was sampled during the benchmark window.

Mistake 5: Single-run results

Single-run benchmarks have run-to-run variance that can exceed the difference between two gateways. The benchmark should average across multiple runs at the same configuration and report the variance.
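A minimal sketch of the multi-run summary, using the standard library's `statistics` module, is shown below. The per-run p95 values for the two gateways are invented to illustrate a case where the gap between products is smaller than the run-to-run spread.

```python
import statistics


def summarize_runs(p95_per_run_ms):
    """Return (mean, sample standard deviation) of p95 across repeated runs."""
    return statistics.mean(p95_per_run_ms), statistics.stdev(p95_per_run_ms)


# Hypothetical p95 overhead (ms) from five runs at the same configuration.
gateway_a = [41.0, 44.0, 39.0, 46.0, 40.0]
gateway_b = [43.0, 40.0, 45.0, 42.0, 44.0]

mean_a, sd_a = summarize_runs(gateway_a)
mean_b, sd_b = summarize_runs(gateway_b)
print(f"A: {mean_a:.1f} ms +/- {sd_a:.1f}")
print(f"B: {mean_b:.1f} ms +/- {sd_b:.1f}")
print("gap smaller than spread:", abs(mean_a - mean_b) < max(sd_a, sd_b))
```

Here any single pair of runs could rank the two gateways in either order, which is why the spread belongs next to the headline number.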

DeepInspect

In internal testing, DeepInspect measures sub-50ms p95 overhead at realistic concurrency for the policy-evaluation-and-classification path. The number is the warm-cache p95 with prompt-level classification enabled and identity-aware policy applied. The cold-cache p95 is higher and is tracked separately.

The benchmark methodology DeepInspect publishes covers the four properties above: latency distribution per policy complexity, throughput ceiling, fail-mode behavior under overload, and upstream-provider isolation. The methodology is published alongside the numbers so customers can run their own validation against their target workload.

If you are evaluating AI gateway products and want a benchmark you can run against your production workload, book a demo today.