LLM Gateway Benchmarks: What to Measure, How to Measure It, and Where Most Vendor Numbers Mislead
Most vendor LLM gateway benchmarks publish a median latency figure under synthetic load and stop there. The numbers a platform team actually needs are the policy-decision tail latency, the policy-evaluation throughput under contention, the cold-cache impact, and the audit-write durability cost. This walkthrough shows the four measurement axes, the workload profiles that produce comparable numbers, and the failure modes that surface only at production traffic shape.

A median latency figure from a synthetic load test tells a platform team almost nothing about how an LLM gateway will behave on production traffic. The numbers a platform team needs are the policy-decision tail latency at the 95th and 99th percentiles, the policy-evaluation throughput when caches go cold, the audit-write durability cost under realistic fan-out, and the failure behavior when the upstream model endpoint slows down or returns a 5xx. Vendor benchmark posts rarely publish more than one of those.
I want to walk through the four measurement axes that matter, the workload profiles that produce comparable numbers across products, and the failure modes that surface only when the benchmark looks like real traffic.
The four axes that matter
A benchmark that excludes any of these four axes leaves the platform team guessing on the dimension that turns out to matter at go-live.
Policy-decision tail latency
The interesting number is the time the gateway adds at the 99th percentile, not at the median. A policy gateway that decides in 4 milliseconds at the median can still add 180 milliseconds at the 99th percentile when cache pressure forces a policy fetch from the control plane. A chat interface that occasionally stalls for close to 200 milliseconds reads as broken even if the median request was fast.
Report p50, p95, p99, and p99.9 for every gateway test. Anything published as a single number is a marketing artifact.
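As a minimal sketch of that report, the snippet below computes the four percentiles from a list of gateway-added latencies using the nearest-rank method. The function names and the choice of nearest-rank (rather than interpolation) are assumptions made for illustration, not something any particular benchmark tool prescribes.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no latency samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

def latency_report(samples_ms):
    """Return the tail-latency summary for one gateway test run."""
    levels = [("p50", 50), ("p95", 95), ("p99", 99), ("p99.9", 99.9)]
    return {name: percentile(samples_ms, pct) for name, pct in levels}
```

The highest percentile only carries information when the run has well over a thousand samples; with fewer, the nearest-rank value is simply the maximum.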
Policy-evaluation throughput under contention
Throughput under contention measures the policy engine, not the network stack. Run two workloads concurrently against the same gateway: a high-rate, low-policy-complexity workload and a low-rate, high-policy-complexity workload. The cross-effects show up as queue depth on the policy engine's worker pool. A gateway that holds 50,000 evaluations per second on simple policies and 5,000 on complex ones in isolation may see the simple workload drop to 1,200 when the two run concurrently, because the complex policies starve the simple ones.
Report the cross-effect ratio along with the isolated numbers.
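One way to express the cross-effect ratio, assuming it is defined as concurrent throughput divided by isolated throughput for each workload, is sketched below. That definition and the function names are assumptions for illustration; a ratio near 1.0 means the workloads do not interfere, and a ratio near 0 means one is being starved.

```python
def cross_effect_ratio(isolated_eps, concurrent_eps):
    """Concurrent throughput as a fraction of isolated throughput.
    Both values are policy evaluations per second."""
    if isolated_eps <= 0:
        raise ValueError("isolated throughput must be positive")
    return concurrent_eps / isolated_eps

def contention_report(isolated, concurrent):
    """Both arguments map a workload name to evaluations per second."""
    return {
        name: {
            "isolated_eps": isolated[name],
            "concurrent_eps": concurrent[name],
            "cross_effect_ratio": cross_effect_ratio(isolated[name], concurrent[name]),
        }
        for name in isolated
    }
```

Feeding it the isolated figures from the example above together with hypothetical concurrent measurements, for instance `contention_report({"simple": 50000, "complex": 5000}, {"simple": 1200, "complex": 4500})`, puts the starvation on the page as a number per workload instead of leaving it implied.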
Cold-cache impact
Most published benchmarks run after a warmup phase that fills the policy cache, the identity cache, and the classifier cache. Production traffic includes deploy-time restarts, autoscaler-driven new instances, and per-tenant cold starts. Run the benchmark against a fresh process and record the first 60 seconds of latency separately. The delta between warm-cache and cold-cache p99 is the signal a platform team needs to size headroom.
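A sketch of the warm-versus-cold comparison, assuming each sample is recorded as a pair of seconds since process start and added latency in milliseconds. The sample format and function names are hypothetical; the 60-second cold window comes from the text above.

```python
def p99(samples):
    """Nearest-rank 99th percentile using integer ceiling division."""
    ordered = sorted(samples)
    rank = (99 * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]

def cold_warm_delta(samples, cold_window_s=60.0):
    """samples: iterable of (seconds_since_process_start, latency_ms).
    Splits the run at cold_window_s and compares the two p99 values."""
    samples = list(samples)
    cold = [lat for t, lat in samples if t < cold_window_s]
    warm = [lat for t, lat in samples if t >= cold_window_s]
    if not cold or not warm:
        raise ValueError("need samples on both sides of the cold window")
    cold_p99, warm_p99 = p99(cold), p99(warm)
    return {
        "cold_p99_ms": cold_p99,
        "warm_p99_ms": warm_p99,
        "delta_ms": cold_p99 - warm_p99,
    }
```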
Audit-write durability cost
A gateway that writes the audit record asynchronously to a queue is faster than a gateway that flushes a signed record to durable storage on every request. The first design can lose queued records on a node crash. The second carries a per-request fixed cost from the audit write. Measure each mode and report the per-mode latency separately. Compliance evidence is a property of the durable mode.
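To make the per-mode cost concrete, here is a rough sketch that times the two write paths in isolation. It assumes a local append-only file with `os.fsync` as a stand-in for durable storage and an in-process `queue.Queue` as a stand-in for the asynchronous path; a real gateway's audit store, signing step, and queue will differ, so treat the numbers it produces as an illustration of the method only.

```python
import os
import queue
import time

def time_durable_writes(path, records):
    """Append each record and fsync before moving on, as a durable
    audit mode would. Returns the per-record wall time in seconds."""
    timings = []
    with open(path, "ab") as f:
        for rec in records:
            start = time.perf_counter()
            f.write(rec + b"\n")
            f.flush()             # push Python's buffer to the OS
            os.fsync(f.fileno())  # ask the OS to push it to the device
            timings.append(time.perf_counter() - start)
    return timings

def time_async_enqueues(q, records):
    """Hand each record to an in-memory queue and return immediately.
    Anything still in the queue is lost if the process dies."""
    timings = []
    for rec in records:
        start = time.perf_counter()
        q.put_nowait(rec)
        timings.append(time.perf_counter() - start)
    return timings
```

Running both against the same list of byte-string records and comparing the p99 of each timing list gives the per-mode column the benchmark should report.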
Workload profiles that produce comparable numbers
The same gateway under different workload shapes produces wildly different numbers. The four profiles below cover the cases a platform team needs.
Profile A: synthetic uniform
A flat request rate against a fixed token count and a fixed policy set. This is the profile most vendor benchmarks publish. It is the profile that produces the best numbers and the least useful comparison.
Profile B: production-shaped read mix
Replays the request-size distribution and the policy diversity of an enterprise workload. Token counts follow a long-tail distribution. Caller identities number in the thousands. Policy sets vary by route. This is the profile that surfaces the cache pressure problem.
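When a traffic replay is not available, a synthetic approximation of this profile can be sketched as below. The lognormal token distribution, the Zipf-style caller weighting, the route names, and every default parameter are assumptions chosen for illustration; they should be fitted to the deployment's real traffic before the numbers are compared across gateways.

```python
import random

# Hypothetical route names; each route would carry its own policy set.
ROUTES = ("chat", "summarize", "code", "search")

def make_workload(n_requests, n_callers=5000, zipf_s=1.1,
                  token_mu=6.0, token_sigma=1.0, seed=0):
    """Generate a list of synthetic requests with long-tailed prompt
    sizes and a skewed caller population."""
    rng = random.Random(seed)
    # Zipf-style weights: the caller at rank r is drawn in proportion to 1 / r**s.
    weights = [1.0 / (rank ** zipf_s) for rank in range(1, n_callers + 1)]
    callers = rng.choices(range(n_callers), weights=weights, k=n_requests)
    workload = []
    for caller in callers:
        workload.append({
            "caller_id": f"caller-{caller}",
            "route": rng.choice(ROUTES),
            # Lognormal draw gives a long tail of large prompts.
            "prompt_tokens": max(1, int(rng.lognormvariate(token_mu, token_sigma))),
        })
    return workload
```

A fixed seed keeps the workload identical across the gateways under test, which is what makes the resulting numbers comparable.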
Profile C: deploy-time cold start
Start a fresh gateway instance and apply Profile B from request zero. The first 60 seconds are the measurement of interest. Report the time to reach steady-state p99 and the peak p99 in the warmup window. This is the profile that surfaces the cold-cache impact.
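The two numbers this profile asks for can be sketched with fixed windows over the first 60 seconds. The 5-second window, the 10 percent tolerance around the steady-state p99, and the sample format are all assumptions for illustration; the steady-state p99 itself would come from a separate warm run.

```python
def p99(samples):
    """Nearest-rank 99th percentile using integer ceiling division."""
    ordered = sorted(samples)
    rank = (99 * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]

def warmup_profile(samples, steady_p99_ms, window_s=5.0,
                   horizon_s=60.0, tolerance=1.10):
    """samples: iterable of (seconds_since_process_start, latency_ms).
    Returns the peak windowed p99 and the time to reach steady state."""
    n_windows = int(horizon_s // window_s)
    buckets = [[] for _ in range(n_windows)]
    for t, lat in samples:
        idx = int(t // window_s)
        if 0 <= idx < n_windows:
            buckets[idx].append(lat)
    window_p99 = [p99(b) if b else None for b in buckets]
    observed = [v for v in window_p99 if v is not None]
    if not observed:
        raise ValueError("no samples inside the warmup horizon")
    threshold = steady_p99_ms * tolerance
    # Walk backwards from the last window: steady state starts at the
    # earliest window after which no window exceeds the threshold.
    # Empty windows are treated as not exceeding it.
    settle = None
    for i in range(n_windows - 1, -1, -1):
        v = window_p99[i]
        if v is not None and v > threshold:
            break
        settle = i * window_s
    return {
        "peak_p99_ms": max(observed),
        "time_to_steady_s": settle,  # None if the last window is still above threshold
        "window_p99_ms": window_p99,
    }
```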
Profile D: upstream degradation
Run Profile B with the upstream model endpoint returning a mix of 200 responses, 429 throttles, and intermittent 5xx errors. The gateway's behavior under that mix shows whether it queues, fails open, fails closed, or retries. The numbers to capture are the share of requests that reached the model, the share that were refused at the gateway, the share that timed out at the client, and the audit record completeness.
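The four numbers from this profile reduce to a tally over per-request outcomes. The sketch below assumes the load generator labels each request with one of three hypothetical outcome strings and a flag for whether a matching audit record was found afterwards; both the labels and the record format are illustrative.

```python
from collections import Counter

OUTCOMES = ("reached_model", "refused_at_gateway", "client_timeout")

def profile_d_summary(requests):
    """requests: iterable of dicts with an 'outcome' key (one of OUTCOMES)
    and an 'audit_recorded' boolean."""
    requests = list(requests)
    total = len(requests)
    if total == 0:
        raise ValueError("no requests to summarize")
    counts = Counter(r["outcome"] for r in requests)
    unknown = set(counts) - set(OUTCOMES)
    if unknown:
        raise ValueError(f"unknown outcomes: {sorted(unknown)}")
    summary = {f"{name}_share": counts[name] / total for name in OUTCOMES}
    # Completeness: fraction of all requests with an audit record,
    # regardless of how the request ended.
    summary["audit_completeness"] = (
        sum(1 for r in requests if r["audit_recorded"]) / total
    )
    return summary
```

Counting audit completeness over every request, including refusals and timeouts, matters here: a gateway that only records the requests that reached the model would otherwise look complete.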
What vendor benchmarks usually omit
Three measurement choices recur across published LLM gateway benchmark posts and skew the comparison.
The first is excluding the policy-decision time from the headline number by measuring only the proxy passthrough on a no-op policy. The number that results describes the network stack, not the gateway. Production policies are not no-ops; the headline should reflect that.
The second is reporting only the median. The median is the cheapest number to optimize and the least informative number to publish. The tail is the production-experience number.
The third is omitting the audit-write cost by running with asynchronous logging. The audit record is a compliance artifact. A benchmark that excludes the cost of producing the compliance artifact is benchmarking a different product than the one the platform team will deploy.
Failure modes the benchmark catches only at scale
Three failures surface only when a Profile B or Profile D workload runs for at least 30 minutes.
Slow policy-eval starvation: complex policies starve simple ones in the same worker pool. The platform team sees the simple route's latency double while the complex route's latency stays flat. The fix is per-route or per-priority worker pool separation.
Audit-queue backpressure: the durable audit write falls behind the request rate, and the gateway either degrades throughput or drops records depending on the design. The fix is to size the audit subsystem for peak request rate plus a margin.
Identity-cache thrash: when the number of unique caller identities exceeds the identity-cache size, every miss costs an identity provider round trip. The fix is to size the identity cache to the active caller count and to set a sensible TTL.
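The identity-cache failure is easy to reproduce offline. The sketch below assumes an LRU eviction policy, which is a common choice but not something every gateway uses, and it ignores TTL expiry. It replays a sequence of caller identities through a cache of a given capacity and reports the hit rate.

```python
from collections import OrderedDict

def lru_hit_rate(accesses, capacity):
    """Replay caller identities through an LRU cache and return the
    fraction of lookups served from the cache."""
    accesses = list(accesses)
    if not accesses:
        raise ValueError("no accesses to replay")
    cache = OrderedDict()
    hits = 0
    for key in accesses:
        if key in cache:
            hits += 1
            cache.move_to_end(key)          # mark as most recently used
        else:
            cache[key] = True               # miss: one identity provider round trip
            if len(cache) > capacity:
                cache.popitem(last=False)   # evict the least recently used entry
    return hits / len(accesses)

# 1,000 active callers, each seen ten times in round-robin order.
round_robin = [caller for _ in range(10) for caller in range(1000)]
fits = lru_hit_rate(round_robin, capacity=1000)      # 0.9
thrashes = lru_hit_rate(round_robin, capacity=999)   # 0.0
```

Round-robin access is the worst case for LRU, so real traffic will degrade less abruptly, but the shape is the same: once the active caller count exceeds the cache size, the hit rate falls off a cliff rather than declining gradually.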
DeepInspect
DeepInspect publishes benchmark numbers on the four axes above against the four profiles above. The per-decision audit record is on the critical path; the audit-write durability cost is included in the published p99. The policy engine separates per-route worker pools so complex policies do not starve simple ones. The identity cache is sized at deployment time against the customer's active caller count.
The benchmark posture is part of the architecture. A gateway that decides in 4 milliseconds at the median, but only as long as the cache is warm and the audit is asynchronous, is not the same product as a gateway that decides in 7 milliseconds at the p99 with a signed audit record flushed to durable storage. The second is the gateway a platform team can put on the request path of a high-risk AI system covered by EU AI Act Article 12. Book a technical deep dive at deepinspect.ai to walk through the numbers against your workload shape.
Frequently asked questions
- What is a reasonable target for p99 policy-decision latency?
A reasonable target for an inline AI gateway on a production-shaped workload is under 20 milliseconds at the p99 for the policy decision, with the audit record flushed to durable storage on the same request. Numbers under 10 milliseconds are usually achievable when policies are simple, caller identities are bounded, and the audit subsystem is correctly sized. Numbers over 50 milliseconds at the p99 usually indicate the policy engine is contending with another workload or the identity cache is undersized.
- Should the benchmark include the model's own latency?
Separate the gateway's added latency from the model's response time. The model latency is a property of the model and the provider. The gateway latency is what the platform team buys when it picks a gateway. Report end-to-end latency for context, but the comparable number across gateways is the added latency above the model.
- How long should each benchmark run?
A 30-minute run after warmup is the minimum to catch cache pressure, audit-queue backpressure, and identity-cache thrash. Shorter runs miss the failure modes that matter. A 24-hour run catches deploy-time cold starts and the daily traffic shape, which is closer to what the gateway will actually see.
- What does "production-shaped" workload mean concretely?
A production-shaped workload follows the prompt-size distribution, the policy diversity, and the caller-identity count of the deployment. The simplest way to produce one is to replay an anonymized day of real traffic against the gateway under test. Synthetic workloads can approximate this with lognormal token distributions and Zipf-distributed caller identities, but the replay produces the most defensible numbers.
- Does fail-closed behavior affect the benchmark?
A fail-closed gateway returns a refusal when the policy engine cannot reach a decision. The benchmark should report the share of requests that hit the fail-closed path under Profile D and the time the gateway takes to recognize the upstream as healthy again. A fail-open gateway hides the policy outage by passing the request through; the benchmark should call out the design and report it as a separate column.