AI Rate Limiting by Identity: Why Per-Key Quotas Miss the Actual Risk
A per-API-key rate limit lets one runaway service consume the whole quota for its tenant. An identity-bound rate limit accumulates against the verified caller and produces a defensible refusal at the request layer. This walkthrough covers the four identity dimensions a useful rate limit accumulates against, the algorithms that hold under burst traffic, and the audit-record fields that make a refusal admissible.

A rate limit on the OpenAI API key is a tenant-wide quota. The first service that runs hot consumes the entire allocation and the rest of the tenant gets 429 responses for the remainder of the window. The control that scales binds the rate limit to the verified caller identity at the gateway, not to the API key. The accumulator is per agent, per user, per role, or per data classification depending on the policy. The refusal is recorded with the identity and the policy in the per-decision audit log.
The four identity dimensions
A useful rate limit accumulates against one or more of the dimensions below, not against the upstream API key.
Agent identity
The verified workload identity of the calling agent. Limits the calls a single agent can make per minute. Catches the runaway loop, the misconfigured retry, and the prompt-injection payload that produces a sustained call burst.
Natural-person identity
The verified user the request is being made on behalf of. Limits the calls a single user can drive through any agent. Catches the user who pastes a 50,000-row spreadsheet and asks for summaries on each row.
Role or group
The verified group from the identity provider. Limits the cohort. "All marketing analysts" gets one budget; "all customer-service agents" gets another. Catches misallocation across teams.
Data classification
The classification of the request payload. Limits the rate at which sensitive data flows to a model. "PII-tagged requests per minute" is a separate accumulator from "general requests per minute." Catches the case where a service is correctly authorized to call the model but is dumping more sensitive data than the policy expects.
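To make the four dimensions concrete, here is a minimal sketch of how a gateway might turn one verified request into the set of accumulators it counts against. The `CallerContext` shape, its field names, and the key prefixes are all assumptions made for illustration; the article does not prescribe a schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallerContext:
    """Identity facts the gateway has already verified (hypothetical shape)."""
    agent_id: str             # verified workload identity of the calling agent
    user_id: str              # verified natural person behind the request
    groups: tuple[str, ...]   # groups asserted by the identity provider
    classification: str       # tag on the request payload, e.g. "pii" or "general"


def accumulator_keys(ctx: CallerContext) -> list[str]:
    """Return one accumulator key per dimension this request counts against."""
    keys = [f"agent:{ctx.agent_id}", f"user:{ctx.user_id}"]
    keys += [f"group:{g}" for g in sorted(ctx.groups)]
    keys.append(f"class:{ctx.classification}")
    return keys
```

A policy can then attach a limit to any subset of these keys; a request is refused if any one of its accumulators is exhausted.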
Why per-key quotas miss the actual risk
The per-key quota at the model provider has two characteristics that limit what it can enforce. The first is that the key identifies the tenant, not the caller. A single key is shared across every service that uses it; the first hungry service starves the rest. The second is that the provider applies the quota at the provider's side; the refusal arrives as a 429 response after the work to assemble the prompt is complete. A more meaningful refusal would happen before the prompt is even sent.
The identity-bound rate limit at the gateway has the opposite characteristics. The accumulator is per caller, which is the dimension the policy actually cares about. The refusal happens at the gateway before the call reaches the provider, which saves the prompt-assembly work and produces a defensible audit record.
The algorithms that hold under bursts
Two algorithms cover most cases.
Token bucket per identity
The token bucket gives the caller a refill rate and a maximum burst. The refill rate is the steady-state allowance; the burst is the headroom for occasional spikes. The token bucket holds well for synchronous workloads where the caller's natural rate is bursty.
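A minimal sketch of a per-identity token bucket follows. The class name, the in-process `buckets` dictionary, and the example rates are assumptions for illustration; the clock is passed in as `now` so the behavior is easy to reason about and test.

```python
from dataclasses import dataclass, field


@dataclass
class TokenBucket:
    refill_per_sec: float            # steady-state allowance
    capacity: float                  # maximum burst
    tokens: float = field(init=False)
    updated_at: float = 0.0          # timestamp of the last refill

    def __post_init__(self) -> None:
        self.tokens = self.capacity  # a new caller starts with full burst headroom

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill for the time elapsed since the last call, capped at capacity.
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# One bucket per verified identity (illustrative in-process store).
buckets: dict[str, TokenBucket] = {}


def allow(identity: str, now: float) -> bool:
    bucket = buckets.setdefault(identity, TokenBucket(refill_per_sec=1.0, capacity=5.0))
    return bucket.allow(now)
```

With these example values a caller can spend five requests at once and then sustain one request per second.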
Sliding window per identity
The sliding window counts requests in the last N seconds and refuses when the count exceeds the limit. The window holds well for asynchronous workloads where bursts are not legitimate.
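The sliding window can be sketched as a log of recent timestamps per identity. This is one simple variant (a sliding-window log), with a hypothetical class name; refused requests are not recorded, which is a design choice rather than something the article specifies.

```python
from collections import deque


class SlidingWindow:
    def __init__(self, limit: int, window_sec: float) -> None:
        self.limit = limit
        self.window_sec = window_sec
        self.stamps: deque[float] = deque()  # timestamps of admitted requests

    def allow(self, now: float) -> bool:
        # Drop timestamps that have aged out of the last `window_sec` seconds.
        while self.stamps and self.stamps[0] <= now - self.window_sec:
            self.stamps.popleft()
        if len(self.stamps) >= self.limit:
            return False
        self.stamps.append(now)
        return True
```

Unlike the token bucket, there is no burst headroom here: the count over any trailing window never exceeds `limit`.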
The choice between the two is a property of the workload. Synchronous interactive workloads suit the token bucket; batch workloads suit the sliding window.
What "cost" to deduct per request
The simplest rate limit deducts one token per request. A more useful rate limit deducts based on the request's actual cost.
A classification weight (for example, a multiplier applied to requests that carry sensitive tags) lets the policy bias the rate limit toward refusing the heavy requests rather than the light ones. A 50-token chat is cheap; a 50,000-token RAG request with PII tags is expensive. The cost model is part of the policy and is updated as the upstream provider's pricing changes.
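One way to express such a cost model is request size multiplied by a per-classification weight. The sketch below assumes that form; the weight values, the 1,000-token unit, and the function name are hypothetical and would come from the policy in practice.

```python
# Hypothetical policy values: sensitive payloads drain the bucket faster.
CLASSIFICATION_WEIGHT = {"general": 1.0, "pii": 4.0}


def request_cost(prompt_tokens: int, max_output_tokens: int, classification: str) -> float:
    """Bucket tokens to deduct: request size in 1k-token units times the class weight."""
    size_units = (prompt_tokens + max_output_tokens) / 1000
    return size_units * CLASSIFICATION_WEIGHT[classification]
```

Under these assumed weights, the 50-token chat costs 0.05 units and the 50,000-token PII-tagged request costs 200, so the same bucket tolerates thousands of the former for each one of the latter.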
The refusal record
A refused request still produces a per-decision audit record. The record carries the identity, the rate-limit policy, the bucket state, and the reason for refusal.
The record is the artifact the auditor reads to confirm the refusal was policy-driven, not an outage. The gateway signs each record, so any rewrite by the application after the fact is detectable.
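A refusal record with those fields, plus a tamper-evident signature, might look like the sketch below. The field names and the choice of HMAC-SHA256 over a canonical JSON encoding are assumptions; a deployment that needs third parties to verify records without holding the key would more likely use an asymmetric signature.

```python
import hashlib
import hmac
import json


def _canonical(record: dict) -> bytes:
    # Stable encoding so the same record always produces the same bytes.
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode()


def sign_record(record: dict, key: bytes) -> dict:
    signature = hmac.new(key, _canonical(record), hashlib.sha256).hexdigest()
    return {**record, "signature": signature}


def verify_record(signed: dict, key: bytes) -> bool:
    body = {k: v for k, v in signed.items() if k != "signature"}
    expected = hmac.new(key, _canonical(body), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signed.get("signature", ""))


# Hypothetical refusal record: identity, policy, bucket state, reason.
refusal = sign_record(
    {
        "decision": "refused",
        "agent_id": "summarizer",
        "policy": "per-agent-60rpm",
        "bucket_tokens": 0.0,
        "reason": "rate_limit_exceeded",
    },
    key=b"demo-key-held-by-the-gateway",
)
```

Because the key stays with the gateway, an application that edits the stored record cannot produce a matching signature.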
How identity-bound rate limiting interacts with cost governance
A rate limit is the first layer of cost control. The rate limit bounds how fast an identity can consume; the budget bounds how much it can consume in total. The rate limit refuses the burst; the budget refuses the cumulative overage. Both produce audit records and both enforce at the gateway.
When the platform team sets the budget per team or per tenant, the rate limit per identity feeds the budget. The team's monthly budget at OpenAI is the sum of the per-identity budgets within the team. Identity-bound rate limiting gives the platform team a defensible per-team cost allocation; per-key limits leave the team with a single line item and no attribution.
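The attribution argument can be illustrated with a small roll-up over audit records. The record fields (`agent_id`, `usd`) and the identity-to-team mapping are hypothetical; the point is only that per-identity records make a per-team total a simple sum.

```python
from collections import defaultdict


def spend_by_team(records: list[dict], team_of: dict[str, str]) -> dict[str, float]:
    """Sum the estimated USD on each audit record under the calling agent's team."""
    totals: defaultdict[str, float] = defaultdict(float)
    for record in records:
        totals[team_of[record["agent_id"]]] += record["usd"]
    return dict(totals)
```

With a shared API key and no identity on the record, the same input collapses to one total and the mapping step has nothing to key on.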
How identity-bound rate limiting interacts with abuse
A prompt-injection payload that produces a runaway call sequence triggers the per-agent rate limit and is refused. A user who pastes a 50,000-row spreadsheet triggers the per-user rate limit and is refused. A misconfigured retry that hammers the model triggers the per-agent rate limit and is refused. In all three cases the audit record carries the identity, the policy, and the bucket state. The platform team can investigate from the records rather than from the provider's 429 counts.
DeepInspect
DeepInspect implements identity-bound rate limits at the gateway against all four dimensions: agent identity, natural-person identity, role, and data classification. The accumulators use token bucket and sliding window depending on the policy. The cost model is configurable per policy and is updated as upstream pricing changes. The refusal record is signed and carries the bucket state and the reason for refusal.
The same policy plane runs the rate limits, the destination allowlist, the residency rules, and the data-classification rules. The platform team operates one set of policies, not one rate-limit posture per provider. Book a technical deep dive at deepinspect.ai to walk through the rate-limit dimensions against your workload.
Frequently asked questions
- What about distributed deployments where the gateway runs on multiple nodes?
The accumulator is shared across gateway nodes via a low-latency store (Redis Cluster, DynamoDB, or a custom CRDT-based store). Each read and write against the accumulator adds a small fixed cost to every request's latency. That cost is a property of the rate-limit design and should be part of the gateway benchmark the platform team runs.
- What if the limit refuses a legitimate burst?
The policy supports per-identity exceptions and per-route burst headroom. The exception is recorded in the policy registry and surfaces in the audit record. A legitimate burst that recurs is the signal to raise the limit; the audit records supply the evidence needed to tune the rate-limit policy.
- How does this interact with the upstream provider's quota?
The gateway's identity-bound limit is a layer above the provider's quota. The provider's quota still applies. The gateway's limit refuses earlier and with better attribution. The two work together; the gateway shapes the traffic the provider sees.
- What about agents that legitimately call the model thousands of times in a burst (e.g., batch evaluation)?
A separate batch-route identity gets a separate rate-limit policy with higher allowances. The agent identity used for the batch is bound to the batch policy. The interactive identity used for the user-facing route is bound to a different policy with lower allowances. The two never share an accumulator.
- Can the rate limit be tied to spend rather than calls?
Yes. The cost model in the per-request deduction can produce a USD-denominated cost (estimated from the prompt and response token counts and the upstream price list). The accumulator holds USD per period. Refusal happens when the per-period spend exceeds the budget. The audit record carries the USD value of the refused request.
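The spend-based variant described in the last answer can be sketched as an accumulator that holds USD per identity per period. The prices, the period label, and the class and function names below are placeholders for illustration, not a real price list or a documented API.

```python
# Placeholder prices in USD per 1,000 tokens; a real policy would load the
# upstream provider's current price list instead.
PRICE_PER_1K = {"prompt": 0.005, "completion": 0.015}


def estimate_usd(prompt_tokens: int, completion_tokens: int) -> float:
    return (
        prompt_tokens / 1000 * PRICE_PER_1K["prompt"]
        + completion_tokens / 1000 * PRICE_PER_1K["completion"]
    )


class SpendAccumulator:
    def __init__(self, budget_usd: float) -> None:
        self.budget_usd = budget_usd
        self.spent: dict[tuple[str, str], float] = {}  # (identity, period) -> USD

    def try_spend(self, identity: str, period: str, usd: float) -> bool:
        """Admit the request only if it keeps the identity within its period budget."""
        key = (identity, period)
        current = self.spent.get(key, 0.0)
        if current + usd > self.budget_usd:
            return False  # refused; the caller records `usd` in the audit record
        self.spent[key] = current + usd
        return True
```

Keying on the period (for example `"2025-06"` for a monthly budget) means a new period starts from zero without an explicit reset step.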