AI Gateway Rate Limiting by Identity: Why Per-Key Limits Fail in Production

AI gateway rate limiting that uses the API key as the limit boundary fails in three production patterns: shared service accounts, agent fan-out, and cost runaway from a single high-volume identity. The fix is to limit per verified identity, where identity is the authenticated principal extracted from the request context, not the API key in the header. This article walks the failure modes, the architecture that fixes them, the data model the gateway needs, and the operational tradeoffs of identity-bound limits versus simpler per-key approaches.

By Parminder Singh · Founder & CEO, DeepInspect Inc.
Platform & Architecture · ai-gateway · rate-limiting · identity · engineering · policy-enforcement · agentic-ai
AI gateway rate limiting that uses the API key as the limit boundary fails in three production patterns. The first is the shared service account: a single key sits in the application and every user routes through it, so per-key limits cap the entire service. The second is agent fan-out: an agent loop can invoke the gateway a hundred times under one user request, and per-key limits either cap legitimate work or let one tenant exhaust the key allocation. The third is cost runaway from a single high-volume identity: a service account driven by a misbehaving job, or taken over by an attacker, can push the key into the provider rate limit, and the gateway's per-key limit cannot distinguish the runaway identity from the rest of the traffic. The fix is to limit per verified identity, where identity is the authenticated principal extracted from the request context, not the API key in the header.

I want to walk through the failure modes of per-key limits, the architecture that fixes them, the data model the gateway needs, and the operational tradeoffs of identity-bound limits versus simpler per-key approaches.

Why per-key limits fail

The API key in the gateway request authenticates the application or service that is calling the gateway. It does not authenticate the user or agent on whose behalf the application is calling. In any deployment where the application serves more than one user or agent, the API key abstracts away the operational identity that the limit needs to apply to.

In the shared service account pattern, the application holds one provider key, the application's own users authenticate to the application separately, and every model call passes through the single key. The gateway sees one identity. The per-key limit caps the entire application, which means a single misbehaving user can degrade the experience for all other users.

In the agent fan-out pattern, an agent that decomposes a user request into 50 model calls produces 50 gateway requests under one identity. If the limit is set high enough to allow the legitimate fan-out, it also allows a malicious agent to spend the same budget on unproductive work. If the limit is set low enough to cap the fan-out, it breaks the legitimate agent. The per-key limit has no way to distinguish the two.

In the cost runaway pattern, a service account that has been compromised drives a high request rate that looks legitimate at the per-key level. The provider rate limit fires, the application returns errors to all users, the on-call team scrambles to find the runaway, and the postmortem identifies a single principal as the cause. The per-key limit did not produce the early signal that would have caught the runaway in time.

What identity-bound limits look like

Identity-bound limits apply per verified principal. The verified principal is the identity of the user or agent that initiated the call, extracted from a token the application attaches to the gateway request. The token can be a JWT issued by the application's identity provider, a header set by the application after authentication, or a binding between the application's session and the gateway request that the gateway can verify.

The limit then applies per principal, independently of whatever else shares the key. A user with a budget of 1,000 calls per hour can consume 1,000 calls per hour and no more. An agent with a budget of 10,000 calls per hour can consume 10,000. A service account that the application shares is limited at the service-account allocation, which the operator sets explicitly.

The data model the gateway needs is a per-identity counter plus a per-identity quota table. The counter increments on each gateway request from that identity. The quota table maps identity to the allocation per window. When the counter reaches the allocation, the gateway refuses subsequent requests until the window resets.
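A minimal sketch of the counter and quota table just described, assuming a fixed one-hour window and an in-memory dictionary in place of a real store. The class name `FixedWindowLimiter`, the identity string `user:alice`, and the quota of 3 are illustrative assumptions, not part of any particular gateway.

```python
from dataclasses import dataclass, field


@dataclass
class FixedWindowLimiter:
    """Per-identity fixed-window call limiter (illustrative, in-memory only)."""

    window_seconds: int
    # Quota table: identity -> calls allowed per window (hypothetical values).
    quotas: dict[str, int]
    # Counter: (identity, window index) -> calls counted in that window.
    # Old windows are never evicted here; a real store would expire them.
    counters: dict[tuple[str, int], int] = field(default_factory=dict)

    def allow(self, identity: str, now: float) -> bool:
        """Count the request and return True, or refuse once the quota is spent."""
        allocation = self.quotas.get(identity, 0)  # unknown identities get nothing
        key = (identity, int(now // self.window_seconds))
        used = self.counters.get(key, 0)
        if used >= allocation:
            return False  # refused until the window index changes
        self.counters[key] = used + 1
        return True


limiter = FixedWindowLimiter(window_seconds=3600, quotas={"user:alice": 3})
results = [limiter.allow("user:alice", now=10.0) for _ in range(4)]
after_reset = limiter.allow("user:alice", now=3600.0)
print(results, after_reset)  # [True, True, True, False] True
```

The fourth request in the first window is refused, and the first request in the next window is accepted because the window index in the counter key has changed.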

The architecture in pseudocode

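The gateway's request path runs roughly as follows: validate the token and extract the principal, resolve the principal to its internal identity, look up that identity's quota, increment the call counter and refuse the request if the allocation is already spent, forward the request to the provider, and, when the response returns, add its token usage to the token counter.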

The call counter increments at the request boundary, not at the response boundary, so concurrent requests cannot exceed the allocation by racing past the check. The token counter increments at response time because token usage is not known until the response returns, which means requests already in flight can carry an identity slightly past its token allocation.

The data model the gateway needs

The gateway data model has three durable tables and one ephemeral counter.

The identity table maps an external identity (subject claim, email, agent ID) to an internal identity key. The internal key is stable across token refreshes and across changes to the external identifier.

The quota table maps internal identity to the allocation per window. Allocations are usually expressed in calls per minute, calls per hour, tokens per minute, and tokens per hour. Different identity classes can have different default allocations (per-user, per-agent, per-tenant service account).

The policy table maps internal identity plus request shape to a permitted/denied decision. The policy can be expressed as identity attributes plus request attributes evaluated by a rule engine.

The counter is the per-identity per-window count. It is held in a fast store (Redis or equivalent) keyed by identity and window. The counter is the only piece of state that needs millisecond-level access.
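As a rough illustration of the three tables and the counter key, the sketch below uses plain dictionaries. Every identifier in it (`id-001`, `small-model`, the `rl:` key prefix, the token allocations) is a hypothetical placeholder; a real deployment would keep the tables in a database and the counter in the fast store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quota:
    calls_per_hour: int
    tokens_per_hour: int


# Identity table: external identifier -> stable internal key. Two external
# identifiers can resolve to the same internal key.
IDENTITY = {
    "alice@example.com": "id-001",
    "idp|8f3a21": "id-001",
    "agent:summarizer": "id-002",
}

# Quota table: internal key -> allocation per window.
QUOTA = {
    "id-001": Quota(calls_per_hour=1_000, tokens_per_hour=500_000),
    "id-002": Quota(calls_per_hour=10_000, tokens_per_hour=5_000_000),
}

# Policy table: (internal key, requested model) -> permitted. Missing means denied.
POLICY = {
    ("id-001", "small-model"): True,
    ("id-002", "small-model"): True,
    ("id-002", "large-model"): True,
}


def counter_key(internal_id: str, dimension: str, window_seconds: int, now: float) -> str:
    """Key for the ephemeral counter: one key per identity, dimension, and window."""
    return f"rl:{internal_id}:{dimension}:{window_seconds}:{int(now // window_seconds)}"


def is_permitted(internal_id: str, model: str) -> bool:
    """Look up the policy decision; anything not listed is denied."""
    return POLICY.get((internal_id, model), False)


key = counter_key(IDENTITY["idp|8f3a21"], "calls", 3600, now=7300.0)
print(key)  # rl:id-001:calls:3600:2
```

Putting the window index in the key means a new window starts from a fresh counter without any explicit reset, and the old key can simply be left to expire.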

The operational tradeoffs

Identity-bound limits add operational complexity. Three tradeoffs recur.

Identity propagation. The application has to pass the user or agent identity to the gateway in a verifiable form. The gateway has to validate the identity context on each request. The overhead is typically single-digit milliseconds when validation is local, meaning the gateway checks the token itself rather than calling out to the identity provider.

Quota administration. The operator has to set quotas per identity class, not just per key. The administration burden is proportional to the number of identity classes. In practice the classes are few (employee, contractor, agent, service account, per-tenant tier), so the table stays manageable.

Counter precision. A counter that lives in a single in-memory store is fast but loses precision under store restarts. A counter that is durable across restarts costs more per request. A common pattern is a fast in-memory counter with periodic flush to durable storage, with the precision tradeoff bounded by the flush interval.
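The flush pattern in the last tradeoff can be sketched as follows, assuming a plain dictionary as the "durable" store and timestamps passed in explicitly. `FlushingCounter` and its parameters are illustrative names, not an existing library.

```python
class FlushingCounter:
    """In-memory counts with a periodic snapshot to a durable store (a dict here)."""

    def __init__(self, durable: dict[str, int], flush_interval: float):
        self.durable = durable
        self.flush_interval = flush_interval
        self.memory = dict(durable)  # on start, recover the last flushed state
        self.last_flush = 0.0

    def incr(self, key: str, now: float) -> int:
        """Increment in memory; flush everything if the interval has elapsed."""
        self.memory[key] = self.memory.get(key, 0) + 1
        if now - self.last_flush >= self.flush_interval:
            self.durable.update(self.memory)
            self.last_flush = now
        return self.memory[key]


store: dict[str, int] = {}
counter = FlushingCounter(store, flush_interval=5.0)
for t in (1.0, 2.0, 3.0, 6.0, 7.0, 8.0):  # the increment at t=6.0 triggers a flush
    counter.incr("id-001", now=t)

restarted = FlushingCounter(store, flush_interval=5.0)  # simulate a restart
print(counter.memory["id-001"], restarted.memory["id-001"])  # 6 4
```

After the simulated restart the counter resumes from 4 rather than 6, so the identity gets back the two calls made since the last flush. That undercount is the precision loss the flush interval bounds.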

When per-key limits still make sense

Per-key limits are not always wrong. Two cases keep them useful.

The first case is the upstream provider rate limit. The provider sees the gateway as one customer and applies the customer's overall rate limit at the per-key level. The gateway has to stay inside that limit aggregated across all identities. A per-key limit is the right abstraction at the provider boundary.

The second case is unauthenticated traffic. A public AI tool that does not require authentication has to apply some limit, and the IP or the API-key-equivalent is the only identity available. The mitigation here is to require authentication for anything beyond a low-volume trial.

In both cases the per-key limit is a complement to the identity-bound limit, not a replacement.

The audit record produced by identity-bound limits

A side effect of identity-bound rate limiting is the audit record the gateway produces. Each request is keyed on identity. The audit trail shows which identity consumed which calls, at what cost, against which policy. Post-incident reconstruction is mechanical: query the audit by identity and window.

The audit record is also the kind of artifact that supports the EU AI Act's Article 19 logging obligation and Article 26 deployer monitoring obligation for high-risk AI systems. The records that the rate limiter writes for cost control are the records the compliance program needs for monitoring. The two use cases share the same substrate.

DeepInspect

This is the identity-bound rate-limiting architecture DeepInspect operates on. DeepInspect sits inline between authenticated users or agents and the LLMs they call, extracts the verified identity from each request, applies per-identity rate limits and budget caps, and writes the per-decision audit record with consumption metrics attached.

For the rate-limiting use case specifically, DeepInspect's per-identity enforcement covers the failure modes per-key limits do not. The shared service account, the agent fan-out, and the cost-runaway-from-one-identity patterns are all addressed by the same control. The same records feed the compliance use case.

If you are running an AI workload where the per-key limit is no longer sufficient, book a demo today.

Frequently asked questions

How does the gateway verify identity claims?

The application attaches a token to the gateway request. The token is typically a JWT issued by the application's identity provider, signed by a key the gateway can verify. The gateway validates the signature, the expiry, and the audience claim, and extracts the principal. The verification is cached for the token's validity window to avoid re-validation cost per request.
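To make the three checks concrete, here is a standard-library sketch that verifies an HS256-signed JWT. It is an illustration under assumptions: a shared secret instead of the asymmetric keys most identity providers use, a single-string `aud` claim, and no caching. The `sign` helper exists only to mint a test token. A production gateway would normally rely on a vetted JWT library.

```python
import base64
import hashlib
import hmac
import json


def _b64url_decode(segment: str) -> bytes:
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def _b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


def sign(claims: dict, secret: bytes) -> str:
    """Test helper: mint an HS256 token the way an identity provider might."""
    header = _b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url_encode(json.dumps(claims).encode())
    sig = hmac.new(secret, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return f"{header}.{payload}.{_b64url_encode(sig)}"


def verify(token: str, secret: bytes, audience: str, now: float) -> str:
    """Return the `sub` claim, or raise ValueError if any check fails."""
    try:
        header_b64, payload_b64, sig_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        claims = json.loads(_b64url_decode(payload_b64))
        signature = _b64url_decode(sig_b64)
    except (ValueError, TypeError) as exc:
        raise ValueError("malformed token") from exc
    if not isinstance(header, dict) or not isinstance(claims, dict):
        raise ValueError("malformed token")
    if header.get("alg") != "HS256":  # never trust the token's own algorithm choice
        raise ValueError("unexpected algorithm")
    expected = hmac.new(secret, f"{header_b64}.{payload_b64}".encode(),
                        hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        raise ValueError("bad signature")
    if claims.get("exp", 0) <= now:
        raise ValueError("token expired")
    if claims.get("aud") != audience:
        raise ValueError("wrong audience")
    if not claims.get("sub"):
        raise ValueError("missing subject")
    return claims["sub"]


SECRET = b"demo-secret"  # hypothetical shared secret
token = sign({"sub": "user:alice", "aud": "ai-gateway", "exp": 2000}, SECRET)
principal = verify(token, SECRET, audience="ai-gateway", now=1000)
print(principal)  # user:alice
```

The signature is checked before any claim is trusted, and the returned `sub` value is what the rate limiter would resolve to an internal identity.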

What about anonymous or unauthenticated traffic?

Anonymous traffic has no verified identity to limit against. The gateway can apply a per-IP fallback, but the fallback has the same limitations as per-key limits. The structural answer is to require authentication for anything beyond a low-volume trial.

How are quotas administered in practice?

Quotas are usually managed through a policy file or a policy-as-code system, with the operator defining identity classes and per-class allocations. The system supports overrides for specific identities. The administration burden is one-time setup plus periodic adjustment as workloads change.
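A policy-as-code quota definition could be as small as the sketch below: class defaults plus per-identity overrides, merged at lookup time. The class names, the identity `agent:nightly-indexer`, and the token allocations are illustrative assumptions. The call allocations for employees and agents reuse the 1,000 and 10,000 calls per hour mentioned earlier.

```python
# Default allocation per identity class (hypothetical values).
CLASS_DEFAULTS = {
    "employee": {"calls_per_hour": 1_000, "tokens_per_hour": 500_000},
    "agent": {"calls_per_hour": 10_000, "tokens_per_hour": 5_000_000},
    "service_account": {"calls_per_hour": 50_000, "tokens_per_hour": 20_000_000},
}

# Overrides for specific identities; only the listed fields change.
OVERRIDES = {
    "agent:nightly-indexer": {"calls_per_hour": 25_000},
}


def resolve_quota(identity: str, identity_class: str) -> dict[str, int]:
    """Start from the class default, then apply any per-identity override."""
    if identity_class not in CLASS_DEFAULTS:
        raise KeyError(f"unknown identity class: {identity_class}")
    return {**CLASS_DEFAULTS[identity_class], **OVERRIDES.get(identity, {})}


indexer_quota = resolve_quota("agent:nightly-indexer", "agent")
print(indexer_quota)  # {'calls_per_hour': 25000, 'tokens_per_hour': 5000000}
```

Keeping overrides as partial records means a change to a class default still reaches every identity that has not overridden that particular field.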

Does identity-bound limiting work for batch jobs?

Yes, with the same data model. A batch job runs under a service-account identity. The service account has its own allocation. The job's calls increment the service-account counter. The job can be scheduled to run within the service account's allocation, or the allocation can be sized for the expected job throughput.

What about token-based limits versus call-based limits?

Both. The counter typically tracks both calls per window and tokens per window, with separate quotas per dimension. Calls-per-window bounds throughput. Tokens-per-window bounds cost. A workload can hit either limit first depending on the request shape.

How does this interact with provider-side rate limits?

The gateway has to stay inside the provider's overall rate limit for the key it holds. The gateway-imposed per-identity limits should sum to no more than the provider-imposed key limit. When the provider rate limit is reached, the gateway returns the appropriate error to callers and uses the rate-limit signal to back off.
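Both halves of that answer can be checked with a few lines. The sketch below is illustrative only: `unallocated_headroom` and `backoff_delay` are hypothetical helper names, the limits are made-up numbers, and the backoff policy (honor a provider-supplied retry hint if present, otherwise capped exponential delay) is one common choice rather than a prescribed one.

```python
from typing import Optional


def unallocated_headroom(provider_limit: int, allocations: dict[str, int]) -> int:
    """Provider limit minus the sum of per-identity allocations.

    A negative result means the per-identity limits are oversubscribed.
    """
    return provider_limit - sum(allocations.values())


def backoff_delay(attempt: int, retry_after: Optional[float] = None,
                  base: float = 0.5, cap: float = 30.0) -> float:
    """Seconds to wait before retrying after a provider rate-limit response."""
    if retry_after is not None:
        return retry_after  # the provider's own hint takes priority
    return min(cap, base * (2 ** attempt))


headroom = unallocated_headroom(60_000, {"id-001": 10_000, "id-002": 25_000})
print(headroom)            # 25000
print(backoff_delay(3))    # 4.0
print(backoff_delay(10))   # 30.0
```

Running the headroom check whenever the quota table changes catches an oversubscribed configuration before the provider limit does.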