AI Gateway Per-Tenant Rate Limiting: The Buckets That Actually Contain a Runaway Workload
A rate limit on the AI gateway is not a single ceiling. Enterprise deployments run several rate-limit buckets in parallel: per model, per tenant, per user, per tool, per purpose. The buckets interact, and the interaction is where runaway workloads hurt most. This piece walks through the bucket design that contains a runaway agent loop, protects the model provider's shared quota, and produces the audit records the operator needs to explain a rate-limit event.

The Zscaler ThreatLabz 2026 AI Threat Report, published June 17, 2026, reported 410M+ ChatGPT DLP policy violations across the enterprises Zscaler observed, up 99% year over year. A large share of that traffic came from a small share of users and workloads, and the enterprises that survived the load did so because their gateways carried rate-limit buckets that contained the spike.
The five bucket dimensions
The rate-limit buckets on an AI gateway cut across five dimensions: model, tenant, user, tool, and purpose. Each dimension answers a different operational question.
The model dimension asks whether a specific model's shared quota has been exceeded. OpenAI's GPT-4, for example, carries a tokens-per-minute quota on the enterprise's account, and Anthropic's Claude carries a requests-per-minute quota. The gateway's per-model bucket keeps the account below the provider's ceiling.
The tenant dimension asks whether a specific tenant in a multi-tenant SaaS is consuming more than its allocated share. The tenant's contract describes the share, and the gateway enforces it.
The user dimension asks whether a specific human user is running above the policy's per-user ceiling. Runaway loops most often originate from a single human user's session, and the per-user bucket contains the spike.
The tool dimension asks whether a specific tool is being called at a rate above its threshold. High-sensitivity tools like wire transfers, mass notifications, and production writes carry tighter per-tool limits.
The purpose dimension asks whether a specific declared purpose is running above its purpose-level ceiling. A single agent identity can serve multiple purposes concurrently, and the per-purpose limit contains a runaway purpose without penalizing the other purposes under the same identity.
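As a minimal sketch of how one request maps onto the five dimensions, the snippet below derives one bucket key per dimension. The `GatewayRequest` type, its field names, and the `dimension:identifier` key format are illustrative assumptions, not part of any particular gateway.

```python
from dataclasses import dataclass

# Hypothetical request shape: field names are assumptions for illustration.
@dataclass(frozen=True)
class GatewayRequest:
    model: str
    tenant: str
    user: str
    tool: str
    purpose: str

DIMENSIONS = ("model", "tenant", "user", "tool", "purpose")

def bucket_keys(req: GatewayRequest) -> dict[str, str]:
    """Return one bucket key per dimension, formatted as 'dimension:identifier'."""
    return {dim: f"{dim}:{getattr(req, dim)}" for dim in DIMENSIONS}

req = GatewayRequest(
    model="model-a",
    tenant="acme",
    user="u-17",
    tool="wire_transfer",
    purpose="invoice-reconciliation",
)
keys = bucket_keys(req)
# keys["tool"] is "tool:wire_transfer"; every request touches five buckets.
```

Each key then addresses its own counter, so a tight ceiling on `tool:wire_transfer` does not affect other tools called by the same user.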
The interaction pattern between buckets
The buckets are evaluated in parallel on every request. A request that hits any bucket's ceiling gets denied and returns a 429 status to the caller.
The order in which buckets trip over the course of an incident matters for the audit record. The gateway records which bucket tripped, not just that a bucket tripped. The operator responding to a spike incident asks "which bucket first," and the record has to answer.
The interaction produces useful signal patterns. A per-tenant limit trip alongside a per-user limit trip usually indicates a runaway loop in a specific user's session. A per-model limit trip alongside a per-tenant limit trip usually indicates a workload the tenant did not budget for. A per-tool limit trip alone often indicates a plan bug that keeps re-invoking the same tool.
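The evaluation and the signal patterns described above can be sketched as follows. This assumes `current` holds each bucket's count before the request is admitted, and that bucket keys use a hypothetical `dimension:identifier` format. The `SIGNALS` table simply restates the three patterns from the text.

```python
from dataclasses import dataclass, field

@dataclass
class Decision:
    status: int                                   # 200 when allowed, 429 when any bucket tripped
    tripped: list[str] = field(default_factory=list)

def evaluate(current: dict[str, int], ceilings: dict[str, int]) -> Decision:
    """Check every bucket and deny if any bucket is already at its ceiling."""
    tripped = [key for key, count in current.items() if count >= ceilings[key]]
    return Decision(status=429 if tripped else 200, tripped=tripped)

# Combinations of tripped dimensions mapped to the likely cause named in the text.
SIGNALS = {
    frozenset({"tenant", "user"}): "runaway loop in one user's session",
    frozenset({"model", "tenant"}): "workload the tenant did not budget for",
    frozenset({"tool"}): "plan bug re-invoking the same tool",
}

def interpret(decision: Decision) -> str:
    dims = frozenset(key.split(":", 1)[0] for key in decision.tripped)
    return SIGNALS.get(dims, "no known pattern")

current = {"model:model-a": 40, "tenant:acme": 500, "user:u-17": 60,
           "tool:search": 3, "purpose:triage": 10}
ceilings = {"model:model-a": 1000, "tenant:acme": 500, "user:u-17": 60,
            "tool:search": 100, "purpose:triage": 50}
decision = evaluate(current, ceilings)
# decision.tripped names both the tenant and the user bucket, not just "denied".
```

Returning the full list of tripped buckets, rather than stopping at the first one, is what lets the operator read the combination as a signal.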
The sliding-window design
The rate-limit implementation uses a sliding window rather than a fixed window. The sliding window measures the request count over the last N seconds continuously. A fixed window measures the count within the current calendar minute or hour and resets at the boundary, which produces edge effects at window boundaries.
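The edge effect is easy to show with a small worked example. The sketch below uses hypothetical helper names and a ceiling of 100 requests per 60 seconds: it sends 100 requests in the last second of one calendar minute and 100 more in the first second of the next, then compares what a fixed window and a sliding window see.

```python
def fixed_window_allows(timestamps: list[float], limit: int, window: int = 60) -> bool:
    """True if no calendar-aligned window holds more than `limit` requests."""
    counts: dict[int, int] = {}
    for t in timestamps:
        slot = int(t // window)                  # index of the calendar window
        counts[slot] = counts.get(slot, 0) + 1
    return all(c <= limit for c in counts.values())

def max_in_any_sliding_window(timestamps: list[float], window: float = 60) -> int:
    """Largest number of requests that fall inside any `window`-second span."""
    ts = sorted(timestamps)
    best = start = 0
    for end, t in enumerate(ts):
        while t - ts[start] >= window:           # shrink until the span fits the window
            start += 1
        best = max(best, end - start + 1)
    return best

# 100 requests in seconds 59.00-59.99, then 100 more in seconds 60.00-60.99.
burst = [59.0 + i / 100 for i in range(100)] + [60.0 + i / 100 for i in range(100)]

fixed_ok = fixed_window_allows(burst, limit=100)     # True: each minute holds exactly 100
peak = max_in_any_sliding_window(burst, window=60)   # 200: twice the ceiling in two seconds
```

The fixed window admits all 200 requests, while a 60-second sliding window would see 200 against a ceiling of 100.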
The sliding-window implementation runs against a Redis-like store or an in-process data structure. Either one holds the timestamps of recent requests per bucket. The gateway adds a timestamp on each admitted request and evicts timestamps as they age out of the window, so the count always reflects the last N seconds.
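A minimal in-process sketch of the timestamp-based sliding window follows. The class name and the choice to pass `now` explicitly (rather than read a clock) are assumptions made to keep the example deterministic. A shared store would replace the in-memory `deque` in a multi-node gateway.

```python
from collections import defaultdict, deque

class SlidingWindowLog:
    """Exact sliding window: one stored timestamp per admitted request."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self._events: dict[str, deque] = defaultdict(deque)

    def allow(self, key: str, now: float) -> bool:
        events = self._events[key]
        # Evict timestamps that have aged out of the window.
        while events and now - events[0] >= self.window:
            events.popleft()
        if len(events) >= self.limit:
            return False                         # caller should receive a 429
        events.append(now)
        return True

limiter = SlidingWindowLog(limit=2, window_seconds=60)
results = [limiter.allow("user:u-17", t) for t in (0.0, 1.0, 2.0, 60.0, 60.5, 61.0)]
# results == [True, True, False, True, False, True]
```

This sketch does not record denied requests, so a caller that keeps retrying does not extend its own penalty. Counting denials as well is a stricter alternative.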
The design tradeoff is memory versus accuracy. A fine-grained implementation that keeps every timestamp in the window gives exact rate measurements but holds more memory. A coarse-grained implementation that keeps fewer, aggregated counts uses less memory at the cost of coarser measurements.
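One common way to trade accuracy for memory is the sliding-window counter approximation. It keeps two counters per bucket (current and previous fixed window) and weights the previous one by how much of it still overlaps the sliding window. The sketch below is that general technique, not a description of any specific gateway. It assumes `now` never moves backwards.

```python
class SlidingWindowCounter:
    """Approximate sliding window using two counters per key instead of a timestamp log."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self._slots: dict[str, tuple[int, int, int]] = {}  # key -> (slot, current, previous)

    def estimate(self, key: str, now: float) -> float:
        slot = int(now // self.window)
        idx, cur, prev = self._slots.get(key, (slot, 0, 0))
        if slot == idx + 1:                      # moved into the next window
            cur, prev = 0, cur
        elif slot > idx + 1:                     # skipped at least one whole window
            cur, prev = 0, 0
        self._slots[key] = (slot, cur, prev)
        elapsed = (now % self.window) / self.window
        # Weight the previous window by the fraction still inside the sliding window.
        return cur + prev * (1.0 - elapsed)

    def allow(self, key: str, now: float) -> bool:
        if self.estimate(key, now) >= self.limit:
            return False
        slot, cur, prev = self._slots[key]
        self._slots[key] = (slot, cur + 1, prev)
        return True

limiter = SlidingWindowCounter(limit=10, window_seconds=60)
first = [limiter.allow("tenant:acme", 30.0) for _ in range(11)]  # ten allowed, eleventh denied
carried = limiter.estimate("tenant:acme", 90.0)                  # 5.0: half of the previous window
```

The estimate assumes requests were spread evenly across the previous window, which is where the coarseness comes from.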
Enterprise deployments typically run 60-second sliding windows for per-user buckets, 5-minute windows for per-tenant buckets, and 1-hour windows for per-tool buckets. The windows match the operational cadence at each layer.
The graceful-degradation pattern
A rate limit that trips returns a hard 429 status. The caller retries, and the retries hit the same limit until the window slides. The pattern works for well-behaved callers but produces a poor user experience under a spike.
The graceful-degradation pattern adds a soft response mode. Before the hard 429, the gateway returns a cached response for cacheable queries, a fallback to a cheaper model for expensive queries, or a queued response for latency-tolerant callers.
The cached response applies to read-only queries the gateway can prove are safe to serve from cache. The fallback applies to queries where the enterprise's cost policy prefers a cheaper model to a hard denial. The queued response applies to callers with async workflows that tolerate a few seconds of queue.
The graceful mode's audit record captures the degradation the gateway applied. The record shows the operator that the caller received a degraded response, not a normal one.
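The selection logic can be sketched as a small decision function. The field names on `Throttled` and the precedence order (cache, then fallback model, then queue, then hard 429) are assumptions for illustration. A real deployment would take both from policy.

```python
from dataclasses import dataclass
from enum import Enum

class Mode(Enum):
    CACHED = "cached"
    FALLBACK_MODEL = "fallback_model"
    QUEUED = "queued"
    HARD_429 = "hard_429"

# Hypothetical facts the gateway knows about a request that has hit a ceiling.
@dataclass(frozen=True)
class Throttled:
    read_only: bool          # query is provably safe to serve from cache
    cache_hit: bool          # a cached response actually exists
    fallback_allowed: bool   # cost policy prefers a cheaper model over denial
    async_caller: bool       # caller tolerates a few seconds of queueing

def degrade(req: Throttled) -> Mode:
    """Pick the softest response mode that applies before falling back to a hard 429."""
    if req.read_only and req.cache_hit:
        return Mode.CACHED
    if req.fallback_allowed:
        return Mode.FALLBACK_MODEL
    if req.async_caller:
        return Mode.QUEUED
    return Mode.HARD_429

def audit_entry(request_id: str, mode: Mode) -> dict:
    """Record that the caller received a degraded response, and which kind."""
    return {"request_id": request_id, "degraded": mode is not Mode.HARD_429, "mode": mode.value}

mode = degrade(Throttled(read_only=False, cache_hit=False, fallback_allowed=True, async_caller=True))
entry = audit_entry("req-001", mode)
# entry["mode"] is "fallback_model"
```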
The per-tenant fair-share allocation
The per-tenant limit for a multi-tenant SaaS runs against the tenant's allocation. The allocation is set in the tenant's contract or in the enterprise's policy.
The allocation can be absolute (tenant A gets 1000 requests per hour) or proportional (tenant A gets 30% of the model's total capacity). The proportional allocation adjusts as the model's total capacity changes, which happens when the enterprise adds providers or changes contracts.
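The difference between the two allocation styles shows up when capacity changes. The sketch below uses a hypothetical `Allocation` type and expresses the proportional share as an integer percentage to keep the arithmetic exact. The capacity figures are made up for the example.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Allocation:
    absolute_per_hour: Optional[int] = None   # e.g. 1000 requests per hour
    share_percent: Optional[int] = None       # e.g. 30 means 30% of total capacity

def hourly_ceiling(alloc: Allocation, total_capacity_per_hour: int) -> int:
    """Resolve a tenant's allocation into a concrete requests-per-hour ceiling."""
    if alloc.absolute_per_hour is not None:
        return alloc.absolute_per_hour
    if alloc.share_percent is not None:
        return total_capacity_per_hour * alloc.share_percent // 100
    raise ValueError("allocation must set absolute_per_hour or share_percent")

absolute = Allocation(absolute_per_hour=1000)
proportional = Allocation(share_percent=30)

before = hourly_ceiling(proportional, total_capacity_per_hour=10_000)   # 3000
after = hourly_ceiling(proportional, total_capacity_per_hour=15_000)    # 4500
fixed = hourly_ceiling(absolute, total_capacity_per_hour=15_000)        # 1000
```

The proportional tenant's ceiling grows when the enterprise adds capacity, while the absolute tenant's ceiling stays where the contract put it.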
The fair-share pattern includes a burst allowance. A tenant that has been under its allocation for a period gets a burst credit that lets it exceed the per-second rate briefly. The credit is capped and refills at the allocation rate.
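A capped, refilling burst credit is essentially a token bucket. The sketch below is a minimal version under two assumptions: the bucket starts full, and `now` is passed in explicitly. The class and parameter names are illustrative.

```python
class BurstBucket:
    """Burst credit that is capped at `burst_cap` and refills at the allocation rate."""

    def __init__(self, rate_per_second: float, burst_cap: float, now: float = 0.0):
        self.rate = rate_per_second
        self.cap = burst_cap
        self.credit = burst_cap      # assumption: a new tenant starts with full credit
        self.last = now

    def allow(self, now: float) -> bool:
        # Refill for the time elapsed since the last call, never above the cap.
        self.credit = min(self.cap, self.credit + (now - self.last) * self.rate)
        self.last = now
        if self.credit >= 1.0:
            self.credit -= 1.0       # each request spends one unit of credit
            return True
        return False

bucket = BurstBucket(rate_per_second=1.0, burst_cap=5.0)
burst = [bucket.allow(0.0) for _ in range(6)]        # five allowed at once, sixth denied
trickle = [bucket.allow(2.0) for _ in range(3)]      # two seconds refill two credits
rested = [bucket.allow(100.0) for _ in range(6)]     # long idle refills only up to the cap
```

Because the refill rate equals the allocation rate, the tenant's long-run average cannot exceed its allocation no matter how it times its bursts.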
The pattern produces a signal the operator uses for capacity planning. Tenants that consistently exhaust their allocation are candidates for a contract upgrade. Tenants that never approach their allocation are candidates for allocation reduction to free up capacity for the rest.
The audit records that answer the reviewer
The audit records answer four questions the operator or reviewer asks.
Which bucket tripped. The record captures the bucket dimension and identifier. The operator resolves the incident by looking at which dimension is above its ceiling.
Which caller drove the trip. The record captures the caller identity, session identifier, and the request that hit the ceiling. The operator can walk back the caller's session and find the workload pattern.
What the current rate is per bucket. The record captures the current rate against the ceiling for each bucket the request crossed. The record supports the "how close were we to the trip" question the reviewer asks after a near-miss.
Which fallback fired. The record captures whether the gateway returned a cache hit, a fallback model response, a queued response, or a hard 429. The record supports the "did the customer get a response" question the account team asks.
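One possible shape for a record that answers all four questions is sketched below. Every field name here is an assumption chosen for illustration, and the values are made up. The point is that a single record carries the tripped bucket, the caller, the per-bucket rates, and the outcome.

```python
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class BucketReading:
    key: str          # dimension and identifier, e.g. "tenant:acme"
    current: int      # rate observed when the request arrived
    ceiling: int
    tripped: bool

@dataclass(frozen=True)
class RateLimitAuditRecord:
    timestamp: float
    caller: str
    session_id: str
    request_id: str
    readings: tuple   # one BucketReading per bucket the request crossed
    outcome: str      # "cached", "fallback_model", "queued", or "hard_429"

    def tripped_buckets(self) -> list:
        return [r.key for r in self.readings if r.tripped]

    def headroom(self) -> dict:
        """Remaining capacity per bucket, for the near-miss question."""
        return {r.key: r.ceiling - r.current for r in self.readings}

record = RateLimitAuditRecord(
    timestamp=1_700_000_000.0,
    caller="agent-7",
    session_id="sess-42",
    request_id="req-001",
    readings=(
        BucketReading("tenant:acme", 480, 500, False),
        BucketReading("user:u-17", 60, 60, True),
    ),
    outcome="hard_429",
)
line = json.dumps(asdict(record), sort_keys=True)   # one JSON line per request
```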
DeepInspect
This is exactly what DeepInspect enforces at the AI request boundary. DeepInspect sits inline between callers and the LLM APIs they call. The gateway runs the five-dimensional bucket design, applies the sliding-window pattern, and produces the audit records the operator queries after an incident.
The graceful-degradation modes are policy primitives the enterprise composes per deployment. Per-tenant fair-share allocation runs against the tenant's contract. The audit records land in a hash-chained log the operator can query per bucket, per caller, or per incident.
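For readers unfamiliar with the term, a hash-chained log in general links each entry to the hash of the one before it, so a later edit to any entry breaks verification. The sketch below shows that generic idea with SHA-256. It is not DeepInspect's implementation or log format, and the entry fields are assumptions.

```python
import hashlib
import json

GENESIS = "0" * 64   # placeholder "previous hash" for the first entry

def _digest(prev_hash: str, payload: dict) -> str:
    body = json.dumps(payload, sort_keys=True)          # stable serialization
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()

def append_entry(log: list, payload: dict) -> dict:
    """Append a payload, chaining it to the hash of the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    entry = {"prev": prev_hash, "payload": payload, "hash": _digest(prev_hash, payload)}
    log.append(entry)
    return entry

def verify(log: list) -> bool:
    """Recompute every hash in order; any edited entry makes this return False."""
    prev = GENESIS
    for entry in log:
        if entry["prev"] != prev or entry["hash"] != _digest(prev, entry["payload"]):
            return False
        prev = entry["hash"]
    return True

log: list = []
append_entry(log, {"bucket": "user:u-17", "outcome": "hard_429"})
append_entry(log, {"bucket": "tenant:acme", "outcome": "queued"})
intact = verify(log)                      # True for an untouched log
```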
Frequently asked questions
- What are the five bucket dimensions on an AI gateway rate limit?
The five dimensions are model, tenant, user, tool, and purpose. Each dimension answers a different operational question: is a model's quota exhausted, is a tenant above its share, is a user in a runaway loop, is a high-sensitivity tool being called too fast, is a specific purpose exceeding its ceiling. The dimensions are evaluated in parallel on each request.
- Why does the sliding-window design beat a fixed window?
A fixed window measures the count within a calendar minute or hour and produces edge effects at the window boundary. A caller that sends a full ceiling's worth of requests at the end of one calendar minute and another at the start of the next bursts to twice the ceiling around the boundary without tripping either window. A sliding window measures the count over the last N seconds continuously and removes the edge effect.
- What is graceful degradation and when does it apply?
Graceful degradation returns a cached response, a fallback model response, or a queued response before the hard 429 fires. Cached responses apply to safe read-only queries. Fallback model responses apply when the enterprise's cost policy prefers a cheaper model. Queued responses apply to callers with async workflows.
- How does fair-share allocation work per tenant?
The allocation is set in the tenant's contract or in the enterprise's policy, either as an absolute request count or as a proportion of the model's total capacity. The tenant runs against the allocation with a burst allowance that lets it exceed the per-second rate briefly. The burst credit is capped and refills at the allocation rate, so the tenant cannot sustain a burst over the window.
- What audit records does the pattern produce?
The records capture which bucket tripped, which caller drove the trip, the current rate per bucket, and which fallback fired. The tuple answers the operator's post-incident questions and gives the account team the record it needs when a customer asks why a request was throttled.
- How does the pattern interact with the model provider's quota?
The per-model bucket keeps the enterprise's account below the provider's ceiling. The gateway records the current rate against the ceiling on each request, and the operator can spot when the enterprise is approaching its provider quota. The pattern lets the enterprise negotiate the provider quota upgrade proactively rather than after a provider-side throttle.