Why not enforce rate limits inside the application?

Application-side rate limits share the application's failure modes: a misconfigured app bypasses the limit, a compromised app removes the limit, and a multi-app environment has no consistent place to coordinate limits across apps. The gateway sits in front of every app and enforces uniformly.

How granular should per-identity rate limits be?

The granularity depends on the identity model. For human users in an enterprise tenant, limits per user per minute and per hour catch most abuse. For agent identities, request limits per minute and cost limits per hour, both keyed on the agent identity, catch runaway loops. For service identities, limits scoped to the calling workload tier are typical.

What happens when a rate limit fires during a legitimate burst?

The gateway returns a structured error with a retry-after header. The application decides whether to retry, surface the error to the user, or queue the request. The audit record captures the rate-limit decision so the operator can distinguish abuse patterns from legitimate bursts and adjust limits as the workload evolves.
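The application-side decision can be sketched as a small function. This is a minimal illustration, not a prescribed client: the function name, the three action labels, and the thresholds are all assumptions, and it only handles the numeric (delta-seconds) form of the Retry-After header.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class RetryDecision:
    action: str            # "retry", "queue", or "surface" (hypothetical labels)
    delay_seconds: float


def decide_on_rate_limit(headers: Mapping[str, str], attempt: int,
                         max_attempts: int = 3,
                         max_inline_wait: float = 5.0) -> RetryDecision:
    """Choose what to do after the gateway rejects a request as rate-limited."""
    # Assumes the caller passes headers with the canonical "Retry-After" key.
    raw = headers.get("Retry-After", "")
    try:
        delay = max(0.0, float(raw))
    except ValueError:
        # Header missing or in HTTP-date form: fall back to exponential backoff.
        delay = float(2 ** attempt)

    if attempt >= max_attempts:
        return RetryDecision("surface", delay)   # give up, tell the user
    if delay <= max_inline_wait:
        return RetryDecision("retry", delay)     # short wait, retry inline
    return RetryDecision("queue", delay)         # long wait, defer the work
```

An interactive chat would likely keep max_inline_wait small and surface errors quickly, while a batch workload could queue almost everything.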

Does the gateway protect against the provider being down?

Partially. The gateway can fail over to a secondary provider when the primary is unavailable, subject to the cost ceiling and the route configuration. Total provider-side outages are still felt by the application; the gateway minimizes the blast radius rather than eliminating it.

How are agent loop circuit breakers configured?

The default configuration sets a step cap and a cost cap per session, both tunable per route. The step cap protects against runaway loops; the cost cap protects against expensive but short loops. Both produce structured terminations the application can handle.

Where does this fit in OWASP AISVS?

OWASP AISVS Chapter 11 (operational availability) and Chapter 12 (logging and monitoring) cover the verification requirements for the LLM04 surface. The chapters' requirements include documented rate limits, measurable agent loop termination conditions, and per-decision logs of resource decisions.

OWASP LLM04 Model Denial of Service: Gateway Controls That Actually Hold Under Load

OWASP LLM04 covers model denial of service. The Top 10 entry describes a class of attacks where a user-issued prompt drives the model or its serving infrastructure into a resource-exhaustion state. The cost asymmetry is the architectural lever: a single short prompt can produce a long, expensive response; a single API call can chain into many downstream model calls through an agent loop; a single user can saturate a tenant's token budget within minutes.

The category is the cleanest case in the Top 10 where the defense lives at the network boundary in front of the model. Per-identity rate limits, per-route token budgets, output-length caps, and circuit-breaker behavior on agent loops are all enforceable at the layer where every request and response is observable. The application cannot self-enforce these reliably because the application is one of many possible callers and because rate limits implemented inside the application share the application's failure modes.

I want to walk through the LLM04 attack patterns that show up in practice, the gateway controls that hold under load, the metrics that have to be in place for the controls to be tunable, and the residual application work that the gateway cannot replace.

The attack patterns

Four LLM04 patterns recur across production incidents.

Recursive expansion. The attacker issues a prompt that induces the model to produce a long output, then iteratively re-feeds the output back as a new prompt. Each round amplifies. An agent loop with a high max-step setting is particularly exposed because the loop continues until a stop condition fires, and a poisoned stop condition extends the loop further than the architect intended.

Token bombing. The attacker submits a prompt that is short to issue but expensive to process, typically through prompt features that trigger long-context attention costs or that exploit specific model behaviors around repetition. Some prompts produce responses that hit the model's max-token cap on every request, multiplying the per-call cost compared to typical traffic.

Tool-chain amplification. The attacker submits a prompt that causes the agent to invoke a series of tools, each of which calls the model again to process the tool result. A single user request can become ten or more model calls. The amplification factor depends on the tool definitions and the agent loop's branching behavior.

Concurrency exhaustion. The attacker opens a large number of concurrent sessions, each issuing requests at a rate below any per-session rate limit but cumulatively saturating the tenant's quota or the provider's available capacity. Sessions tied to the same identity through stolen API keys or weak per-session authentication amplify the impact.

The gateway controls that hold

Five controls do most of the LLM04 work when enforced at the gateway layer.

Per-identity rate limits. Every request is bound to a verified identity at the gateway. The rate limit is keyed on that identity, not on a session cookie or an API key. A single identity that opens many concurrent sessions still hits the per-identity limit because the limit applies across sessions. The control is implementable as a sliding window over requests per second, tokens per minute, or cost per hour, with separate budgets per dimension.
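A minimal in-memory sketch of a sliding window with separate budgets per dimension follows. The class name, the dimension names, and the assumption that token and cost usage are estimated before the call are all mine; a production gateway would also need shared state across gateway instances.

```python
import time
from collections import defaultdict, deque
from typing import Dict, Optional, Tuple


class IdentityBudgets:
    """Sliding-window budgets keyed on (identity, dimension), not on session."""

    def __init__(self, limits: Dict[str, Tuple[float, float]]):
        # dimension name -> (limit, window length in seconds)
        self.limits = limits
        self._events = defaultdict(deque)   # (identity, dim) -> (timestamp, amount)
        self._totals = defaultdict(float)   # (identity, dim) -> sum inside window

    def _expire(self, key, window: float, now: float) -> None:
        events = self._events[key]
        while events and events[0][0] <= now - window:
            _, amount = events.popleft()
            self._totals[key] -= amount

    def admit(self, identity: str, usage: Dict[str, float],
              now: Optional[float] = None) -> Optional[str]:
        """Return None if admitted, else the name of the first exceeded dimension."""
        now = time.monotonic() if now is None else now
        for dim, (limit, window) in self.limits.items():
            key = (identity, dim)
            self._expire(key, window, now)
            if self._totals[key] + usage.get(dim, 0.0) > limit:
                return dim
        # Record usage only after every dimension has passed.
        for dim in self.limits:
            key = (identity, dim)
            amount = usage.get(dim, 0.0)
            self._events[key].append((now, amount))
            self._totals[key] += amount
        return None
```

Because the key is the identity, opening more sessions under the same identity draws from the same budget.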

Per-route token caps. Each policy route specifies a maximum input token count, a maximum output token count, and a maximum total cost per call. Requests that exceed any cap are rejected at the gateway before the model is invoked. This reduces the cost asymmetry by setting a hard ceiling per request that the application cannot bypass.
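The pre-invocation check might look like the sketch below. The field names and the per-1k-token pricing parameters are hypothetical, and the cost check assumes the gateway prices the worst case, meaning the full requested output length.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RouteCaps:
    max_input_tokens: int
    max_output_tokens: int
    max_cost_usd: float


def check_caps(caps: RouteCaps, input_tokens: int, requested_output_tokens: int,
               in_price_per_1k: float, out_price_per_1k: float) -> Optional[str]:
    """Return the name of the violated cap, or None if the request may proceed."""
    if input_tokens > caps.max_input_tokens:
        return "input_tokens"
    if requested_output_tokens > caps.max_output_tokens:
        return "output_tokens"
    # Worst-case cost: assume the model uses the whole requested output budget.
    worst_case = (input_tokens / 1000) * in_price_per_1k \
        + (requested_output_tokens / 1000) * out_price_per_1k
    if worst_case > caps.max_cost_usd:
        return "cost"
    return None
```

Running the check before the upstream call is what makes the cap a ceiling rather than an after-the-fact alert.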

Agent loop circuit breakers. The gateway tracks the per-session step count and per-session cumulative cost for agent loops that proxy through it. When either threshold is crossed, the gateway terminates the loop and returns a structured error to the application. The application can decide whether to surface the error to the user or retry under a different scope. The control bounds the worst-case cost per agent session.
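A per-session breaker can be sketched in a few lines. The names here (SessionBreaker, LoopTerminated, the terminated_by values) are illustrative assumptions rather than any product's API; the point is that both thresholds are checked before each step is proxied.

```python
from dataclasses import dataclass


class LoopTerminated(Exception):
    """Structured termination the application can catch and handle."""

    def __init__(self, terminated_by: str, steps: int, cost_usd: float):
        super().__init__(f"agent loop terminated by {terminated_by}")
        self.terminated_by = terminated_by
        self.steps = steps
        self.cost_usd = cost_usd


@dataclass
class SessionBreaker:
    step_cap: int
    cost_cap_usd: float
    steps: int = 0
    cost_usd: float = 0.0

    def before_step(self) -> None:
        # Checked before the model call is proxied, so no step past the cap is paid for.
        if self.steps >= self.step_cap:
            raise LoopTerminated("step_cap", self.steps, self.cost_usd)
        if self.cost_usd >= self.cost_cap_usd:
            raise LoopTerminated("cost_cap", self.steps, self.cost_usd)
        self.steps += 1

    def after_step(self, step_cost_usd: float) -> None:
        # Cost is only known once the provider has responded.
        self.cost_usd += step_cost_usd
```

Note that the cost cap can be overshot by at most one step, since a step's cost is recorded after it completes; a per-call cost ceiling bounds that overshoot.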

Concurrency limits per identity. Beyond rate limits on request count, the gateway caps the number of concurrent in-flight requests per identity. The cap protects against burst-then-wait attack patterns that stay below the rate limit on average but exhaust capacity in short windows.
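An in-flight cap is a counter rather than a window. The sketch below assumes a single gateway process and a hypothetical ConcurrencyLimiter class; the caller is responsible for calling release when the upstream response finishes or fails.

```python
import threading
from collections import defaultdict


class ConcurrencyLimiter:
    """Caps the number of in-flight requests per identity."""

    def __init__(self, max_in_flight: int):
        self.max_in_flight = max_in_flight
        self._lock = threading.Lock()
        self._in_flight = defaultdict(int)

    def try_acquire(self, identity: str) -> bool:
        with self._lock:
            if self._in_flight[identity] >= self.max_in_flight:
                return False          # reject instead of blocking the caller
            self._in_flight[identity] += 1
            return True

    def release(self, identity: str) -> None:
        with self._lock:
            if self._in_flight[identity] > 0:
                self._in_flight[identity] -= 1
```

Rejecting immediately, rather than queueing, keeps a burst from one identity from holding gateway resources that other identities need.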

Provider failover with cost ceilings. When the primary provider is saturated, the gateway can route to a secondary provider, but only when the secondary is within a configured cost-per-request ceiling. The control protects against attacks that try to force the system onto an expensive provider as a side effect of saturating the cheap one.
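The routing rule can be sketched as follows, assuming the first provider in the list is the primary and that the gateway has an estimated cost per request for each provider. The type and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Provider:
    name: str
    est_cost_per_request_usd: float
    available: bool


def pick_provider(providers: Sequence[Provider],
                  cost_ceiling_usd: float) -> Optional[Provider]:
    """Use the primary if it is up; otherwise the first secondary under the ceiling."""
    if not providers:
        return None
    primary, *secondaries = providers
    if primary.available:
        return primary
    for candidate in secondaries:
        if candidate.available and candidate.est_cost_per_request_usd <= cost_ceiling_usd:
            return candidate
    # No affordable secondary: fail the request rather than overspend.
    return None
```

Returning None when only an over-ceiling provider remains is the behavior that stops an attacker from forcing traffic onto the expensive route.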

The metrics that make the controls tunable

The controls only work if the operator has visibility into the dimensions the controls operate on. Five metrics need to be instrumented at the gateway and exposed to the monitoring stack.

The metric names follow Prometheus conventions. The dimensions support per-identity, per-route, and per-provider slicing. The terminated_by label on the agent loop metric distinguishes natural termination, step cap, cost cap, and policy block. The decision label on the request counter distinguishes permit, deny, redact, and rate-limited rejections.
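As an illustration of that shape, here is a sketch of five metrics and a tiny counter that renders lines in the Prometheus text format. Every metric name below is hypothetical, chosen only to match the dimensions and labels described above, and a real gateway would use a Prometheus client library rather than this hand-rolled store.

```python
from collections import Counter

# Hypothetical metric names: name -> (type, label names).
METRICS = {
    "llm_gateway_requests_total": ("counter", ("identity", "route", "provider", "decision")),
    "llm_gateway_tokens_total": ("counter", ("identity", "route", "provider", "direction")),
    "llm_gateway_cost_usd_total": ("counter", ("identity", "route", "provider")),
    "llm_gateway_in_flight_requests": ("gauge", ("identity",)),
    "llm_gateway_agent_loop_steps": ("histogram", ("route", "terminated_by")),
}

_counts = Counter()


def inc(name: str, amount=1, **labels) -> None:
    """Increment a counter metric; gauges and histograms are out of scope here."""
    kind, label_names = METRICS[name]
    if kind != "counter":
        raise ValueError(f"{name} is a {kind}, not a counter")
    if set(labels) != set(label_names):
        raise ValueError(f"{name} expects labels {label_names}")
    _counts[(name, tuple(labels[k] for k in label_names))] += amount


def render() -> str:
    """Render the counters as Prometheus-style text lines."""
    lines = []
    for (name, values), count in sorted(_counts.items()):
        pairs = ",".join(f'{k}="{v}"' for k, v in zip(METRICS[name][1], values))
        lines.append(f"{name}{{{pairs}}} {count}")
    return "\n".join(lines)
```

Per-identity labels can produce high cardinality in a large tenant, so some deployments may prefer to aggregate identities into tiers for the metrics and keep the exact identity in the audit record.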

With those metrics in place, the operator can set rate limits at percentiles of normal traffic, alert on identities that approach but do not exceed limits, and tune circuit breakers based on actual agent loop distributions rather than guesses.
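Deriving a limit from observed traffic can be as simple as the sketch below. The function names, the headroom multiplier, and the 80% alert ratio are assumptions for illustration; the samples would be per-identity usage values (for example, requests per minute) pulled from the monitoring stack.

```python
import statistics
from typing import Sequence


def suggest_limit(samples: Sequence[float], percentile: int = 99,
                  headroom: float = 1.5) -> float:
    """Suggest a limit as a percentile of normal traffic times a headroom factor."""
    # quantiles(n=100) returns 99 cut points; index p-1 is the p-th percentile.
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return cuts[percentile - 1] * headroom


def approaching_limit(current_usage: float, limit: float,
                      alert_ratio: float = 0.8) -> bool:
    """True when usage is close to the limit but has not exceeded it."""
    return alert_ratio * limit <= current_usage <= limit
```

The headroom factor is the knob that trades false rejections of legitimate bursts against the size of the window left open to abuse.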

What sits outside the gateway boundary

Two LLM04 surfaces sit outside what a gateway can enforce.

Infrastructure-level resource exhaustion at the model provider. If the attacker is targeting the OpenAI or Anthropic infrastructure directly through some out-of-band channel, the gateway cannot intervene. The gateway operates on the traffic between the enterprise's applications and the provider's API endpoints. Attacks against the provider's underlying compute are the provider's problem.

Cost exhaustion in workloads that bypass the gateway. A workload that calls the model API directly, without routing through the gateway, is invisible to the gateway's rate limits. The control is only as effective as the perimeter that forces all traffic through the gateway. This is the standard reason every AI provider's API key needs to be managed centrally and never distributed to applications that can issue direct calls.

The application work the gateway cannot replace

The gateway sets ceilings; the application still needs to handle the ceiling responses gracefully. When the gateway returns a rate-limit error, the application has to decide whether to surface the error to the user, queue the request for retry, or fall back to a degraded response. The decision is application-specific and depends on the user's intent.

When the gateway terminates an agent loop at the step cap, the application needs to surface a meaningful message to the user, log the termination as a signal that the agent's task framing produced runaway behavior, and feed that signal back into agent prompt design. A loop that hits the cap on benign traffic is a prompt-engineering bug. A loop that hits the cap on attacker traffic is a security event.

How LLM04 maps to regulatory and availability requirements

EU AI Act Article 15 requires high-risk AI systems to achieve an appropriate level of accuracy, robustness, and cybersecurity in light of their intended purpose, and to perform consistently in those respects throughout the system lifecycle. A DoS that takes down the AI component of a high-risk system is an Article 15 failure. The compliance posture requires documented controls and incident response procedures for resource-exhaustion attacks.

ISO 42001 requires availability commitments for AI systems and the documentation of the controls that protect those commitments. Per-identity rate limits, agent loop circuit breakers, and provider failover are concrete controls that satisfy the AIMS availability requirements when documented in the management system.

SOC 2 Type II audits can include availability as one of the Trust Services Criteria. AI workloads are increasingly in scope for SOC 2 audits, and the gateway-layer controls described above produce the evidence auditors look for: documented thresholds, logged decisions, post-incident reviews, and tuning history.

DeepInspect

This is the layered control DeepInspect provides for the LLM04 surface. DeepInspect sits inline between authenticated users or agents and the LLMs they call, binds every request to a verified identity, and enforces per-identity rate limits, per-route token caps, agent loop circuit breakers, and concurrency limits independently of the calling application.

The controls hold under load because they are evaluated at the gateway against the actual identity and the actual request shape, not against application-supplied metadata that an attacker can manipulate. Every rate-limit and circuit-breaker decision produces a per-decision audit record with identity, route, provider, decision, and the metric values that triggered the decision. The forensic trail is available to the security team independent of the application that issued the request.

If you are mapping the OWASP LLM Top 10 controls against your current architecture and your LLM04 coverage depends on application-side rate limiting that the application controls, let's talk today.