Why did OWASP rename Model Theft to Unbounded Consumption?

The 2023 LLM10 "Model Theft" category captured a real but narrow risk: model exfiltration via API. Community feedback ahead of the 2025 list was that the dominant production incidents involved cost runaway, latency degradation, and wallet drain, not model theft. The new framing covers all three and keeps model extraction in scope where it applies.

Do model provider rate limits already cover LLM10?

Provider rate limits cover the provider's side. They protect the provider's infrastructure. They do not give the deployer per-identity isolation, per-tenant budgets, or per-model authorization. The provider sees one customer. The deployer needs to see individual identities under that customer.

Where do circuit breakers fit?

Circuit breakers belong at the upstream-call layer. If the model provider's API is returning errors or high latency, the circuit breaker temporarily stops forwarding requests. The gateway can implement the breaker itself; the policy decision is which response code callers receive while the breaker is open. A circuit breaker complements rate limiting; it does not replace it.
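
A minimal sketch of the breaker described above, assuming a consecutive-failure threshold and a fixed cooldown. The class name, the thresholds, and the choice of 503 as the open-breaker response are illustrative assumptions, not a description of any particular gateway.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Hypothetical upstream-call breaker; thresholds are illustrative."""

    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        """Return True if a request may be forwarded upstream."""
        if self.opened_at is None:
            return True
        # After the cooldown, requests flow again; one more failure re-opens.
        return self.clock() - self.opened_at >= self.cooldown_s

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


def status_when_open() -> int:
    # Assumed policy choice: 503 signals that the upstream is unavailable.
    return 503
```

The gateway would call `allow()` before each upstream call and return `status_when_open()` to the caller when it is false.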

How are budgets implemented at the gateway?

A budget is a counter per identity per billing window. Each request decrements the counter by the request's cost (input tokens plus weighted output tokens). When the counter reaches zero, subsequent requests are refused with a clear error code. The counter resets at the start of the next billing window or on explicit reset.
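
The counter can be sketched as follows. This is an illustration under stated assumptions: the output-token weight of 3, the `BudgetLedger` name, and the choice to charge after the response (when the output token count is known) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict

OUTPUT_WEIGHT = 3  # assumed: one output token costs as much as three input tokens


@dataclass
class BudgetLedger:
    allocation: int  # cost units per identity per billing window
    remaining: Dict[str, int] = field(default_factory=dict)

    def allowed(self, identity: str) -> bool:
        """Checked before the upstream call; refuses once the counter is at zero."""
        return self.remaining.setdefault(identity, self.allocation) > 0

    def charge(self, identity: str, input_tokens: int, output_tokens: int) -> None:
        """Applied after the response, when both token counts are known."""
        cost = input_tokens + OUTPUT_WEIGHT * output_tokens
        left = self.remaining.setdefault(identity, self.allocation)
        self.remaining[identity] = max(0, left - cost)

    def reset(self, identity: str) -> None:
        """Start of the next billing window, or an explicit reset."""
        self.remaining[identity] = self.allocation
```

Because the charge lands after the response, the last admitted request can overshoot the allocation; a stricter variant would reserve the requested output limit up front.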

What about agentic workflows that fan out into many model calls?

Agentic LLM10 risk is higher because the agent loop can amplify token consumption per user request. The gateway's per-identity budget caps catch the runaway at the identity boundary. A per-task budget that the agent passes through as part of the request context lets the gateway enforce per-task limits too.
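
A rough sketch of a per-task limit, assuming the agent sends a task identifier and a requested limit in the request context. The names `TaskBudgets`, `policy_cap`, and `admit` are hypothetical, and the cap exists because a limit supplied by the agent should not be trusted on its own.

```python
from typing import Dict


class TaskBudgets:
    """Tracks tokens admitted per task; a policy cap bounds agent-supplied limits."""

    def __init__(self, policy_cap: int) -> None:
        self.policy_cap = policy_cap
        self.spent: Dict[str, int] = {}

    def admit(self, task_id: str, task_limit: int, estimated_tokens: int) -> bool:
        limit = min(task_limit, self.policy_cap)
        used = self.spent.get(task_id, 0)
        if used + estimated_tokens > limit:
            return False  # this call would push the task past its limit
        self.spent[task_id] = used + estimated_tokens
        return True
```

The per-identity budget still applies on top of this; the task limit only stops a single agent loop from consuming the whole identity allocation.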

How does this map to the August 2 EU AI Act enforcement?

Article 14 human oversight and Article 26 deployer obligations both implicate LLM10-class controls. An LLM10 incident on a high-risk AI system can trigger the deployer's obligation to suspend use. The gateway audit trail is the evidence the deployer needs to investigate and resume operation.

OWASP LLM10 Unbounded Consumption: The Cost, Latency, and DoS Failure Mode

OWASP LLM10 Unbounded Consumption, introduced in the 2025 OWASP Top 10 for LLM Applications, replaced the older "Model Theft" category and captures a broader failure mode: a workflow consumes model resources without effective bounds, leading to runaway cost, degraded latency, denial of service for legitimate callers, or wallet-drain attacks against pay-per-token APIs. The renaming reflected a real production pattern. The most frequently reported LLM10-class incidents in 2024-2025 were not model exfiltration; they were cost-of-inference incidents that took down internal AI services or generated five-figure unexpected bills. I want to walk through the failure modes LLM10 actually covers, the controls a policy gateway enforces at the request boundary, and the operational patterns that hold up under load.

The 2025 OWASP guidance lists six specific consumption patterns under LLM10: variable-length input flooding, denial of wallet via overpaying for legitimate-looking requests, continuous generation loops, resource-intensive model selection, side-channel cost amplification, and unauthorized model usage. Each pattern lands in production differently and requires a different control mix.

What unbounded consumption looks like in production

Here are a few representative incidents from public postmortems.

A SaaS company shipped an AI feature backed by GPT-4 with no per-tenant rate limits in the API gateway. A bug in a customer's automated workflow caused a retry loop that called the feature 47,000 times in two hours. The bill was $18,400 over baseline before the on-call engineer noticed. The bug was the customer's; the cost was the SaaS company's.

An internal copilot at a financial-services firm allowed engineers to call any model in the provider's catalog. One engineer ran a benchmark on the most expensive model variant for three weeks. The benchmark consumed 31% of the team's monthly inference budget. No policy prevented the call; no alert fired until the budget was 80% spent.

A public AI tool was hit with a coordinated abuse campaign that submitted thousands of long-context prompts designed to maximize token consumption per request. The legitimate user experience degraded to 60-second response times for 14 hours before the team added input-length caps.

The pattern across these incidents is that the application or workflow had no enforced boundary on consumption, the model provider's rate limits were too coarse to catch the failure mode, and the audit trail of "who consumed what" was reconstructable only with significant effort after the fact.

The control points a policy gateway enforces

A gateway between authenticated callers and the model has three relevant control points for LLM10.

The first is per-identity rate limiting, expressed in calls per minute, calls per hour, tokens per minute, and tokens per hour. The identity is the caller's verified principal, which can be a user, an agent, a service account, or a per-tenant identity. The limits apply at the gateway regardless of what the application logic does. A retry loop in the application that calls the gateway 47,000 times hits the gateway's rate limit, and the gateway returns errors to the caller instead of forwarding the excess requests to the model.
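
One way to sketch per-identity limiting is a pair of token buckets per principal, one counting calls and one counting tokens. This is an assumption about technique, not a claim about how any gateway implements it; the class names are hypothetical, and only the per-minute window is shown (an hourly window would add a second pair of buckets).

```python
import time
from typing import Callable, Dict


class Bucket:
    """Token bucket: holds up to `capacity` units and refills continuously."""

    def __init__(self, capacity: float, per_minute: float, now: float) -> None:
        self.capacity = capacity
        self.rate = per_minute / 60.0  # units per second
        self.level = capacity
        self.updated = now

    def refill(self, now: float) -> None:
        gained = (now - self.updated) * self.rate
        self.level = min(self.capacity, self.level + gained)
        self.updated = now


class IdentityRateLimiter:
    def __init__(self, calls_per_min: int, tokens_per_min: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.calls_per_min = calls_per_min
        self.tokens_per_min = tokens_per_min
        self.clock = clock
        self.calls: Dict[str, Bucket] = {}
        self.tokens: Dict[str, Bucket] = {}

    def allow(self, principal: str, tokens: int) -> bool:
        now = self.clock()
        calls = self.calls.setdefault(
            principal, Bucket(self.calls_per_min, self.calls_per_min, now))
        toks = self.tokens.setdefault(
            principal, Bucket(self.tokens_per_min, self.tokens_per_min, now))
        calls.refill(now)
        toks.refill(now)
        # Check both limits before deducting, so a refusal consumes nothing.
        if calls.level < 1 or toks.level < tokens:
            return False
        calls.level -= 1
        toks.level -= tokens
        return True
```

The buckets are keyed by verified principal, so a retry loop under one identity exhausts only that identity's buckets.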

The second is model authorization per identity. Some identities are authorized to call the cheap models; some are authorized to call the expensive models; some are authorized to call both with different budgets. The authorization lives in the policy layer and is evaluated per request. An engineer running a benchmark on the most expensive model is gated by that authorization, and the gate fires before the call.
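
As an illustration, the authorization check can be a default-deny lookup evaluated per request. The role names and model names below are placeholders, not real identifiers.

```python
from typing import Dict, FrozenSet

# Hypothetical policy table: which models each role may call.
MODEL_POLICY: Dict[str, FrozenSet[str]] = {
    "role:engineer": frozenset({"small-model"}),
    "role:ml-benchmarks": frozenset({"small-model", "large-model"}),
}


def authorize_model(role: str, model: str) -> bool:
    """Default deny: unknown roles and unlisted models are refused."""
    return model in MODEL_POLICY.get(role, frozenset())
```

Different budgets per model could be added by mapping each allowed model to its own allocation instead of using a plain set.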

The third is request-shape validation. The gateway can enforce caps on input token count, maximum context window allocation, and output token request limits. A prompt designed to maximize token consumption per request is bounded by the input cap before it reaches the model. The legitimate user experience does not degrade because the bound applies uniformly.
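
A small sketch of request-shape validation, assuming the gateway already has an input token count for the request. The limit values, the reason strings, and the decision to refuse requests that set no output limit are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ShapeLimits:
    max_input_tokens: int
    max_output_tokens: int


def check_shape(input_tokens: int, requested_output_tokens: Optional[int],
                limits: ShapeLimits) -> Optional[str]:
    """Return a refusal reason, or None if the request is within bounds."""
    if input_tokens > limits.max_input_tokens:
        return "input_too_long"
    if requested_output_tokens is None:
        # Assumed policy: callers must state an output limit explicitly.
        return "output_limit_required"
    if requested_output_tokens > limits.max_output_tokens:
        return "output_limit_too_high"
    return None
```

A gateway could also clamp the requested output limit to the cap instead of refusing; which behavior is right depends on whether silent truncation is acceptable to callers.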

What the gateway cannot do alone

A gateway sees the request and the response. It does not have visibility into the application's intent. A legitimate batch processing job that calls the model 10,000 times in an hour looks identical at the request layer to an abusive workflow doing the same thing. The bound the gateway applies is policy, not detection. The policy needs to be set with the intended workloads in mind.

A gateway also does not handle the model provider's own quota mechanics. If the provider rate-limits the gateway as a single tenant, the gateway sees errors from the upstream and has to surface them gracefully. The gateway's per-identity limits should fit inside the provider's overall quota. Otherwise, at peak, the provider's limit trips before any gateway limit does, and the throttling lands on whichever caller happens to arrive next.
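
The sizing check is simple arithmetic. The sketch below assumes limits are expressed in tokens per minute and that the conservative goal is for the sum of per-identity caps to stay within the provider quota; the function name and figures are hypothetical.

```python
from typing import Mapping


def quota_headroom(per_identity_tpm: Mapping[str, int], provider_tpm: int) -> int:
    """Provider tokens-per-minute quota left if every identity runs at its cap.

    A negative result means the gateway limits are oversubscribed: the
    provider can throttle before any gateway limit is reached.
    """
    return provider_tpm - sum(per_identity_tpm.values())
```

For example, three identities capped at 40,000, 40,000, and 30,000 tokens per minute against a 100,000 provider quota give a headroom of -10,000. Some deployments oversubscribe deliberately on the assumption that identities rarely peak together; that is a policy choice, and it weakens the isolation guarantee.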

Per-tenant cost isolation

A common LLM10 pattern in multi-tenant SaaS is the noisy-neighbor problem. One tenant's workload can degrade another tenant's experience by consuming a disproportionate share of the model provider's rate limit. The gateway can isolate tenants by enforcing per-tenant rate limits and per-tenant token budgets, refusing requests once the tenant's allocation is exhausted.

The implementation pattern is straightforward. The gateway extracts a tenant identifier from the request (header, JWT claim, or routing rule). The policy table maps tenant identifier to allocation. The decision is made before the upstream call. The audit record includes the tenant identifier so that downstream cost attribution is mechanical.
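
The steps above can be sketched as follows. The claim name `tenant`, the header `x-tenant-id`, the `/t/<tenant>/...` routing convention, and the allocation figures are all assumptions for illustration. The verified JWT claim is checked first because a plain header is easier to spoof.

```python
from typing import Dict, Mapping, Optional

# Hypothetical policy table: tenant identifier -> tokens per billing window.
TENANT_ALLOCATIONS: Dict[str, int] = {"acme": 1_000_000, "globex": 250_000}


def extract_tenant(claims: Mapping[str, str], headers: Mapping[str, str],
                   path: str) -> Optional[str]:
    """Resolve the tenant from verified claims, then header, then route."""
    if "tenant" in claims:
        return claims["tenant"]
    if "x-tenant-id" in headers:  # assumes header names are lower-cased
        return headers["x-tenant-id"]
    parts = path.strip("/").split("/")
    if len(parts) >= 2 and parts[0] == "t":
        return parts[1]
    return None


def admit(tenant: Optional[str], used: int, requested: int) -> bool:
    """Decide before the upstream call; unknown tenants are refused."""
    if tenant is None:
        return False
    allocation = TENANT_ALLOCATIONS.get(tenant)
    if allocation is None:
        return False
    return used + requested <= allocation
```

The resolved tenant identifier is the value that would be written into the audit record for cost attribution.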

Wallet drain and side-channel cost amplification

Two LLM10 patterns specific to pay-per-token APIs deserve their own treatment.

Wallet drain is the case where an attacker submits requests that maximize the bill without producing useful work for the attacker. The economic incentive is to damage the target organization's budget. The defense is per-identity rate limiting plus per-identity budget caps. The budget caps trigger a hard refusal once the identity has consumed its allocation in a billing window. The attacker's incentive disappears because the bill stops growing.

Side-channel cost amplification is the case where a long-context prompt or a high-output-token request consumes significantly more resources than the request shape suggests. The defense is the request-shape validation described above, plus telemetry that surfaces unusual token-per-request ratios. The gateway is the layer that has the per-request token count for both input and output.
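
One simple form of that telemetry is to compare each request's total token count against the median for the same window. The factor of 5 and the function name are assumptions; a production system would likely use a rolling baseline per identity or per route.

```python
from statistics import median
from typing import List, Sequence


def flag_unusual(total_tokens: Sequence[int], factor: float = 5.0) -> List[int]:
    """Return indexes of requests whose token count exceeds factor x the median."""
    if not total_tokens:
        return []
    baseline = median(total_tokens)
    return [i for i, t in enumerate(total_tokens) if t > factor * baseline]
```

The median is used as the baseline because a handful of very large requests would pull a mean upward and hide themselves.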

How LLM10 maps to EU AI Act obligations

The EU AI Act Article 26 deployer obligation includes monitoring operation and suspending use when the system presents risks. An LLM10 incident that takes down a high-risk AI system creates a deployer-side obligation to suspend use and investigate. The gateway audit trail is the evidence the deployer needs. The records show which identities consumed the resources, which model versions were involved, and the timing of the consumption spike.

The Act's logging obligations require records of the system's operation to be kept: Article 19 sets the retention duty for providers, and Article 26(6) sets it for deployers. An LLM10 incident is part of the operational record. The gateway log, retained for at least the six-month minimum, is the artifact the deployer presents to the regulator.

A production pattern that holds up

Three principles recur across deployer implementations that survived an LLM10 incident.

First, the rate limits are per identity, not per IP address or per API key alone. Identity is the durable abstraction; IP addresses and keys rotate. The gateway extracts identity from the verified principal and applies the limit per principal.

Second, the budget caps are enforced in band, not by alert. A budget that fires an alert at 80% consumption does not prevent the runaway. A budget that refuses the request at 100% consumption does. The refusal at the gateway is the deterministic control.

Third, the audit record includes the consumption metrics. Every request log includes the input token count, the output token count, the model selected, the policy version, and the budget remaining for the identity at the moment of the request. Post-incident reconstruction is mechanical when the metrics are in the log.
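
A sketch of what such a record might look like as one JSON line per request. The field names and the `decision` and `timestamp` fields are assumptions added for illustration; the source only specifies the token counts, model, policy version, and remaining budget.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AuditRecord:
    identity: str
    model: str
    input_tokens: int
    output_tokens: int
    policy_version: str
    budget_remaining: int  # for this identity at the moment of the request
    decision: str          # assumed field, e.g. "allow" or "refuse"
    timestamp: str         # assumed field, ISO 8601 in UTC


def to_log_line(record: AuditRecord) -> str:
    """Serialize with stable key order so log lines diff and grep cleanly."""
    return json.dumps(asdict(record), sort_keys=True)
```

With these fields in every line, per-identity consumption over an incident window is a filter and a sum over the log, not a reconstruction exercise.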

DeepInspect

This is the LLM10 control surface DeepInspect operates on. DeepInspect sits inline between authenticated users or agents and the LLMs they call, applies per-identity rate limits and budget caps, validates request shape against configured caps, and writes a per-decision audit record with consumption metrics attached.

For the gateway-enforceable subset of LLM10 (rate limiting, model authorization, request-shape validation, per-tenant isolation, wallet-drain defense), DeepInspect produces the in-band controls and the audit records. The application-side and provider-side controls remain where they sit; DeepInspect adds the deterministic enforcement layer between them.

If you are mapping OWASP LLM10 controls for a production deployment or for an August 2 deployer-readiness review, book a demo today.