Why a per-provider breaker instead of one global breaker?

The blast radius isolation is the point. A degraded provider should affect only the routes that depend on it, not every AI route in the gateway. A global breaker would short-circuit healthy providers when one provider fails. The per-provider design lets the gateway route around the failed provider through a fallback policy, which is the multi-provider failover input. The per-route override then handles the case where the same provider serves different routes at different sensitivity levels.

How does the breaker interact with retry logic?

The retry logic sits inside the breaker, not outside. When the breaker is closed, the calling code may retry on transient errors with backoff. When the breaker is open, the retry stops at the gateway because the gateway short-circuits before the retry reaches the provider. The combined behavior prevents the retry storm that aggravates a degraded provider. The classical Hystrix pattern documented this in 2012, and the same logic applies to LLM traffic.

What happens to in-flight requests when the breaker trips?

In-flight requests complete on the provider side. The breaker does not cancel them. The trip affects only new requests after the state transition. The completion records still write to the audit substrate. The post-trip telemetry then shows the tail of the in-flight requests as the breaker enters the cooldown window. The post-incident review reads the tail to understand the degradation duration.

Does the prompt-injection rate threshold work for all providers?

The threshold reads from the inspection layer's own classifier, which runs on the proxy regardless of the upstream provider. The signal is provider-agnostic. The threshold trips when the rate exceeds 5% over 60 seconds, which indicates either a coordinated injection campaign or a classifier-side issue. The breaker then short-circuits the affected route while the security team investigates. The classifier accuracy is the upstream gating control on the threshold.

How does the half-open probe avoid surprising end users?

The probe carries a header flag the inspection layer reads, and the gateway routes the probe through a designated test route or against an internal synthetic prompt. The probe does not flow user traffic during the recovery test. Once the four success criteria pass, the breaker closes and the production routes resume. The user-facing impact during half-open is the same as during open: the short-circuit error code the gateway returns.

AI gateway circuit breakers: limiting blast radius when an LLM provider degrades

The microservice circuit breaker pattern wraps a downstream call with a state machine that flips to open after a defined error rate, blocks calls for a cooldown window, then probes with a limited budget before flipping back to closed. The pattern dates to Michael Nygard's 2007 work and the Netflix Hystrix implementation in 2012. Applied to LLM traffic, the breaker carries three additional inputs over the classical implementation. Token cost, prompt-injection rate, and audit-write backpressure each become tripable signals. The breaker telemetry then becomes the source for the DORA Article 19 major ICT incident notification window.

I want to walk through the per-provider state machine semantics, the trip thresholds the AI breaker carries beyond the classical microservice version, the half-open probe budget, the breaker telemetry that feeds incident reporting, and the recovery audit trail the half-open transition writes.

State machine semantics

The breaker sits per-provider per-route. Three states. Closed is the default: requests flow through, the breaker measures the signals. Open is the failure mode: requests are short-circuited at the gateway, the calling application sees a defined error code, the LLM provider receives no traffic. Half-open is the recovery probe: a limited request budget flows through, the breaker measures whether the signals recover, the breaker transitions back to closed on success or back to open on failure.

The state transitions carry timestamps and a reason code. The closed-to-open transition records the trip threshold that fired. The open-to-half-open transition records the cooldown timer expiry. The half-open-to-closed transition records the probe-success criterion. The half-open-to-open transition records the probe failure. Each transition is a recovery audit trail row.

Trip thresholds

The classical microservice breaker trips on error rate over a sliding window. The AI gateway breaker carries the same input plus four AI-specific signals. The combined trip logic checks each signal against its threshold per evaluation window, typically 10 seconds.

The error-rate threshold trips when the provider's HTTP 5xx rate exceeds 25% over a 30-second window. The latency threshold trips when the p99 response time exceeds 8000 ms. The token-cost threshold trips when the rolling 5-minute token spend exceeds 200% of the seven-day baseline. The prompt-injection threshold trips when the classifier-detected injection attempt rate exceeds 5% of requests over a 60-second window. The audit-write backpressure threshold trips when the audit queue depth exceeds 10,000 records.

The thresholds are per-provider per-route configurable. The defaults are conservative. The breaker logs every threshold evaluation, not just the trips. The threshold evaluation log is the input to the post-incident review.

Half-open probe budget

The half-open state is the careful test that the provider has recovered. The probe budget is the number of requests the breaker allows through during the half-open window. The default is 10 requests over 30 seconds.

The probe requests carry a header flag the inspection layer reads. The flag tells the policy engine the request is a recovery probe. The policy engine applies the same authorization but treats the response telemetry as the probe result, not as production traffic. The probe success criterion requires the error rate under 5%, the p99 latency under the closed-state baseline, the token cost in the expected range, and the audit-write latency under 100 ms. All four criteria must pass for the transition to closed.

A probe failure transitions back to open with the cooldown doubled, up to a 30-minute cap. The pattern follows the exponential backoff used in the Polly resilience library and the standard cloud-provider retry guidance.

Breaker telemetry

The breaker emits four metric streams. The state stream records the current state per provider per route, sampled every second. The transition stream records each state change with the reason code and the threshold evaluation that fired. The throughput stream records the requests-allowed and the requests-short-circuited count. The signal stream records the input metrics, the error rate, the p99 latency, the token cost rolling sum, the injection rate, and the audit queue depth.

The four streams feed three dashboards. The operations dashboard shows the current state across all providers, color-coded. The capacity dashboard shows the short-circuited request count tied to the gateway routing fallback, which is the input to the multi-provider failover. The incident dashboard pulls the transition stream into the DORA Article 19 incident detection pipeline.

DORA Article 19 incident reporting tie-in

DORA Article 19 requires financial entities to report major ICT-related incidents to the relevant competent authority. The initial notification window is four hours from classification. The intermediate report follows within 72 hours. The final report follows within one month.

A circuit-breaker trip on a primary AI provider is a candidate ICT incident under the DORA Article 18 classification criteria when the trip affects critical services, persists past the cooldown, or generates client-facing degradation. The breaker telemetry feeds the classification pipeline with three inputs: the trip timestamp, the affected route population, and the cooldown progression. The incident response team receives the candidate notification within minutes of the trip. The four-hour Article 19 clock starts on classification, not on detection, but the breaker telemetry compresses the time to classification to under 15 minutes.

YAML config example

The breaker config sits per-provider in the gateway. Each route inherits the provider default and can override per-threshold.

DeepInspect

This is the inspection layer that hosts the AI gateway circuit breaker per provider per route. DeepInspect sits at the AI request boundary as a stateless proxy between authenticated users or agents and any LLM endpoint. The breaker state machine, the trip threshold evaluation, the half-open probe budget, and the breaker telemetry all live on the proxy. The audit-write backpressure signal feeds directly from the proxy's own audit queue depth, which closes the loop on the substrate the cluster 1 reports depend on.

The breaker transitions become rows in the recovery audit trail with the same signature chain the per-decision records carry. The DORA Article 19 classification pipeline reads the transition stream and produces the candidate notification within 15 minutes of the trip. The financial-entity incident response team receives the input the Article 19 four-hour clock requires before the clock starts.

Book a technical deep dive at deepinspect.ai.