Stateless vs Stateful AI Proxy: Which Architecture Holds Up Under Production Load and Audit
A stateless AI proxy makes each policy decision from the contents of the current request alone and keeps nothing afterward except the per-decision audit record. A stateful AI proxy carries session memory, caches conversation history, or stores prompts across requests in its own storage. The choice has direct consequences for horizontal scaling, blast radius under compromise, the EU AI Act Article 12 record-keeping obligation, and the DORA third-party risk profile of the inspection layer. This piece walks through the architectural distinction, what each option requires from the deployment, and where most production teams settle once the trade-offs are visible.

A stateless AI proxy holds no information about a request once the audit record commits and the response forwards. A stateful AI proxy keeps something: a conversation memory, a token cache, a prompt history, a session map. The architectural choice has immediate consequences for horizontal scaling, the inspection layer's own attack surface, and the audit profile under EU AI Act Article 12 and DORA Article 6 review. The stateless design wins on every axis that an enterprise security team cares about. The stateful design wins on a small set of product features that are usually implemented outside the inspection layer anyway.
I want to walk through what each option holds, what each option implies for the deployment, and where production teams settle once the trade-offs are visible.
What stateless means in this architecture
A stateless inspection layer evaluates each request against the identity context, the data classification outcome, the policy bundle, and the model authorization at the moment the request arrives. The decision is a pure function of the request and the policy table. The audit record commits to durable storage. The response forwards. The inspection layer holds nothing else.
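To make "pure function of the request and the policy table" concrete, here is a minimal sketch in Python. The `Request`, `Policy`, and `Decision` types and their field names are hypothetical, chosen only for illustration; they are not DeepInspect's schema.

```python
from dataclasses import dataclass
from typing import FrozenSet, Mapping


@dataclass(frozen=True)
class Request:
    identity: str        # caller identity resolved upstream (hypothetical field)
    route: str           # target model route
    classification: str  # data classification outcome for the prompt


@dataclass(frozen=True)
class Policy:
    # identity -> routes that identity may call
    authorized_routes: Mapping[str, FrozenSet[str]]
    # route -> data classifications permitted on that route
    allowed_classifications: Mapping[str, FrozenSet[str]]


@dataclass(frozen=True)
class Decision:
    allow: bool
    reason: str


def decide(request: Request, policy: Policy) -> Decision:
    """Pure function: reads only its arguments and keeps nothing between calls."""
    routes = policy.authorized_routes.get(request.identity, frozenset())
    if request.route not in routes:
        return Decision(False, "identity not authorized for route")
    allowed = policy.allowed_classifications.get(request.route, frozenset())
    if request.classification not in allowed:
        return Decision(False, "classification not permitted on route")
    return Decision(True, "policy satisfied")
```

Because `decide` touches no module-level or instance state, any node holding the same policy bundle returns the same decision for the same request, which is the property the scaling and audit arguments rely on.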
Three structural properties follow. The first is horizontal scaling without a coordination layer. Any inspection node serves any request from any tenant. A traffic spike triggers a scale-out, the new nodes serve traffic on first start, and the load balancer fans out across the pool. No leader election, no sticky sessions, no in-memory state to warm up.
The second is a clean failure mode. An inspection node that crashes mid-request loses exactly that request. The client retries with a fresh request, a different node picks it up, and the audit record commits as normal. No conversation history to reconstruct. No session state to recover. No half-committed prompt sitting in a node's local memory.
The third is a minimal attack surface. The inspection layer holds no prompt history that a compromise could exfiltrate. An attacker who breaches an inspection node gets the requests in flight on that node and the signing key for audit records, both of which are scoped problems. The attacker does not get a month of conversation history, because the inspection layer never stored one.
What stateful means in this architecture
A stateful inspection layer holds something across requests. Three common patterns appear in production.
The first is conversation memory. The inspection layer keeps a representation of the conversation so it can apply a policy that references prior turns. The pattern shows up when policies say things like "if the user has already received a refusal in this session, escalate the next decision." The cost is that the inspection layer is now a data store with conversation contents and has to be treated as one.
The second is response caching. The inspection layer keeps recent prompt-and-response pairs so it can return a cached response if an identical prompt arrives. The pattern appears in deployments that want to reduce inference cost. The cost is that the cache is a data store with prompt contents and has to satisfy the same data protection regime as the model provider.
The third is rate-limit accounting at the proxy layer. The inspection layer holds counters per identity, per route, and per time window. The accounting can be made functionally stateless by externalizing the counter store to a key-value system, in which case the inspection layer is stateless and the storage layer carries the state. Most production deployments adopt this split.
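The split between a stateless node and an external counter store can be sketched as follows. This is an illustration under assumptions: `CounterStore`, `InMemoryStore`, `window_key`, and `within_limit` are hypothetical names, a fixed-window counter is used for simplicity, and the in-memory class stands in for a real key-value service only so the example runs.

```python
import time
from typing import Dict, Optional, Protocol


class CounterStore(Protocol):
    """What the inspection node needs from the external store (assumed interface)."""

    def incr(self, key: str) -> int: ...


class InMemoryStore:
    """Stand-in for an external key-value store, used only to make the sketch runnable."""

    def __init__(self) -> None:
        self._counts: Dict[str, int] = {}

    def incr(self, key: str) -> int:
        self._counts[key] = self._counts.get(key, 0) + 1
        return self._counts[key]


def window_key(identity: str, route: str, now: float, window_seconds: int) -> str:
    # All requests inside the same fixed window share one counter key.
    bucket = int(now // window_seconds)
    return f"rl:{identity}:{route}:{bucket}"


def within_limit(store: CounterStore, identity: str, route: str, limit: int,
                 window_seconds: int = 60, now: Optional[float] = None) -> bool:
    """The node keeps no counters itself; every call goes to the shared store."""
    now = time.time() if now is None else now
    count = store.incr(window_key(identity, route, now, window_seconds))
    return count <= limit
```

Any number of inspection nodes can call `within_limit` against the same store and see one consistent count. A real store would also need the counter keys to expire once their window has passed; that detail is omitted here.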
The horizontal scaling difference
Stateless inspection layers scale by adding nodes. A 10x traffic increase needs roughly 10x the nodes, the load balancer fans out across the pool, and no inspection node carries privileged knowledge that the others lack. The deployment follows the same horizontal scaling pattern as any stateless HTTP service.
Stateful inspection layers scale through sharded state. The sharding key is usually the tenant identifier or the session identifier. A given node owns the conversation memory for a subset of sessions. When the node fails, the sessions it owned have to migrate to a different node, and the migration window is a period of degraded service for those sessions. Operators have to design for the partition: hot shards that take disproportionate traffic, sticky-session routing rules that break under DNS-level load balancing, and the rehydration cost when a session lands on a cold node.
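The migration cost can be illustrated with a small sketch. It assumes the simplest possible scheme, hashing the session identifier modulo the node count; the function names are hypothetical, and real deployments may use consistent hashing to reduce how many sessions move.

```python
import hashlib
from typing import List, Sequence


def owner(session_id: str, nodes: Sequence[str]) -> str:
    """Map a session to the node that owns its state (naive modulo sharding)."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]


def sessions_to_migrate(session_ids: Sequence[str],
                        nodes_before: Sequence[str],
                        nodes_after: Sequence[str]) -> List[str]:
    """Sessions whose owning node changes when the pool membership changes."""
    return [s for s in session_ids
            if owner(s, nodes_before) != owner(s, nodes_after)]


if __name__ == "__main__":
    sessions = [f"session-{i}" for i in range(1000)]
    before = ["node-a", "node-b", "node-c"]
    after = ["node-a", "node-c"]  # node-b has failed
    moved = sessions_to_migrate(sessions, before, after)
    print(f"{len(moved)} of {len(sessions)} sessions need rehydration")
```

Every session that lived on the failed node has to move, and with modulo sharding some sessions on healthy nodes move as well. A stateless pool has no equivalent step, because no node owns anything.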
The horizontal scaling difference compounds at production volume. A deployment that serves 10,000 requests per second on stateless inspection nodes runs on a thin pool. The same deployment on stateful inspection nodes runs on a pool sized for the worst-case shard plus the rehydration overhead.
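The sizing arithmetic for the stateless pool is short enough to write down. The sketch below uses the 500 to 2,000 requests-per-second per-node range quoted in the FAQ of this article; the `pool_size` helper and the 1.5x headroom factor are assumptions for illustration.

```python
import math


def pool_size(target_rps: float, per_node_rps: float, headroom: float = 1.0) -> int:
    """Nodes needed to serve target_rps, with headroom as a multiplier on demand."""
    if target_rps < 0 or per_node_rps <= 0 or headroom < 1.0:
        raise ValueError("expected target_rps >= 0, per_node_rps > 0, headroom >= 1")
    return math.ceil(target_rps * headroom / per_node_rps)


if __name__ == "__main__":
    for per_node in (500, 2_000):
        bare = pool_size(10_000, per_node)
        padded = pool_size(10_000, per_node, headroom=1.5)
        print(f"{per_node} rps/node: {bare} nodes bare, {padded} with 1.5x headroom")
```

A stateful pool cannot be sized this way alone, because the worst-case shard and the rehydration overhead add terms that depend on how sessions are distributed.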
The blast radius difference
A stateless inspection node compromised by an attacker exposes the requests in flight through that node and the signing key for audit records. Both are scoped problems. Each request is visible only while it passes through the node, a window of milliseconds, so exposure is limited to traffic that transits the node while the compromise lasts. The signing key is rotatable; rotation ends the attacker's ability to forge new records, and the inspection layer recovers within the rotation cycle.
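One way rotation can cut off forgery is sketched below, under stated assumptions: records are signed with HMAC-SHA256, each signature names a key identifier, and the verifier learns the commit time from the durable storage layer, not from the record itself. The `KeyRing` class and its methods are hypothetical.

```python
import hashlib
import hmac
from typing import Dict


class KeyRing:
    """Hypothetical verifier-side key ring with per-key retirement times."""

    def __init__(self) -> None:
        self._keys: Dict[str, bytes] = {}
        self._retired_at: Dict[str, float] = {}

    def add(self, key_id: str, secret: bytes) -> None:
        self._keys[key_id] = secret

    def retire(self, key_id: str, at: float) -> None:
        # Records committed at or after this time are rejected for this key.
        self._retired_at[key_id] = at

    def sign(self, key_id: str, payload: bytes) -> str:
        return hmac.new(self._keys[key_id], payload, hashlib.sha256).hexdigest()

    def verify(self, key_id: str, payload: bytes, signature: str,
               committed_at: float) -> bool:
        """committed_at is assumed to come from the storage layer, not the signer."""
        secret = self._keys.get(key_id)
        if secret is None:
            return False
        retired = self._retired_at.get(key_id)
        if retired is not None and committed_at >= retired:
            return False
        expected = hmac.new(secret, payload, hashlib.sha256).hexdigest()
        return hmac.compare_digest(expected, signature)
```

Records signed before the retirement time still verify, so the existing audit trail stays intact, while anything the attacker signs with the stolen key after rotation is rejected. The guarantee depends on the commit time being outside the attacker's control.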
A stateful inspection node compromised by an attacker exposes the conversation history sitting in the node's storage, the response cache, and any other state the node maintained. The conversation history is the prompt content, which is the data the inspection layer was deployed to protect. The breach is the breach the architecture was meant to prevent.
The DORA Article 6 operational resilience review treats the inspection layer as a third-party data store when the layer carries prompt content across requests. The third-party risk profile of a stateful inspection layer is meaningfully different from the profile of a stateless one. Security teams that run the threat model land on stateless for this reason regardless of the product features the stateful option would enable.
The audit profile difference
EU AI Act Article 12 expects records that bind a specific decision to a specific identity, policy state, and outcome. The record format is request-scoped. The stateless inspection layer produces this record by construction: every request produces one record, the record carries everything the regulator wants, and the record is independent of any other record.
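A request-scoped record of this kind might look like the sketch below. The field names are hypothetical and are not a schema prescribed by Article 12; storing a hash of the request body in place of the prompt text is likewise an assumption about the design, made so that the record binds to the request without becoming a prompt store.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AuditRecord:
    request_id: str
    identity: str         # who made the request
    route: str            # which model route was targeted
    policy_version: str   # policy state in force at decision time
    outcome: str          # e.g. "allow" or "deny"
    request_sha256: str   # digest binding the record to the exact request body
    decided_at: str       # ISO 8601 timestamp


def build_record(request_id: str, identity: str, route: str, policy_version: str,
                 outcome: str, request_body: bytes, decided_at: str) -> AuditRecord:
    digest = hashlib.sha256(request_body).hexdigest()
    return AuditRecord(request_id, identity, route, policy_version,
                       outcome, digest, decided_at)


def canonical_bytes(record: AuditRecord) -> bytes:
    """Deterministic serialization, suitable as input to a signature or hash chain."""
    return json.dumps(asdict(record), sort_keys=True,
                      separators=(",", ":")).encode("utf-8")
```

Each record stands alone: nothing in it points at another record or at state held by the node, which is what keeps the audit trail independent of the inspection tier's lifecycle.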
The stateful inspection layer can produce the same record format. The challenge is that the record now references conversation state that lives in the inspection layer's storage, and the auditor's reproducibility question becomes "what was the conversation state at decision time." The inspection layer has to preserve the conversation state as it stood at the moment of decision for the full retention period of the audit record. The storage cost compounds, and the data protection regime that applies to a conversation store now applies to the inspection layer.
Most production deployments choose stateless inspection with conversation state, when needed, carried by the calling application and passed to the inspection layer on each request. The audit record records what was passed, the inspection layer evaluates the policy against what was passed, and the conversation storage lives outside the inspection layer. The architectural separation matches the separation of duties that the audit regime expects.
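The pattern can be sketched with the refusal-escalation policy mentioned earlier. The request shape, the `conversation` field, and the `evaluate` function are hypothetical; the point is only that the history arrives on the request and the record captures a digest of what was passed.

```python
import hashlib
import json
from typing import Any, Dict


def evaluate(request: Dict[str, Any]) -> Dict[str, Any]:
    """Decide from the request alone, including the history the caller passed in."""
    history = request.get("conversation", [])
    # Example policy: escalate if any prior turn in this session was refused.
    refused_before = any(turn.get("outcome") == "refused" for turn in history)
    outcome = "escalate" if refused_before else "allow"
    # Record a digest of exactly what was passed, so the decision is reproducible.
    passed = json.dumps(history, sort_keys=True).encode("utf-8")
    return {
        "outcome": outcome,
        "turns_seen": len(history),
        "conversation_sha256": hashlib.sha256(passed).hexdigest(),
    }


if __name__ == "__main__":
    first = {"identity": "alice", "prompt": "..."}
    second = {
        "identity": "alice",
        "prompt": "...",
        "conversation": [{"turn": 1, "outcome": "refused"}],
    }
    print(evaluate(first)["outcome"], evaluate(second)["outcome"])
```

The inspection node never looks anything up about the session. If the application passes the same history again, the node reaches the same outcome and the same digest.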
Where most production teams settle
Production deployments running enterprise AI traffic through an inspection layer settle on stateless inspection nodes with externalized counter and cache storage. The pattern looks like this. Inspection nodes are stateless. Counters and rate-limit state sit in a key-value store. Response caches, when desired, sit behind a separate service that the inspection layer calls. The audit record commits to durable storage. Conversation state, when needed for policy, is carried by the calling application as a structured request field and evaluated against the policy bundle at decision time.
The deployment scales horizontally on the inspection tier without coordination. Storage tiers (counters, audit) scale on their own primitives. The threat model for the inspection layer is small and well understood. The DORA Article 6 review treats the inspection layer as a stateless control with clear failure modes. The EU AI Act Article 12 audit records are request-scoped and reproducible from the request and the policy table at decision time.
DeepInspect
This is the architecture DeepInspect runs. DeepInspect is a stateless inspection layer that sits between calling applications and any LLM endpoint over HTTP. Each request carries identity context, route identifier, and prompt content. The inspection layer evaluates the policy bundle against the request, commits the per-decision audit record to durable storage with cryptographic integrity, and forwards the request to the model. The inspection node holds nothing else.
Counters, rate limits, and audit storage live on their own scaling primitives. Conversation state, when needed, is carried by the calling application and evaluated at decision time, not held inside the inspection layer. The DORA third-party risk profile, the EU AI Act Article 12 audit profile, and the horizontal scaling envelope all benefit from the stateless choice.
If you are evaluating an AI inspection layer and the architecture diagram shows session memory inside the inspection nodes, let's talk about why we made the opposite call.
Frequently asked questions
- Can a stateless AI proxy enforce policies that depend on conversation history?
Yes, when the conversation history is carried by the calling application and passed to the inspection layer as a structured field on each request. The application maintains the conversation state (which it usually does already, because the model needs the history for response coherence) and includes the relevant fields on the request that the inspection layer needs to evaluate the policy. The inspection layer remains stateless and the policy can still reference prior turns. The audit record captures what was passed, which makes the decision reproducible. The pattern is the same one that JWT-based authentication uses: carry the context on the request, evaluate it server-side, hold no session state on the server.
- Why does the DORA Article 6 review treat a stateful inspection layer differently?
DORA Article 6 sets out the ICT risk management framework that a financial entity must maintain; the third-party ICT risk obligations themselves sit in Chapter V of the regulation, starting at Article 28. A review under that framework treats the inspection layer as a third-party data store when the layer carries data across requests. That treatment brings additional obligations into scope: registers of contractual arrangements, exit strategies, concentration risk assessment, and incident reporting. A stateless inspection layer is treated as a control rather than a data store, which produces a lighter third-party risk profile. The difference shows up in the financial entity's annual filing and in the regulator's supervisory review, so the architectural choice has direct consequences for the customer's compliance posture.
- What happens to rate limits and counters in a stateless inspection layer?
Counters and rate limits live in an externalized key-value store that the inspection layer reads and writes on each request. The store is sharded by the counter key (tenant, route, identity) and serves the inspection tier without becoming a single point of failure. Production deployments use Redis, DynamoDB, or a similar key-value primitive sized to the request rate. The inspection node remains stateless and the counter state lives where it can be scaled, replicated, and backed up on its own primitives. The audit record stamps the rate-limit state at decision time so the auditor can reconstruct the policy outcome.
- What does horizontal scaling look like for a stateless inspection layer at 10,000 requests per second?
The inspection tier runs as a stateless HTTP service behind a load balancer. The pool size is determined by the per-node throughput and the headroom factor the operator chooses. A typical node serves 500 to 2,000 requests per second depending on the policy complexity, so 10,000 RPS needs 5 to 20 nodes at full utilization; with headroom, a deployment runs on roughly 10 to 30 inspection nodes in the steady state. Autoscaling rules add or remove nodes based on per-node CPU and latency. New nodes serve traffic immediately on start because no warm-up state is required. A bad deployment is rolled back by deploying the previous version to the pool. The horizontal scaling pattern is the same as a stateless API gateway.
- How does the stateless choice affect the audit record retention obligation?
The audit record is request-scoped and carries everything the regulator needs to reconstruct the decision. Retention applies to the audit record itself, not to any side state. The inspection layer's storage cost for retention is the cost of the audit log volume, which is bounded by the request rate and the retention period. A stateful inspection layer that held conversation history would have to retain the conversation history for as long as any audit record referenced it, which compounds the storage cost. The stateless choice keeps the retention envelope predictable and bounded.
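The bound on audit log volume is a single multiplication. In the sketch below, the 1 KiB record size, the 180-day retention period, and the assumption of a sustained 10,000 requests per second are illustrative figures, not numbers from any regulation or deployment, and `retention_bytes` is a hypothetical helper.

```python
SECONDS_PER_DAY = 86_400


def retention_bytes(requests_per_second: int, record_bytes: int,
                    retention_days: int) -> int:
    """Upper bound on audit log volume held at any time, before compression."""
    return requests_per_second * record_bytes * retention_days * SECONDS_PER_DAY


if __name__ == "__main__":
    total = retention_bytes(10_000, 1_024, 180)
    print(f"{total / 1e12:.1f} TB retained at steady state")
```

The figure scales linearly in each input and in nothing else, which is what makes the retention envelope predictable. A stateful design adds a conversation store whose size depends on session length and on how long any record references it.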