AI Gateway Cache Invalidation: When a Cached Prompt Response Becomes a Data Leak

AI gateways cache prompt responses to cut cost and latency. The cache lookup uses a hash of the prompt as the key, which means two callers with different authorization scopes can hit the same cache entry. This piece walks through the failure mode, the identity-scoped cache-key patterns that avoid it, and the inspection-layer architecture that makes cache lookup safe.

By Parminder Singh · Founder & CEO, DeepInspect Inc.
Platform & Architecture · ai-gateway, caching, ai-engineering, data-leak, ai-security
An AI gateway caches prompt responses to cut cost and latency. The typical cache lookup uses a hash of the prompt (and often the model name and parameters) as the key. Two callers who send the same prompt hit the same cache entry. When the two callers have different authorization scopes, the caller with less scope receives the response that was generated for the caller with more scope. The pattern is a data leak, and it is invisible to the model provider because the model was never called on the second request.

The failure mode has been in production AI gateways since 2023 and is one of the more common findings in enterprise AI security reviews. The fix is a cache-key construction that includes identity and scope, and an inspection-layer architecture that runs the identity check before the cache lookup, not after.

I want to walk through the failure mode, the cache-key patterns that avoid it, and the inspection-layer architecture that makes cache lookup safe under an EU AI Act Article 12 (record-keeping) audit.

Failure mode

Two callers, Alice and Bob, both use the same AI gateway. Alice has authorization to query the internal HR-policy corpus through a retrieval-augmented tool. Bob has authorization only to query the public product catalog. Both send the prompt "what is our current parental leave policy?" to the gateway.

Alice's request runs through the retrieval tool. The retrieved HR-policy content is prepended to the prompt as context. The augmented prompt is "[HR policy retrieval context] + what is our current parental leave policy?". The model produces a response that quotes the policy. The gateway caches the response under a key derived from the prompt.

Bob's request runs through the same gateway. The retrieval tool call is skipped because Bob is not authorized. Bob's prompt reaches the gateway as "what is our current parental leave policy?". The gateway's cache-key logic derives the key from Bob's prompt. If the cache-key logic operates on the raw user prompt (before augmentation), the derived key matches Alice's cache entry, and Bob receives Alice's response. The internal HR policy leaks to Bob.
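
The leak can be reproduced in a few lines. This is a minimal sketch, not any real gateway's code: `handle`, `fake_model`, and the `may_use_hr_corpus` flag are hypothetical names, and the model call is a stub.

```python
import hashlib

cache = {}

def raw_prompt_key(prompt: str) -> str:
    # Unsafe on purpose: the key depends on the raw user prompt only.
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

def fake_model(augmented_prompt: str) -> str:
    # Stand-in for a model call (hypothetical).
    if "[HR policy retrieval context]" in augmented_prompt:
        return "Internal HR policy: (policy text)"
    return "I do not have access to that information."

def handle(caller: str, prompt: str, may_use_hr_corpus: bool) -> str:
    # Note that `caller` never reaches the key. That omission is the bug.
    key = raw_prompt_key(prompt)
    if key in cache:
        return cache[key]
    augmented = prompt
    if may_use_hr_corpus:
        augmented = "[HR policy retrieval context] " + prompt
    response = fake_model(augmented)
    cache[key] = response
    return response

prompt = "what is our current parental leave policy?"
alice_response = handle("alice", prompt, may_use_hr_corpus=True)
bob_response = handle("bob", prompt, may_use_hr_corpus=False)
```

In this sketch Bob's call returns the entry that Alice's call stored, and the stubbed model is never invoked for Bob.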

A second variant leaks even without retrieval augmentation. It occurs when the gateway routes the same prompt to different models based on each caller's authorization. Alice's prompt goes to a fine-tuned internal model. Bob's prompt goes to a public model. If the cache key is derived from the prompt alone and omits the routed model, Bob's cache lookup can hit the response from Alice's fine-tuned model.
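
A short sketch of the routing variant, under stated assumptions: the scope string `internal-data:read` and the model names are made up for illustration, and `route_model` stands in for whatever routing logic a gateway actually uses.

```python
import hashlib
import json

def route_model(caller_scopes: set) -> str:
    # Hypothetical routing rule based on the caller's authorization.
    if "internal-data:read" in caller_scopes:
        return "internal-finetuned"
    return "public-general"

def prompt_only_key(prompt: str) -> str:
    # Omits the routed model, so both callers collide.
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

def routed_key(prompt: str, model: str) -> str:
    # Includes the routed model, so the two routes get separate entries.
    payload = json.dumps({"model": model, "prompt": prompt}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

prompt = "summarize the Q3 roadmap"
alice_model = route_model({"internal-data:read"})
bob_model = route_model(set())

collides = prompt_only_key(prompt) == prompt_only_key(prompt)
separated = routed_key(prompt, alice_model) != routed_key(prompt, bob_model)
```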

Cache-key construction

The safe cache-key construction includes every input that affects the response the model would produce. The minimum set is prompt, model, model version, and parameters. The safe set adds identity scope and augmentation state.

Naive cache key (unsafe)

This key ignores parameters, augmentation, and identity. The cache leaks across every difference in those dimensions.
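
The original listing for this key is not available here. A minimal sketch of what such a naive key might look like, assuming it hashes only the model name and the prompt (the function name and the exact inputs are assumptions):

```python
import hashlib

def naive_cache_key(model: str, prompt: str) -> str:
    # Only the model name and the prompt feed the hash. Parameters,
    # augmentation state, and caller identity are all absent.
    material = f"{model}\n{prompt}".encode("utf-8")
    return hashlib.sha256(material).hexdigest()

# Two different callers, or two different temperatures, cannot be told
# apart because neither is an input to the key.
key_for_alice = naive_cache_key("example-model", "what is our current parental leave policy?")
key_for_bob = naive_cache_key("example-model", "what is our current parental leave policy?")
```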

Parameter-aware cache key (still unsafe)

This key covers the model call parameters. It still leaks across identity and augmentation.
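
The original listing is not available here either. As a sketch under the same assumptions, a parameter-aware key could serialize the prompt, model, model version, and parameters canonically before hashing. The field names below are illustrative.

```python
import hashlib
import json

def parameter_aware_cache_key(prompt: str, model: str, model_version: str, params: dict) -> str:
    # Canonical JSON (sorted keys, fixed separators) so that parameter
    # ordering does not change the key.
    payload = json.dumps(
        {"prompt": prompt, "model": model, "model_version": model_version, "params": params},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

warm = parameter_aware_cache_key("hello", "example-model", "v1", {"temperature": 0.7, "top_p": 1.0})
cold = parameter_aware_cache_key("hello", "example-model", "v1", {"temperature": 0.0, "top_p": 1.0})
reordered = parameter_aware_cache_key("hello", "example-model", "v1", {"top_p": 1.0, "temperature": 0.7})
```

Different parameters now produce different keys, but nothing about the caller or the augmentation path is in the payload yet.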

Scope-scoped cache key (safe)

The scope component captures the identity's authorization. Two callers with different scopes derive different scope hashes and land in different cache buckets. The augmentation component captures the retrieval context and tool-call state so a request that runs through augmentation and a request that skips augmentation land in different buckets.
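
With the original listing unavailable, the following is a minimal sketch of a key with scope and augmentation components. The helper names (`digest`, `scope_hash`, `scoped_cache_key`), the scope values, and the shape of the augmentation input are all assumptions made for illustration.

```python
import hashlib
import json

def digest(value) -> str:
    # Canonical JSON keeps the hash stable across dict ordering.
    canonical = json.dumps(value, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def scope_hash(roles, groups, data_scopes) -> str:
    # Sorting and de-duplicating makes the hash independent of the order
    # in which the identity provider lists roles, groups, and scopes.
    return digest({
        "roles": sorted(set(roles)),
        "groups": sorted(set(groups)),
        "data_scopes": sorted(set(data_scopes)),
    })

def scoped_cache_key(prompt, model, model_version, params, scope, augmentation) -> str:
    return digest({
        "prompt": prompt,
        "model": model,
        "model_version": model_version,
        "params": params,
        "scope": scope,
        "augmentation": augmentation,
    })

common = dict(
    prompt="what is our current parental leave policy?",
    model="example-model",
    model_version="v1",
    params={"temperature": 0.0},
)
alice_key = scoped_cache_key(
    **common,
    scope=scope_hash(["hr"], ["staff"], ["hr-policy"]),
    augmentation=digest({"tools": ["hr_policy_retrieval"], "context": "retrieved HR text"}),
)
bob_key = scoped_cache_key(
    **common,
    scope=scope_hash(["sales"], ["staff"], ["public-catalog"]),
    augmentation=digest({"tools": [], "context": ""}),
)
```

Two callers with identical roles, groups, and data scopes still share a bucket, which is what preserves part of the hit rate.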

Inspection-layer architecture

The safe cache lookup runs after the identity has been resolved and the policy has been evaluated, not before. Four architectural properties make the cache safe.

Identity resolution runs before the cache lookup

The inspection layer resolves the caller's identity (subject, roles, groups, scopes) before the cache lookup runs. The scope hash is available at cache-key derivation time. A cache lookup that runs before identity resolution loses the scope input and produces a leaky key.
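
The ordering can be sketched as follows. The in-memory `DIRECTORY`, the bearer tokens, and the function names are hypothetical stand-ins for a real identity provider and gateway pipeline.

```python
import hashlib
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class Identity:
    subject: str
    roles: tuple
    groups: tuple
    scopes: tuple

# Hypothetical token-to-identity directory standing in for an identity provider.
DIRECTORY = {
    "token-alice": Identity("alice", ("hr",), ("staff",), ("hr-policy",)),
    "token-bob": Identity("bob", ("sales",), ("staff",), ("public-catalog",)),
}

def resolve_identity(token: str) -> Identity:
    identity = DIRECTORY.get(token)
    if identity is None:
        raise PermissionError("unknown caller")
    return identity

def cache_key(identity: Identity, prompt: str) -> str:
    payload = json.dumps(
        {
            "prompt": prompt,
            "roles": sorted(identity.roles),
            "groups": sorted(identity.groups),
            "scopes": sorted(identity.scopes),
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def lookup(cache: dict, token: str, prompt: str):
    identity = resolve_identity(token)   # step 1: identity first
    key = cache_key(identity, prompt)    # step 2: key includes scope
    return identity, key, cache.get(key) # step 3: only now touch the cache
```

An unresolved caller fails at step 1, so an unauthenticated request never reaches the cache at all.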

Retrieval augmentation runs in the augmentation phase

The inspection layer decides whether the request runs through retrieval augmentation based on the identity's authorization. Alice's request runs through the HR-policy retrieval tool. Bob's request skips it. The cache key incorporates the augmentation state, so the two requests land in different buckets even when their raw prompts match.
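
One way to represent the augmentation state is as a small record that can be fed into the cache key. This sketch assumes a scope string `hr-policy:read`, a toy in-memory corpus, and a tool name `hr_policy_retrieval`, none of which come from a real system.

```python
import hashlib

# Toy corpus standing in for the HR-policy retrieval backend.
HR_CORPUS = {"parental leave": "HR policy excerpt (placeholder text)"}

def plan_augmentation(scopes: frozenset, prompt: str) -> dict:
    # Unauthorized callers skip retrieval entirely.
    if "hr-policy:read" not in scopes:
        return {"tools": [], "context_digest": None}
    context = "\n".join(
        text for topic, text in sorted(HR_CORPUS.items()) if topic in prompt.lower()
    )
    return {
        "tools": ["hr_policy_retrieval"],
        # Hash of the retrieved context, so a change in the corpus
        # also changes the augmentation state.
        "context_digest": hashlib.sha256(context.encode("utf-8")).hexdigest(),
    }

prompt = "what is our current parental leave policy?"
alice_state = plan_augmentation(frozenset({"hr-policy:read"}), prompt)
bob_state = plan_augmentation(frozenset({"catalog:read"}), prompt)
```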

Policy evaluation runs before the response is served from cache

A response served from cache is still subject to the response-side policy. The inspection layer runs the output classification and applies redaction to the cached response before returning it. The cached response is not exempt from response-side policy just because the model was not called.
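
A minimal sketch of running the same response-side policy on hits and misses. The redaction rule (masking email addresses unless the caller holds a hypothetical `pii:read` scope) is only an illustration of an output policy, not a complete classifier.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def apply_response_policy(text: str, caller_scopes: set) -> str:
    # Illustrative rule: callers without "pii:read" get emails masked.
    if "pii:read" not in caller_scopes:
        return EMAIL.sub("[REDACTED]", text)
    return text

def serve(cache: dict, key: str, caller_scopes: set, call_model) -> str:
    response = cache.get(key)
    if response is None:
        response = call_model()
        cache[key] = response
    # Hits and misses leave through the same policy check.
    return apply_response_policy(response, caller_scopes)

cache = {"k": "Contact jane.doe@example.com for details."}
redacted = serve(cache, "k", set(), lambda: "unused")
unredacted = serve(cache, "k", {"pii:read"}, lambda: "unused")
```

The cache stores the unredacted text in this sketch, and the policy is applied per delivery, so the same entry can be served at different redaction levels.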

Cache invalidation is policy-driven

The cache invalidates on policy changes that could affect the response the identity is authorized to see. A change to the HR-policy retrieval scope invalidates the cache entries that carry the HR-policy augmentation. The invalidation is triggered by the policy change event itself and does not wait for a TTL to expire.
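
Event-driven invalidation can be sketched with tagged entries. The `TaggedCache` class and the tag strings are hypothetical; a production cache would more likely use its store's native tagging or key-prefix features.

```python
from collections import defaultdict

class TaggedCache:
    def __init__(self):
        self._entries = {}
        self._keys_by_tag = defaultdict(set)

    def put(self, key: str, response: str, tags) -> None:
        self._entries[key] = response
        for tag in tags:
            self._keys_by_tag[tag].add(key)

    def get(self, key: str):
        return self._entries.get(key)

    def on_policy_change(self, affected_tag: str) -> int:
        # Called from the policy change event. Removes every entry that
        # was stored with the affected tag and returns the count.
        keys = self._keys_by_tag.pop(affected_tag, set())
        removed = 0
        for key in keys:
            if self._entries.pop(key, None) is not None:
                removed += 1
        return removed

tagged = TaggedCache()
tagged.put("k1", "HR answer", tags=["augmentation:hr-policy"])
tagged.put("k2", "catalog answer", tags=["augmentation:catalog"])
removed_count = tagged.on_policy_change("augmentation:hr-policy")
```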

Compliance implications

The cache-leak failure mode has Article 12 implications. The audit log for Bob's cached response has to reflect that the response was served from cache, that the response content was generated in response to a different request, and that the response has been evaluated against Bob's scope on delivery.

Article 19's identity requirement means the log has to identify the natural person involved. For a cache hit, the log records Bob's identity as the caller and Alice's identity as the response originator. The two-identity record shape lets the auditor reconstruct the cache hit and verify that Bob's scope authorized the delivery.
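
The two-identity record could take a shape like the one below. The field names are assumptions for illustration and do not reflect any particular product's log schema.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class CacheHitAuditRecord:
    event: str
    caller_subject: str          # who received the response
    originator_subject: str      # whose request generated it
    cache_key: str
    served_from_cache: bool
    delivery_policy_decision: str
    timestamp: str

def cache_hit_record(caller: str, originator: str, cache_key: str, decision: str) -> CacheHitAuditRecord:
    return CacheHitAuditRecord(
        event="cache_hit",
        caller_subject=caller,
        originator_subject=originator,
        cache_key=cache_key,
        served_from_cache=True,
        delivery_policy_decision=decision,
        timestamp=datetime.now(timezone.utc).isoformat(),
    )

record = cache_hit_record("bob", "alice", "example-key", "allow")
log_line = json.dumps(asdict(record), sort_keys=True)
```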

The GDPR data-minimization principle applies to the cache. A response that contains personal data has to be delivered only to the identities authorized for that data. The scope-scoped cache key is the enforcement mechanism. A cache key that omits scope violates the principle by construction.

DeepInspect

This is exactly what DeepInspect does. DeepInspect sits inline between your users or agents and the LLM APIs they call. Identity resolution runs at the inspection layer before any cache lookup. The scope-scoped cache key uses the identity's roles, groups, and data scopes as inputs. Policy evaluation runs on cache hits, and cache invalidation triggers on policy changes.

The cache is a cost-and-latency optimization that operates under the same enforcement layer as every other AI request. The audit record for a cache hit carries the caller's identity, the originator's identity, and the response-side policy evaluation.

Frequently asked questions

Does scope-scoped caching hurt the cache hit rate significantly?

The hit rate drops relative to a naive prompt-only cache, and the size of the drop varies by workload. Deployments with a small number of scope classes and high prompt reuse within each scope retain most of the hit rate. Deployments with many fine-grained scope classes see a larger drop. The tradeoff exchanges hit rate for safety, and safety should be the default.

What TTL should apply to cached AI responses?

The TTL depends on the content class. Public reference content tolerates hours to days. Internal-policy content tolerates minutes to an hour. Personal-data-derived content tolerates seconds or nothing at all. The TTL is a policy configuration, and the inspection layer enforces it as a cache attribute.
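
As a sketch of TTL as policy configuration: the class names and the specific durations below are illustrative picks from within the ranges above, not recommended values, and `TtlCache` is a hypothetical in-memory stand-in.

```python
import time

# Illustrative values only, chosen from within the ranges described above.
TTL_SECONDS = {
    "public_reference": 24 * 3600,
    "internal_policy": 15 * 60,
    "personal_data": 0,  # zero means the response is never cached
}

class TtlCache:
    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}

    def put(self, key: str, response: str, content_class: str) -> None:
        ttl = TTL_SECONDS[content_class]
        if ttl <= 0:
            return
        self._entries[key] = (response, self._clock() + ttl)

    def get(self, key: str):
        entry = self._entries.get(key)
        if entry is None:
            return None
        response, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[key]
            return None
        return response
```

The injectable `clock` is there so expiry can be exercised in tests without sleeping.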

Does the cache still hold personal data?

Yes, if the responses contain personal data. The cache is subject to the same retention rules as any other personal-data store. The GDPR data-minimization principle and the applicable retention rules apply to the cache the same way they apply to the audit log. The cache storage layer has to satisfy the same access controls as the primary data store.

How does streaming affect the cache?

Streaming responses are cached after the stream closes. The full response text is the cached artifact. Partial streams are not cached. A stream that closes on client disconnect before completion is discarded, not cached.
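
The cache-on-close behaviour can be sketched with a generator that buffers chunks and writes the cache only after the upstream stream is exhausted. The function name and the `client_connected` callback are assumptions for illustration.

```python
def stream_and_cache(chunks, cache: dict, key: str, client_connected):
    buffer = []
    for chunk in chunks:
        if not client_connected():
            # Client went away mid-stream: drop the partial response.
            return
        buffer.append(chunk)
        yield chunk
    # Reached only when the upstream stream completed normally.
    cache[key] = "".join(buffer)

complete_cache = {}
delivered = list(stream_and_cache(iter(["Hello, ", "world"]), complete_cache, "k", lambda: True))

partial_cache = {}
stream = stream_and_cache(iter(["Hello, ", "world"]), partial_cache, "k", lambda: True)
first_chunk = next(stream)
stream.close()  # consumer stops early, so nothing is written to the cache
```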

Can the cache be shared across models?

No. A cache entry is reusable only when the model identifier, version, and parameters all match. A response generated by GPT-4o at temperature 0.7 is not interchangeable with a response from Claude 3.5 Sonnet at temperature 0.0. The cache key includes the full model identifier and parameters, so cross-model sharing is impossible by construction.

What logs does the cache emit?

The cache emits cache-hit and cache-miss events into the audit stream. A cache-hit event carries the current caller's identity, the originator's identity, the cache-key components, and the policy state applied on delivery. A cache-miss event indicates that the request was forwarded to the model. Both event types feed the SIEM alongside the model-call audit records.