Why not just rely on the vector store's filter for access control?

The vector store filter is the first layer. It works well when the access-control model is simple and the filter is configured correctly. But production deployments accumulate authorization dimensions over time, and the filter falls behind. A second layer at the gateway catches what the first layer misses. It also produces an audit trail of which chunks were redacted, which the vector store filter does not produce on its own.

What is the latency cost of chunk-level redaction at the gateway?

The redaction runs on the request body before it is forwarded to the LLM. The per-chunk check is a small policy evaluation that takes single-digit milliseconds at most, so even if ten chunks are checked one after another, the worst case is tens of milliseconds. That cost is small relative to LLM inference time, which is typically hundreds of milliseconds to seconds. In practice, production deployments measure the marginal latency on the assembled-prompt path in single-digit milliseconds.

Can the model still leak data if the redaction happens after the chunks are in the prompt?

The redaction replaces the disallowed content in the prompt before the prompt reaches the model. The model never sees the redacted span, only the redaction marker left in its place. The model's response cannot echo content that was not in the input, so the leak surface is closed at the boundary.

How do you handle re-ranking and hybrid retrieval shapes?

The architecture is the same regardless of the retrieval shape. Whether the chunks come from a single dense vector store, a hybrid dense-and-sparse retriever, or a multi-stage re-ranker, the chunks arrive at the gateway as part of the assembled prompt with metadata attached. The gateway runs the policy against the metadata. The complexity of the retrieval pipeline is upstream of the gateway and does not change the redaction logic.

Does this work for streaming RAG where chunks arrive over time?

Some applications stream chunks into the prompt as they are retrieved. The architecture is similar. The gateway buffers the chunks as they arrive, applies the policy to each one, and forwards the redacted prompt when the assembly is complete. The streaming case adds chunk-level state to the gateway but does not change the policy model.
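The buffering behavior can be sketched in a few lines. This is a minimal illustration, not a real gateway API: the `StreamingContextBuffer` class, the `[REDACTED]` marker, and the single tenant_id rule are all assumptions made for the example.

```python
class StreamingContextBuffer:
    """Hypothetical gateway-side buffer for chunks that arrive over time."""

    def __init__(self, user_tenant_id):
        self.user_tenant_id = user_tenant_id
        self._parts = []      # chunk texts (or markers) in arrival order
        self.decisions = []   # (chunk_id, decision) pairs for the audit trail

    def add_chunk(self, chunk_id, text, metadata):
        # The policy runs on each chunk as it arrives (assumed rule: tenant match).
        allowed = metadata.get("tenant_id") == self.user_tenant_id
        self.decisions.append((chunk_id, "permit" if allowed else "redact"))
        self._parts.append(text if allowed else "[REDACTED]")

    def finish(self, question):
        # Nothing is forwarded until assembly is complete.
        context = "\n\n".join(self._parts)
        return f"Context:\n{context}\n\nQuestion: {question}"


buffer = StreamingContextBuffer(user_tenant_id="acme")
buffer.add_chunk("c1", "Acme refund window is 30 days.", {"tenant_id": "acme"})
buffer.add_chunk("c2", "Globex contract terms.", {"tenant_id": "globex"})
prompt = buffer.finish("What is the refund window?")
```

The only state the streaming case adds is the list of buffered parts and decisions. The policy check itself is the same one a non-streaming request would get.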

AI Gateway Redaction for RAG Contexts: Stopping Cross-Tenant Data Leakage

Retrieval-augmented generation is the production pattern for grounding LLM responses in proprietary data. The application takes the user's question, queries a vector store, retrieves the top-k most relevant chunks, concatenates them into the prompt context, and sends the assembled prompt to the LLM. The pattern works well when every retrieved chunk is authorized for the requesting user. The pattern leaks data when retrieval pulls chunks the user was not entitled to see.

The model has no way to distinguish authorized chunks from leaked chunks. Once the data is in the context, it can surface in the response.

I want to walk through how RAG pipelines leak across tenants and roles, where the redaction has to happen architecturally, and what the gateway-side enforcement pattern produces in the audit trail.

How RAG context assembly works

The application embeds the user's question, queries the vector store for the nearest-neighbor chunks, and assembles the prompt.
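A minimal sketch of that flow follows. It uses a toy in-memory store so it runs offline. `ToyVectorStore`, `embed`, and `assemble_prompt` are hypothetical names for illustration, and the vowel-count embedding is a placeholder for a real embedding model.

```python
from __future__ import annotations

import math
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    metadata: dict[str, str] = field(default_factory=dict)


def embed(text: str) -> list[float]:
    # Placeholder embedding (vowel counts) so the sketch runs offline.
    # A real application would call its embedding model here.
    return [float(text.lower().count(v)) for v in "aeiou"]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """Stand-in for a real vector store; the query signature is an assumption."""

    def __init__(self, chunks: list[Chunk]):
        self._rows = [(embed(c.text), c) for c in chunks]

    def query(self, vector, top_k, filter=None):
        # Keep only chunks whose metadata matches every filter key, then rank.
        rows = [
            (vec, c) for vec, c in self._rows
            if filter is None
            or all(c.metadata.get(k) == v for k, v in filter.items())
        ]
        rows.sort(key=lambda row: cosine(vector, row[0]), reverse=True)
        return [c for _, c in rows[:top_k]]


def assemble_prompt(question: str, store: ToyVectorStore, tenant_id: str, top_k: int = 3) -> str:
    # The filter argument restricts retrieval to the caller's tenant.
    chunks = store.query(embed(question), top_k=top_k, filter={"tenant_id": tenant_id})
    context = "\n\n".join(c.text for c in chunks)
    return f"Context:\n{context}\n\nQuestion: {question}"
```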

The filter argument on the vector store query is the access-control hook. If the filter is correct, only chunks the user is authorized to see come back. If the filter is missing, wrong, or bypassed, chunks from other tenants come back and end up in the context.

The pattern looks safe. In production it fails in specific ways.

Where RAG pipelines leak

Cross-tenant and cross-role leakage in RAG is usually an integration failure, not a model failure.

The filter is set on the wrong field

A vector store can index chunks with metadata fields like tenant_id, customer_id, case_id, and role. The filter has to match the authorization model the application uses. A filter on tenant_id is correct for a tenant-isolation model and incorrect for a per-case access control model where users in the same tenant have different case-level permissions. Production deployments routinely add new authorization dimensions and forget to update the retrieval filter.

The filter is applied per query but the index allows traversal

Some vector stores apply the filter as a post-query mask rather than an index-level constraint. The query selects the top-k chunks across all data first and only then masks the disallowed ones. The application normally sees just the filtered results, but the underlying retrieval has already touched every tenant's data. If the mask is misconfigured or skipped, the disallowed chunks are returned and the data leaks.
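The difference between the two filter placements can be shown with a toy ranking over pre-scored rows. This is an illustrative sketch only: the function names and the `mask_enabled` flag are assumptions, not the behavior of any particular vector store.

```python
def top_k(rows, k):
    # Rank by similarity score, highest first.
    return sorted(rows, key=lambda r: r["score"], reverse=True)[:k]


def query_index_level(rows, k, tenant_id):
    # Constraint applied before ranking: other tenants are never candidates.
    allowed = [r for r in rows if r["tenant_id"] == tenant_id]
    return top_k(allowed, k)


def query_post_mask(rows, k, tenant_id, mask_enabled=True):
    # Ranking runs over every tenant's rows; the mask is a separate later step.
    hits = top_k(rows, k)
    if not mask_enabled:  # simulates a misconfigured or skipped mask
        return hits
    return [r for r in hits if r["tenant_id"] == tenant_id]


rows = [
    {"id": "a1", "tenant_id": "acme", "score": 0.90},
    {"id": "g1", "tenant_id": "globex", "score": 0.95},
    {"id": "g2", "tenant_id": "globex", "score": 0.92},
    {"id": "a2", "tenant_id": "acme", "score": 0.50},
]

safe = query_index_level(rows, 2, "acme")                       # a1, a2
masked = query_post_mask(rows, 2, "acme")                       # empty: top-2 were globex
leaked = query_post_mask(rows, 2, "acme", mask_enabled=False)   # g1, g2
```

In this toy data the post-mask variant also returns fewer than k results when it works, because the slots were consumed by rows that were masked afterwards.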

The chunks themselves are improperly tagged

Documents ingested into the vector store carry metadata derived from the source. A document that came from a customer-facing folder gets tagged with the customer's tenant. A document that contains both customer-facing text and internal-only text gets a single tag, and whichever portion the tag does not describe leaks. The metadata model has to match the access-control granularity, and the ingestion pipeline has to enforce it. In practice, ingestion often runs upstream of any access-control review, so mis-tagged chunks are indexed before anyone checks them.

The retrieval bypasses are not audited

The vector store query that the application runs is invisible to the access-control logs. A user who triggered a query that fetched cross-tenant chunks does not show up in the audit trail of the access-control system. The leak goes undetected until a manual review.

Where the redaction has to happen

The architectural position for the redaction is at the boundary between context assembly and the LLM call. The gateway sees the assembled prompt as part of the LLM request body. Each retrieved chunk can be tagged with metadata, the gateway can match the chunk metadata against the user's identity, and the gateway can redact or block before the prompt reaches the model.

The architectural alternative is to move the access control entirely into the vector store and the ingestion pipeline. That works for some deployments and fails for others. The gateway-side redaction adds a second layer that catches the failures of the first layer.

The chunk metadata has to be carried into the request

The application has to forward the chunk metadata as part of the LLM request, either as a structured field the gateway can parse or as inline markers in the assembled prompt. The shape varies by gateway. A common pattern is a JSON sidecar on the request that lists each chunk with its source document, its metadata tags, and its position in the assembled prompt.

The gateway parses the sidecar, checks each chunk against the user's authorization, and redacts the chunks that do not match.
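One possible shape for the sidecar is sketched below. The field names (`rag_context`, `chunk_id`, `start`, `end`) and the tenant-only check are assumptions for illustration, since the actual format varies by gateway.

```python
import json


def build_request(question, chunks):
    """Assemble the prompt and record where each chunk landed in it.

    chunks: list of dicts with 'id', 'source', 'text', and 'metadata' keys.
    """
    prompt = "Context:\n"
    sidecar = []
    for chunk in chunks:
        start = len(prompt)
        prompt += chunk["text"]
        sidecar.append({
            "chunk_id": chunk["id"],
            "source": chunk["source"],
            "metadata": chunk["metadata"],
            "start": start,        # character offsets into the prompt
            "end": len(prompt),
        })
        prompt += "\n\n"
    prompt += f"Question: {question}"
    return {"prompt": prompt, "rag_context": sidecar}


def unauthorized_chunks(request, user):
    # Gateway side: compare each chunk's metadata with the caller's identity.
    return [
        entry["chunk_id"]
        for entry in request["rag_context"]
        if entry["metadata"].get("tenant_id") != user["tenant_id"]
    ]


request = build_request("What is the refund window?", [
    {"id": "c1", "source": "acme/faq.md", "text": "Refunds within 30 days.",
     "metadata": {"tenant_id": "acme"}},
    {"id": "c2", "source": "globex/contract.pdf", "text": "Globex pricing terms.",
     "metadata": {"tenant_id": "globex"}},
])
wire_body = json.dumps(request)  # what the application would send to the gateway
```

Carrying offsets rather than inline markers keeps the prompt text unchanged for the model, at the cost of requiring the application to compute the positions correctly.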

The redaction operates on the prompt body

A chunk that fails the policy check has its span in the assembled prompt replaced with a redaction marker before the request reaches the model. The model sees a prompt from which the disallowed content has been removed, and the response cannot echo what is not in the prompt.
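As a rough sketch, assuming the gateway knows the start and end character offsets of each failing chunk, the replacement can be done like this. The `[REDACTED]` marker text and the `redact_spans` helper are illustrative choices, not a fixed convention.

```python
REDACTION_MARKER = "[REDACTED]"


def redact_spans(prompt, spans):
    """Replace each (start, end) span in the prompt with the marker.

    Spans are processed from last to first so that earlier offsets
    stay valid while the string length changes.
    """
    for start, end in sorted(spans, reverse=True):
        prompt = prompt[:start] + REDACTION_MARKER + prompt[end:]
    return prompt


prompt = "A: alpha\nB: beta\nC: gamma"
first = prompt.index("alpha")
last = prompt.index("gamma")
redacted = redact_spans(prompt, [(first, first + 5), (last, last + 5)])
```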

The decisions go into the audit trail

The chunks the gateway redacted, the chunks it permitted, and the policy version that produced the decision go into the per-decision audit record. A regulator or an internal investigator can reconstruct which chunks made it into the model context for a specific request.

The policy shape for RAG chunk redaction

The policy at the chunk layer is evaluated against three inputs: the user's identity, the chunk's metadata, and the deployment's data classification model.

The policy is composable. A chunk has to satisfy every rule that applies to it. A failure mode of "redact" replaces the chunk with a marker. A failure mode of "deny_request" rejects the entire LLM request and returns a policy denial to the application.
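A composable policy with those two failure modes might be modeled as follows. The `Rule` structure, the two example rules, and the "restricted" classification are hypothetical and only illustrate how rules combine.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    name: str
    applies: Callable[[dict], bool]      # does the rule apply to this chunk?
    check: Callable[[dict, dict], bool]  # (user, chunk_metadata) -> passes?
    on_fail: str                         # "redact" or "deny_request"


# Hypothetical rules for illustration.
RULES = [
    Rule(
        name="tenant_isolation",
        applies=lambda meta: "tenant_id" in meta,
        check=lambda user, meta: user["tenant_id"] == meta["tenant_id"],
        on_fail="redact",
    ),
    Rule(
        name="restricted_classification",
        applies=lambda meta: meta.get("classification") == "restricted",
        check=lambda user, meta: "compliance" in user["roles"],
        on_fail="deny_request",
    ),
]


def evaluate_chunk(user, meta):
    # A chunk must satisfy every rule that applies to it.
    outcome = "permit"
    for rule in RULES:
        if rule.applies(meta) and not rule.check(user, meta):
            if rule.on_fail == "deny_request":
                return "deny_request"  # the strictest mode wins immediately
            outcome = "redact"
    return outcome


def evaluate_request(user, chunk_metadata_list):
    decisions = [evaluate_chunk(user, meta) for meta in chunk_metadata_list]
    action = "deny_request" if "deny_request" in decisions else "forward"
    return action, decisions
```

Letting "deny_request" take precedence over "redact" is a design choice in this sketch: a single chunk that triggers a denial rejects the whole request, regardless of how the other chunks fared.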

The audit record for a RAG request

The per-decision audit record captures the RAG context assembly and the redaction decisions.

The record shows that the assembled prompt had eight chunks, one was redacted at the gateway because its tenant_id did not match the user's tenant_id, and seven were permitted into the final prompt. The signature makes the record tamper-evident.
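A record of that shape, with a signature over its contents, could be produced along these lines. The field names, the policy version string, and the HMAC-SHA256 scheme with a shared demo key are assumptions for the sketch. A production gateway might use asymmetric signatures and a managed key instead.

```python
import hashlib
import hmac
import json

SIGNING_KEY = b"demo-key-not-for-production"  # hypothetical key for the sketch


def canonical(record):
    # Stable serialization so the same record always yields the same bytes.
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")


def build_record(request_id, user, policy_version, chunks):
    decisions = [
        {
            "chunk_id": c["chunk_id"],
            "chunk_tenant_id": c["tenant_id"],
            "decision": "permit" if c["tenant_id"] == user["tenant_id"] else "redact",
        }
        for c in chunks
    ]
    record = {
        "request_id": request_id,
        "user_id": user["user_id"],
        "user_tenant_id": user["tenant_id"],
        "policy_version": policy_version,
        "chunks_total": len(decisions),
        "chunks_redacted": sum(d["decision"] == "redact" for d in decisions),
        "chunks_permitted": sum(d["decision"] == "permit" for d in decisions),
        "decisions": decisions,
    }
    signature = hmac.new(SIGNING_KEY, canonical(record), hashlib.sha256).hexdigest()
    return {**record, "signature": signature}


def verify_record(signed):
    # Recompute the signature over everything except the signature field.
    body = {k: v for k, v in signed.items() if k != "signature"}
    expected = hmac.new(SIGNING_KEY, canonical(body), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signed.get("signature", ""))


# Eight chunks, one of which belongs to another tenant.
chunks = [{"chunk_id": f"c{i}", "tenant_id": "acme"} for i in range(1, 9)]
chunks[4]["tenant_id"] = "globex"
record = build_record("req-001", {"user_id": "u-42", "tenant_id": "acme"}, "v1", chunks)
```

Any change to a signed record, such as editing the redaction count after the fact, causes `verify_record` to return False.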

DeepInspect

This is the RAG redaction pattern DeepInspect was built around. DeepInspect sits at the AI request boundary, parses the RAG context sidecar attached to the request, applies identity-bound policy against each chunk's metadata, redacts or denies based on the policy mode, and produces a per-decision audit record that captures the chunk-level decisions.

For deployments running RAG over multi-tenant data, whether under EU AI Act high-risk classification or on HIPAA-covered workloads, the chunk-level decisions are part of the audit evidence a regulator inspects. Application-level audit logs that record the LLM request and response do not capture the chunk-by-chunk authorization. The gateway-side record does.

If your AI deployment uses RAG and your enforcement model trusts the vector store filter to handle access control, the chunks that slip through are in the model context with no second layer to catch them.