RAG Prompt Injection: How the Retrieval Step Becomes the Attack Surface
RAG prompt injection turns the retrieval step into the attack surface. Adversarial content inside a retrieved document reaches the model context with the same trust level as the application instructions. The model has no architectural way to distinguish trusted spans from untrusted spans. This piece walks through the four retrieval paths that open the surface, the failure modes the model alone cannot close, and the inspection-layer controls that produce a deterministic decision and an audit record EU AI Act Article 12 reviewers will accept.

RAG prompt injection turns the retrieval step into the attack surface. Retrieval-augmented generation systems read documents from a vector store, a keyword index, or a search backend into the model context window before the model reasons. The retrieval step is the seam where content of varying provenance enters the prompt. The model has no architectural way to distinguish trusted application instructions from untrusted retrieved content because both arrive in the same context window with the same role. OWASP catalogs the attack class under LLM01 and treats RAG as a primary attack path. The inspection layer at the HTTP boundary between the retrieval output and the model call is the only point where the seam can be enforced deterministically and audited.
I want to walk through the four retrieval paths I have seen open the surface in production, where the model defenses fall short, and the architectural pattern that produces a defensible posture under EU AI Act Article 12 and DORA Article 19 review.
Why the retrieval step opens the surface
The application builds the prompt by concatenating the system instructions, the user query, and the retrieved chunks. The retrieved chunks arrive from a store that may contain content authored by the organization, content uploaded by users, content scraped from public sources, content imported from vendors, and content placed in the store by other agents. The trust levels differ. The model treats the union as a single context.
The attacker's payload sits inside the retrieved chunk. The application's content filter inspected the user query and approved. The retrieval engine returned the chunk because it matched the embedding similarity threshold. The model reads the payload and may follow the embedded instructions. The application never saw a payload because the payload was not in the user query.
The architectural failure is the lack of a labeled boundary between trusted and untrusted spans in the prompt. The inspection-layer response treats the retrieved span as a distinct trust zone, applies a separate policy to it, and commits an audit record that names the chunk source.
Path 1: poisoned chunks in the vector store
The first path is direct corpus poisoning. The attacker places a document into the corpus that contains adversarial instructions inside otherwise plausible content. The document is indexed. A future query that embeddings-matches the document retrieves it. The model reads the payload.
The poisoning channel may be a user upload feature, a public-content scraper, a vendor data feed, or a federated index that aggregates corpora the organization does not control. The corpus owner has limited ability to audit every chunk for injection patterns at scale. The inspection-layer response is a chunk-classification pass that evaluates every retrieved chunk against injection pattern matchers before it enters the prompt.
Path 2: indirect injection through user-uploaded documents
The second path is user-uploaded indirect injection. The user uploads a PDF, a Word document, or a web page into the RAG application. The application chunks the document, embeds the chunks, and indexes them. Subsequent queries retrieve the chunks. The model reads the embedded payload.
The attack pattern is well-documented: white-on-white text inside PDFs, zero-width Unicode characters inside HTML, instructions formatted as document comments or invisible metadata, prompt payloads inside image alt text the multimodal model reads. I covered the attack pattern in the indirect prompt injection breakdown.
The inspection-layer response evaluates uploaded content at ingestion time and at retrieval time. The two-stage check catches payloads the ingestion pass missed because the embedding pipeline canonicalized the document differently than the model will canonicalize it during retrieval.
Path 3: query-time web fetch
The third path is the agentic RAG pattern where the application fetches content at query time. The user asks a question, the application performs a web search, fetches the top results, and feeds the content to the model. The fetched page is attacker-controlled by definition.
The attack pattern includes SEO-poisoned pages where the visible content is benign and the hidden content contains the payload, pages with prompt injection in the metadata or the HTML comments, and pages that respond differently to model crawlers than to human browsers. The fetched content reaches the model with the same role and the same trust level as the application instructions.
The inspection-layer response evaluates the fetched page content separately from the user query, applies the stricter policy, logs the source URL, and rejects content matching the disallow list. The check fires for every fetch in an agentic loop, not only the first one.
Path 4: tool-output retrieval in agent pipelines
The fourth path is the tool-output retrieval pattern in agentic RAG. The agent calls a tool (a database query, an API call, a code execution), reads the result, and uses the result as context for the next model call. If the tool surface returns attacker-controlled content, the model reads the payload.
Common surfaces include database fields that contain user-supplied text, API responses from external services with attacker-influenced parameters, and code execution outputs where the script printed attacker-influenced strings. I covered the agent-loop variant in the agentic AI workflows analysis.
The inspection-layer response evaluates the tool output before the agent reads it into the next prompt, applies the stricter policy, and commits an audit record that names the tool source.
Why model-side defenses do not close the surface
Frontier models trained with RLHF or constitutional methods reduce the rate of compliance with overt injection payloads. The training is probabilistic. It does not enforce the organization's specific policy on the retrieved content. It cannot distinguish a benign quote of an injection-like phrase from an actual injection attempt. It produces no audit record that names the chunk source.
I argued the position in the model guardrails analysis. The model's safety training and the application's prompt construction discipline together reduce the surface. The inspection layer at the HTTP boundary closes the residual surface and produces the audit record.
What the audit record has to contain
EU AI Act Article 12 requires automatic recording of events over the lifetime of the system. The records must identify the natural person involved, capture the input data, and reconstruct the decision. For a RAG application, the input data is the union of the user query and the retrieved chunks. The record must capture which chunks were retrieved, from which sources, and how each chunk was classified.
The audit record that holds up under review carries the identity, the role, the user query, the retrieved chunk identifiers, the chunk sources, the classification outcome for each chunk, the policy version, the decision outcome, and a cryptographic signature. The record is committed before the model receives the prompt. The application never has custody of the write path.
DeepInspect
This is the architecture DeepInspect was built to provide. DeepInspect sits inline at the HTTP path between the application and any LLM. For RAG applications, the inspection layer evaluates the retrieved content at ingestion time and at retrieval time, applies a separate policy to each chunk, and commits the per-decision audit record. The inspection layer also evaluates the model output before the application returns it to the user.
DeepInspect is model-agnostic and retrieval-agnostic. The same enforcement layer protects RAG applications built on Pinecone, Weaviate, Qdrant, pgvector, OpenSearch, or any other store. The policy primitives are identical because the attack surface at the HTTP boundary is identical.
If your RAG application runs on application-controlled defenses, the residual indirect injection surface is broad. Run the free AI Readiness Check to see where the gaps sit in your stack.
Frequently asked questions
- Can I prevent RAG injection by sanitizing the corpus?
Corpus sanitization at ingestion reduces the rate of known-pattern payloads. The sanitization runs against the corpus the organization controls. Payloads that arrive through user uploads, agentic web fetches, or tool outputs do not pass through the ingestion sanitizer because they enter the prompt at query time. The runtime inspection layer is the only point where every chunk is evaluated regardless of how it entered the prompt. The two layers complement each other and both are necessary in a regulated RAG deployment.
- What about chunk filtering by similarity threshold?
Similarity thresholds are an embedding-quality measure. They do not detect adversarial content. A chunk that is semantically relevant to the query may still contain an injection payload. The threshold reduces the rate at which off-topic chunks reach the prompt; it does not classify chunk content for injection patterns. The inspection layer's classifier runs orthogonal to the embedding similarity check.
- How does the inspection layer handle long context windows in RAG?
The inspection layer scans the full retrieved content rather than a head sample. The latency budget supports the scan; from internal DeepInspect testing the enforcement overhead remains under 50 ms even at large context lengths. The architectural pattern that matters is the deterministic classification of every chunk, with the verdict committed to the audit record before the model receives the prompt.
- Does the inspection layer work with multi-modal RAG?
Multimodal RAG that retrieves images, audio, or video as context carries an additional injection surface: adversarial spans embedded in image metadata, audio transcripts, or video frame text. The inspection layer applies pattern matchers and content classifiers against the multimodal inputs in the same way it applies them against text chunks. The audit record names the modality and the source.
- What if the application uses RAG only against internally-controlled content?
Internally-controlled corpora reduce the corpus-poisoning surface. They do not eliminate the user-upload, agentic-fetch, or tool-output paths. They also do not produce the audit record the EU AI Act Article 12 reviewer expects. The inspection layer is necessary for the audit-record function even when the corpus is fully under the organization's control. The architectural value of the layer is the per-decision record and the deterministic policy evaluation, not only the injection detection.