Indirect Prompt Injection: How RAG and Tool-Use Pipelines Get Compromised Through Retrieved Content
Indirect prompt injection is the attack pattern where adversarial content reaches the model through a retrieved document, a tool result, or any other source the model treats as part of its context. The attacker never interacts with the application directly. The injection succeeds when the model executes the embedded instructions on the next retrieval or the next agent loop iteration. RAG pipelines and tool-using agents are exposed by construction. This piece walks through the attack mechanics, the surface area in production deployments, why the model alone cannot defend, and the request-boundary controls that produce a defensible posture.

Indirect prompt injection moves adversarial content into the model's context through a path the user did not control. The user submits a benign question, the application retrieves a document or calls a tool, and the model's next decision reads instructions the attacker placed in the retrieved content. The injection executes without the attacker interacting with the application directly. The pattern is the dominant attack vector for retrieval-augmented generation (RAG) pipelines and tool-using agents in production today. OWASP's LLM Top 10 ranks prompt injection as LLM01 and the indirect variant is responsible for most of the production incidents I see in the field.
I want to walk through the attack mechanics, the surface area in modern AI workflows, why model-side defense fails by construction, and the request-boundary controls that produce a defensible posture.
The attack mechanics
The attack has three steps. The attacker places adversarial content in a source the target application retrieves. The application retrieves the source and passes it to the model as context. The model treats the source content as instructions and executes them.
Three sources cover most of the production attack surface. The first is the corpus that a RAG pipeline indexes. A public web page, a customer-uploaded document, a third-party feed, a knowledge base entry the attacker has write access to. The content reaches the corpus through the application's normal ingestion pipeline. The injection waits in the corpus until a user query causes the retrieval.
The second is the output of a tool the agent calls. An external API that returns content the attacker controls. A search result from a search engine that the attacker has SEO'd into a top position. A scraped page returned by a fetch tool. The agent calls the tool, the tool result includes the injection, the model treats the result as part of its context for the next decision.
The third is the long-term memory store that some agents maintain. The agent writes notes to a persistent memory and reads them back across sessions. An attacker who can write to the memory store (through a vulnerability in the agent itself, or through a feature that lets multiple users share a memory) embeds an injection in the memory, and the agent executes it on the next session.
The surface area in production deployments
Three workflow patterns concentrate the indirect-injection risk.
Customer support automation that retrieves from a help-center corpus. The corpus contains pages the support team authored, pages that customers uploaded, and pages that integration partners syndicated. Any of those sources is a potential injection vector. A customer who can upload a document to a support ticket can embed an injection in the document. The agent retrieves the document on the next query, and the injection fires.
Enterprise search that retrieves from internal documents. The corpus contains documents from across the organization (wiki pages, Slack messages, email threads, file shares). Insider threats can place injections in any of these sources. The agent retrieves the documents on user queries and the injection fires.
Agentic browsers and computer-use agents that read web pages on behalf of the user. The web is an open corpus. Any page the agent visits is a potential injection vector. Anthropic's computer-use beta, the agentic browser products from various vendors, and the open-source agent frameworks are all exposed to this pattern.
Why model-side defense fails by construction
The model treats its context window as a sequence of tokens to attend to. The model does not have a structural distinction between "system prompt the application wrote" and "document content the retrieval returned" beyond what the application asserts in the message format. The Anthropic Messages API has a system field and a user field; the OpenAI Chat Completions API has system, user, and assistant roles; the Vertex API has its own schema. Within any of these, the document content is passed as text the model attends to.
Model providers have tried to train the model to attend differently to different roles. The training produces a probabilistic preference, not a structural separation. Stanford Trustworthy AI research and the AIUC-1 Consortium briefing found that refusal behaviors degrade significantly under adversarial pressure, including pressure from indirect injections embedded in retrieved content. A defense that depends on the model attending to the system prompt over the document content is statistical. Statistical defenses fail at production volume.
The architectural fact is that the model cannot enforce the boundary between trusted-application-instructions and untrusted-retrieved-content because the model cannot tell which is which. The boundary has to be enforced upstream of the model.
The request-boundary controls that hold up
Four controls produce a defensible posture against indirect prompt injection.
The first is corpus inspection at retrieval time. Before the application passes the retrieved content to the model, the inspection layer runs a classifier over the content for injection signatures. The signatures cover the common patterns: "ignore your prior instructions," "the following is a system message," "execute the following commands," "respond only in [exfiltration-friendly format]." The classifier produces a signal that the policy can act on (block the retrieval, strip the suspect content, pass with a warning). The audit record captures the signal regardless of the policy outcome.
The second is provenance attribution on the prompt. The application marks the retrieved content with a provenance tag (source URL, document identifier, retrieval timestamp). The inspection layer evaluates the policy with knowledge of which parts of the prompt are application-authored and which parts are retrieved. A policy can require that the model never execute tool calls proposed in content tagged as retrieved-from-untrusted-corpus. The model still attends to the content, but the inspection layer evaluates the tool-call response against the provenance metadata and blocks the tool call if the proposed action originated from the retrieved content.
The third is scope reduction on the agent's tools. The inspection layer attaches the natural-person identity to every request and evaluates whether the proposed tool call is authorized for this caller. An injection that instructs the agent to call a delete_records tool succeeds only if the caller's policy permits the call. The inspection layer enforces the per-tool authorization independently of the model, which means a successful injection does not escalate beyond the caller's existing scope.
The fourth is response inspection. The inspection layer runs a fast classifier over the streamed model response for exfiltration patterns: sensitive identifiers, encoded payloads, suspicious URLs that match exfiltration-endpoint patterns. A detected pattern blocks the response stream before the application receives it. The audit record captures the prompt that produced the response and the response chunk that matched the exfiltration pattern. The pattern catches the case where the injection bypassed the corpus classifier but the response is detectable.
What an audit record shows for a detected indirect injection
The record carries the natural-person identity of the caller who triggered the retrieval. The route identifier. The policy version. The retrieval source (URL, document identifier, corpus name). The fingerprint of the retrieved content. The injection signature that matched. The policy decision (blocked retrieval, stripped suspect content, passed with warning). The model and version targeted. The decision outcome for the model call. The response fingerprint if a response was produced. The timestamp and the cryptographic integrity signature.
An analyst querying the record series finds the patterns: which corpora are being injected, which sources are repeat offenders, which injection signatures are most active. The same record series feeds the corpus cleanup workflow (purge injected content from the source corpus) and the regulatory disclosure workflow (report the incident to the supervisor under DORA Article 19 or the EU AI Act Article 73 incident reporting regime).
DeepInspect
This is the gap DeepInspect closes for RAG pipelines and tool-using agents. DeepInspect sits inline between the calling application and any LLM endpoint over HTTP. For every request, DeepInspect runs the corpus classifier over the retrieved content the application passes in the prompt, evaluates the policy bundle against the natural-person identity and the corpus classification outcome, commits the per-decision audit record, and forwards the cleared request to the model. For responses, DeepInspect runs the response classifier on the streamed chunks and blocks responses that match exfiltration patterns the indirect injection was trying to produce.
The architecture handles the RAG pipeline (retrieved content reaches the model through the inspection layer's HTTP path), the tool-using agent (tool results pass through the inspection layer on the follow-up request), and the long-term memory case (memory reads cross the inspection layer's HTTP boundary). The audit record series captures every detected attempt, every policy decision, and every outcome in a format that the EU AI Act Article 12, Fannie Mae LL-2026-04, NIST AI RMF, and DORA Article 19 review accept.
If you are running RAG in production and the security review is asking how the application defends against injection in retrieved content, let's talk.
Frequently asked questions
- What sources are most commonly used for indirect prompt injection in production?
Three sources concentrate most of the production attack surface. Public web pages that an agentic browser visits. Customer-uploaded documents that a customer support agent retrieves. Internal documents in an enterprise search corpus where insider threats can place content. Other sources matter too (third-party feeds, search results, long-term memory stores) but the three above account for the majority of the incidents I see in the field. Each source has a different write surface and a different control point for the corpus operator to address.
- How does the inspection layer detect injection content in a retrieved document?
The inspection layer runs a classifier over the retrieved content at request time. The classifier matches against a maintained library of injection signatures: instructions to disregard prior instructions, instructions to assume a different persona, instructions to encode responses in exfiltration-friendly formats, instructions to call tools the caller is not authorized to invoke. The signatures evolve as new injection patterns are reported in OWASP LLM Top 10 updates, academic research, and incident response from production deployments. The classifier produces a signal the policy can act on. The pattern is the same as the direct-injection classifier, applied to the retrieved-content section of the prompt.
- Can the inspection layer block a tool call that originated from injected content in a retrieved document?
Yes, when the application marks the retrieved content with a provenance tag and the policy evaluates the proposed tool call with knowledge of the provenance. The model still attends to the retrieved content and may propose a tool call based on injected instructions. The inspection layer evaluates the proposed call against the provenance metadata: if the call originated from content tagged retrieved-from-untrusted-corpus, the policy can block the call. The architecture preserves the model's reasoning over the retrieved content while preventing the model's proposed actions from escalating beyond the caller's authorized scope.
- How does the architecture handle agentic browsers and computer-use agents that read web pages?
The same pattern applies. The agentic browser calls a fetch tool to read a web page. The fetch tool returns the page content to the agent. The agent's next request to the model carries the page content as context. The request flows through the inspection layer, which classifies the content for injection signatures and evaluates the policy bundle. A detected injection blocks the request or strips the suspect content. The architecture covers Anthropic's computer-use beta, OpenAI's Operator-style agents, and the open-source agent frameworks as long as the agent's model calls flow through the inspection layer's HTTP path.
- What does the audit record show for an indirect injection that was detected and blocked?
The record carries the natural-person identity of the caller, the retrieval source (URL, document identifier, corpus name), the fingerprint of the retrieved content, the injection signature that matched, the policy decision outcome (blocked retrieval, stripped suspect content, passed with warning), the model and version targeted, the timestamp, and the cryptographic integrity signature. An analyst querying the record series identifies repeat-offender corpora and feeds the corpus cleanup workflow. A compliance team querying the same record series produces the disclosure that the DORA Article 19 and the EU AI Act Article 73 incident reporting regimes expect.