← Blog

OWASP LLM01 Prompt Injection: The 2025 Update and What the Inspection Layer Enforces

OWASP LLM01 captures both direct and indirect prompt injection in a single category in the 2025 update. The architectural reason is that the control point is the same: the request boundary. Application-side defenses fail by construction because the application cannot tell which spans of the prompt the model treats as instructions. Model-side defenses fail because refusal training is probabilistic. This piece walks through the LLM01 attack surface, the inspection-layer controls that produce a defensible posture, the audit record that survives review under EU AI Act Article 12 and DORA Article 19, and the deployment pattern that fits a production AI stack.

ByParminder Singh· Founder & CEO, DeepInspect Inc.
Problem-Awareowaspllm01prompt-injectionllm-securityinline-enforcementaudit-logs
OWASP LLM01 Prompt Injection: The 2025 Update and What the Inspection Layer Enforces

OWASP LLM01 in the 2025 update covers direct prompt injection (the user typing instructions that override the system prompt) and indirect prompt injection (the model reading injected content from a retrieved document, a tool result, or a long-term memory store) under a single category. The reorganization reflects what production teams reported across two years of incident response: the architectural control point for both attack types is the same. The request boundary is where the inspection has to sit. The model cannot enforce the boundary because the model attends to its context window as a sequence of tokens to weigh, not as a structurally separated set of sources.

I want to walk through the LLM01 attack surface in production AI stacks, why model-side and application-side defenses fall short, the inspection-layer controls that produce a defensible posture, the audit record that survives review under the EU AI Act Article 12 and DORA Article 19, and the deployment pattern that fits a production AI stack.

The LLM01 attack surface in production

Direct injection sits in the request the application sends to the model. The user typed a message into the application, the application passed the message through to the model's prompt, and the message contained instructions that conflict with the system prompt. Most chat-style applications are exposed by construction because the user text reaches the model.

Indirect injection sits in content the model reads on the way to producing its response. Three sources concentrate the indirect-injection volume. The first is the corpus a RAG pipeline indexes: customer-uploaded documents, partner-syndicated content, public web pages the corpus crawls. The second is the output of a tool the agent calls: an external API result, a search engine result, a scraped page. The third is the long-term memory store an agent maintains across sessions.

A combined injection sits across both surfaces. The user types a benign question, the application retrieves a document the attacker injected, and the model's next response executes the injected instructions while the application's input filter sees a clean user message.

Why application-side defenses fail

Application-side defenses fall into two patterns. The first is input sanitization: a filter over the user-typed text that strips suspect strings. The pattern fails because the attacker can encode the injection (base64, ROT13, language-of-the-target-model), can reword it (the model attends to semantics, not literal strings), and can split it across a retrieved document the input filter does not inspect.

The second is output filtering: a filter over the model's response that catches dangerous tokens. The pattern fails because the model has already executed the injected instruction by the time the response is filtered. A successful injection that exfiltrated a secret produces a response the filter cannot reverse. A successful injection that caused the agent to call a destructive tool produces a side effect the response filter cannot undo.

The architectural fact is that the application's defense sits inside the application code path, where the data the application has access to is the data the application's developers exposed to the filter. The injection sits in data the filter did not expect to inspect.

Why model-side defenses fail

Model providers train models to attend differently to system prompts and user content. The Anthropic Messages API has a system field and a user field. The OpenAI Chat Completions API has system, user, and assistant roles. The Vertex API has its own schema. The model sees these as structural hints, not as a hard separation.

Refusal training produces a probabilistic preference for following the system-prompt instructions over user-content instructions when the two conflict. Stanford Trustworthy AI and the AIUC-1 Consortium briefing found that refusal behaviors degrade significantly under adversarial pressure, including pressure from indirect injections embedded in retrieved content. Refusal training cannot produce a structural separation because the model has no architectural concept of "trusted source" vs "untrusted source."

The boundary between trusted application instructions and untrusted retrieved content has to be enforced upstream of the model.

The inspection-layer controls that hold up

Four controls at the inspection layer produce a defensible posture against LLM01.

The first is prompt classification at the request boundary. Before the request reaches the model, a classifier runs over the prompt content and identifies injection signatures. The signature library covers instructions to disregard prior context, instructions to assume a different persona, instructions to encode responses in exfiltration-friendly formats, and instructions to call tools the caller is not authorized to invoke. The classifier produces a deterministic signal the policy can act on.

The second is provenance attribution on retrieved content. The application marks the retrieved content with a provenance tag (source URL, document identifier, retrieval timestamp). The policy evaluates the request with knowledge of which spans are application-authored and which spans are retrieved. A stricter rule applies to spans from untrusted corpora.

The third is per-tool authorization at the request boundary. The natural-person identity of the caller attaches to every request. The policy evaluates whether the proposed tool call is authorized for the caller. A successful injection that proposes an unauthorized tool call fails the policy at the inspection layer before the application executes the tool.

The fourth is response inspection. A classifier runs over the streamed response chunks and matches against exfiltration patterns (sensitive identifiers, encoded payloads, suspicious URLs). A detected pattern blocks the response stream before the calling application receives it. The architecture catches the case where the injection bypassed the prompt classifier but the response signal is detectable.

The audit record that survives review

The audit record the inspection layer commits for each LLM01-relevant request carries the timestamp, the natural-person identity of the caller, the route identifier, the policy version, the model and version targeted, the request fingerprint, the response fingerprint, the classifier signals (prompt-injection score, retrieval-source classification, response-exfiltration score), the policy decision, and the cryptographic integrity signature. The record persists in a store the application cannot modify.

The EU AI Act Article 12 expects "automatic recording of events (logs) over the lifetime of the system" sufficient to ensure traceability. The per-decision record covers the traceability dimension explicitly: which user submitted which prompt against which model with which policy state, and what the inspection layer did about it. DORA Article 19 expects evidence of major incident detection and response in the financial-services scope. The same record series produces the incident notification artifact. The Fannie Mae LL-2026-04 disclosure-on-demand obligation for mortgage lenders deploying AI reads against the same record series.

The deployment pattern

The inspection layer integrates as a single HTTP hop. The application code keeps the same SDK calls (the OpenAI SDK, the Anthropic SDK, the Vertex SDK). The base URL points at the inspection layer instead of pointing directly at the provider. The inspection layer attaches identity context to every request from the calling JWT or service token, runs the classifier and the policy bundle, forwards the cleared request to the model provider, runs the response classifier on the streamed response, and commits the audit record.

The deployment latency overhead measures under 50 ms in internal testing against an LLM inference baseline of 500 ms to 5 seconds. The overhead is dwarfed by the inference time and falls inside the 22-second window that machine-speed attacks operate within (Google Mandiant M-Trends 2026).

The architecture covers the OpenAI, Anthropic, Vertex, and Bedrock endpoints, the agent frameworks built on top (LangChain, LlamaIndex, the Anthropic Computer Use beta, the OpenAI Operator pattern), and the retrieval pipelines the agents consume. A new approved model gets added to the policy bundle. A deprecated model gets removed. An endpoint that fails security review gets blocked. The application code does not change.

DeepInspect

This is the gap DeepInspect closes for LLM01. DeepInspect sits inline between the calling application and any HTTP LLM endpoint. For every request, DeepInspect runs the prompt-injection classifier over the prompt content, runs the retrieved-content classifier over any spans tagged with retrieval provenance, evaluates the per-tool authorization against the caller's identity, runs the response classifier on the streamed response, and commits the per-decision audit record. The audit record series carries the evidence the EU AI Act Article 12, DORA Article 19, Fannie Mae LL-2026-04, and the NIST AI RMF reviewer accept.

The architecture covers direct injection, indirect injection through RAG pipelines, indirect injection through tool results, and indirect injection through long-term memory stores in the same control point. The signature library evolves as OWASP, academic research, and incident response surface new patterns. The deployment integrates as a single HTTP hop.

If you are running an LLM in production and the security review is asking how the application defends against LLM01 prompt injection, let's talk.

Frequently asked questions

Why did OWASP merge direct and indirect prompt injection into a single LLM01 category in the 2025 update?

The architectural control point is the same. Both direct and indirect injection succeed because the model attends to its context window as a sequence of tokens with no structural separation between trusted application instructions and untrusted external content. The control has to sit at the request boundary, where the inspection layer can classify the prompt content and act on the classification before the request reaches the model. Splitting the category across two items in the prior list suggested two different controls were needed. The 2025 update reflects the operational reality.

How does the inspection layer classify retrieved content for injection signatures without breaking legitimate RAG retrieval?

The classifier produces a signal the policy can act on, not a hard block. The policy bundle decides how to act on the signal based on the caller's identity, the retrieval source, the data class, and the application's risk tolerance. A signal on content from a high-trust corpus might trigger a logged warning. The same signal on content from a user-uploaded document might trigger a hard block. The architecture preserves the legitimate RAG flow and adds a deterministic control point.

Can the inspection layer block a tool call that originated from injected instructions in a retrieved document?

Yes, when the application marks the retrieved content with a provenance tag and the policy evaluates the proposed tool call against the provenance metadata. The model still attends to the retrieved content and may propose a tool call based on injected instructions. The inspection layer evaluates the proposed call: if the call originated from content tagged retrieved-from-untrusted-corpus, the policy blocks the call. The architecture preserves the model's reasoning over the retrieved content while preventing the model's actions from escalating beyond the caller's authorized scope.

How does the per-decision audit record support the EU AI Act Article 12 traceability obligation for an LLM01 incident?

The record carries the timestamp, the natural-person identity of the caller, the route, the policy version, the model and version, the prompt fingerprint, the response fingerprint, the classifier signals, the policy decision, and the cryptographic integrity signature. An auditor querying the record series for the incident reconstructs which user submitted which prompt against which model with which policy state, and what the inspection layer did about it. The Article 12 traceability language reads against this record series directly. The write-path independence of the inspection layer (the application cannot modify the record) satisfies the auditor's question about evidence integrity.

What is the latency overhead of the inspection layer for LLM01 controls in production?

Internal DeepInspect testing measures end-to-end enforcement overhead under 50 ms against an LLM inference baseline of 500 ms to 5 seconds. The overhead covers the prompt classifier, the policy evaluation, the audit record commit, and the response classifier on the streamed chunks. The figure falls inside the 22-second window the Google Mandiant M-Trends 2026 report identified for machine-speed attack handoffs and is dwarfed by the inference time itself.