Why is the model-side safety training not enough?

The model-side safety training is probabilistic. Stanford Trustworthy AI research documented that refusal behaviors degrade significantly under targeted fine-tuning and adversarial pressure. The training catches a class of attacks and misses others. The architectural property that matters is that the safety training lives inside the inference process and the enterprise has no enforcement guarantee, no independent record, and no policy decision point. Anything that depends on the model refusing is a gap. The external inspection points (request-path, response-path, downstream action) cover the gap with deterministic policy and independent audit records.

Does defense in depth add unacceptable latency?

The two external inspection points the proxy runs (request and response) add enforcement overhead under 50 milliseconds in internal DeepInspect testing. The model inference latency runs 500 milliseconds to 5 seconds. The proxy overhead is under 10% of the inference budget and frequently under 2%. The user-perceived latency is dominated by the model. The defense-in-depth architecture does not introduce a meaningful latency cost beyond the single-layer architecture.
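The percentages follow from simple division. A minimal sketch, assuming the 50-millisecond figure is treated as the worst-case overhead and the two inference latencies quoted above are the bounds; the function name is illustrative:

```python
def overhead_fraction(proxy_ms: float, inference_ms: float) -> float:
    """Proxy overhead expressed as a fraction of model inference latency."""
    return proxy_ms / inference_ms

# Worst-case overhead (50 ms) against the fastest and slowest inference quoted.
fast = overhead_fraction(50, 500)    # the 10% upper bound
slow = overhead_fraction(50, 5000)   # 1% for a 5-second inference

# Inference latency above which a 50 ms overhead stays under 2%.
breakeven_ms = 50 / 0.02             # about 2500 ms
```

Under these assumptions, the "under 2%" case applies whenever inference takes longer than roughly 2.5 seconds, or whenever the measured overhead is well below the 50-millisecond ceiling.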

How do we tell which layer caught an attack?

The audit record covers each inspection point separately and reconciles on the conversation identifier and the turn identifier. An incident review reads the request-path record, the response-path record, and (where available) the model-side log, and reconstructs which layer fired and which layer the attack bypassed. The reconciliation is the evidence the EU AI Act Article 12 obligation expects, and the data the NIST AI RMF Manage function uses for incident response.

What about prompt injection through retrieved documents?

Indirect prompt injection through retrieved documents bypasses the request-path classifier because the classifier reads the prompt the caller sent, not the documents the RAG pipeline retrieves. The architectural answer is two-fold: the retrieval layer applies its own classification on documents before they land in the context window, and the response-path classifier catches the successful injection by signal in the model response. The defense-in-depth coverage extends to the retrieval boundary as a fourth inspection point in deployments that take the residual risk seriously.

Does the defense-in-depth pattern work with streaming responses?

The proxy handles streaming responses by buffering the stream up to a policy-defined chunk size, running the response-path classifier on each chunk, and forwarding the chunk to the caller after the policy decision. The chunk size determines the latency budget and the inspection depth. A smaller chunk size produces lower latency and shallower inspection per chunk. A larger chunk size produces deeper inspection at the cost of higher first-token latency. The deployment configures the trade-off against the user experience.

Prompt Injection Defense in Depth: The Three Inspection Layers That Compose

OWASP ranks prompt injection as LLM01 in the LLM Top 10, and the ranking has held since the list was first published. Prompt injection covers any input that subverts the model's intended behavior: a user message that overrides the system prompt, a retrieved document that smuggles an instruction into the context window, a tool output that carries an instruction the model executes on the next turn. No single defensive layer catches every attack. The deployment that survives in 2026 composes three inspection layers: request-path classification at the AI request boundary, model-side safety training inside the inference process, and response-path inspection on the return path. The three layers see different attacks. The combination produces stronger coverage than any layer in isolation.

I want to walk through what each layer actually inspects, where each one is blind, and how the per-decision audit record reconciles the decisions across layers.

The three layers and what each one sees

The request-path layer sits at the AI request boundary. The inspection point reads the JSON request body before it forwards to the model, classifies the prompt content, and applies policy that decides pass, redact, or deny. The layer sees the prompt as a structured field and runs classifiers tuned for injection patterns.
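To make the pass, redact, or deny decision concrete, here is a minimal sketch of a request-path check. It assumes a chat-style JSON body with a `messages` array whose `content` fields are plain strings, and it uses toy regular expressions as stand-ins for a trained classifier. The field names, patterns, and function name are illustrative assumptions, not a description of any particular product.

```python
import json
import re

# Toy stand-ins for a trained injection classifier (illustrative patterns only).
DENY_PATTERNS = [re.compile(r"ignore (all )?(previous|prior) instructions", re.I)]
REDACT_PATTERNS = [re.compile(r"\b\d{3}-\d{2}-\d{4}\b")]  # SSN-shaped strings

def inspect_request(raw_body: str) -> tuple[str, str]:
    """Return (decision, body_to_forward) for one JSON request body."""
    body = json.loads(raw_body)
    decision = "pass"
    for message in body.get("messages", []):
        text = message.get("content", "")
        if any(p.search(text) for p in DENY_PATTERNS):
            return "deny", ""  # nothing is forwarded to the model
        for p in REDACT_PATTERNS:
            redacted = p.sub("[REDACTED]", text)
            if redacted != text:
                decision = "redact"
                text = redacted
        message["content"] = text
    return decision, json.dumps(body)
```

The point of the sketch is the shape of the decision: the body is parsed as structured data, each prompt field is classified, and the forwarded body is either unchanged, rewritten, or withheld.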

The model-side safety layer lives inside the inference process. The model provider trains the model on refusal patterns through RLHF, constitutional AI, and fine-tuning on adversarial examples. The layer sees the prompt as the model processes it and shapes the response according to the safety training.

The response-path layer sits on the return path. The inspection point reads the model response before it returns to the application, classifies the response content, and applies policy that decides pass, rewrite, or deny. The layer sees signals that an injection succeeded: sudden language switches, scope departures, execution of instructions the original prompt did not contain.
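One of those signals, the sudden language switch, can be approximated very roughly by comparing the dominant writing script of the prompt and the response. The sketch below is an assumption-laden heuristic using only the standard library: it detects a change of script (Latin to Cyrillic, for example) and would not notice a switch between two languages that share a script. The function names are hypothetical.

```python
import unicodedata
from collections import Counter

def dominant_script(text: str):
    """Most common Unicode script prefix among letters, e.g. 'LATIN' or 'CYRILLIC'."""
    scripts = Counter(
        unicodedata.name(ch, "UNKNOWN").split(" ")[0]
        for ch in text
        if ch.isalpha()
    )
    return scripts.most_common(1)[0][0] if scripts else None

def script_switched(prompt: str, response: str) -> bool:
    """True when prompt and response are written mostly in different scripts."""
    p, r = dominant_script(prompt), dominant_script(response)
    return p is not None and r is not None and p != r
```

A production response-path classifier would combine several such signals; this one is shown only because it is easy to verify by hand.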

The three layers compose into defense in depth. An attack that passes the request-path classifier may fail at the model-side layer. An attack that passes both may fail at the response-path layer. An attack that passes all three is a residual risk that the deployment quantifies and accepts with compensating controls (human review, tighter use-case scoping, smaller blast radius per call).

Where the request-path layer is blind

The request-path classifier reads the prompt the caller sent. The classifier does not read documents the model retrieves during the call, tool outputs that arrive mid-conversation, or context the model loads from connected stores. Three classes of injection consequently bypass the request-path layer.

The first class is indirect prompt injection through retrieved documents. The user asks a benign question. The RAG pipeline retrieves a document that contains a hidden instruction. The model reads the document, executes the instruction, and the user receives content the original prompt never asked for. The request-path classifier saw a clean question.

The second class is tool-output injection. The model calls a tool (a web fetch, a database query, an API call to an internal service). The tool returns content that contains an instruction. The model reads the tool output and executes the instruction. The request-path classifier saw a clean prompt and a tool call that looked authorized.

The third class is multi-turn injection that is staged across earlier turns the request-path classifier already passed. A user message at turn 3 establishes a context. A user message at turn 8 references the established context in a way that subverts the original system prompt. Each turn looked benign in isolation, but the composition over the conversation produced the attack.
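The gap between per-turn and conversation-level inspection can be illustrated with a toy scoring scheme. This is a sketch under stated assumptions: the weak-signal phrases, weights, and threshold are invented for illustration, and conversation-level scoring is one possible mitigation rather than something the layers described here are known to implement.

```python
# Hypothetical weak signals and weights; a real classifier would be a trained model.
WEAK_SIGNALS = {"from now on": 0.4, "as we agreed": 0.4, "your earlier rules": 0.4}
THRESHOLD = 0.7

def turn_score(text: str) -> float:
    """Sum the weights of the weak signals present in a single turn."""
    lowered = text.lower()
    return sum(w for phrase, w in WEAK_SIGNALS.items() if phrase in lowered)

def per_turn_flags(turns: list[str]) -> list[bool]:
    """Classify each turn in isolation, as a per-request classifier would."""
    return [turn_score(t) >= THRESHOLD for t in turns]

def conversation_flag(turns: list[str]) -> bool:
    """Classify the accumulated score across the whole conversation."""
    return sum(turn_score(t) for t in turns) >= THRESHOLD
```

With these numbers, two turns that each carry one weak signal pass individually but cross the threshold together, which mirrors how each turn can look benign while the conversation as a whole does not.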

The architectural answer differs by class: upstream defense at the retrieval layer (the document store applies its own classification before content lands in the context window), inspection at the tool boundary (the tool's response is inspected before the model reads it), or the response-path layer, which catches the successful injection by signal.
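For the retrieval-layer case, the idea of classifying documents before they land in the context window can be sketched as a filter between the retriever and the prompt builder. The pattern, the document shape (`id` and `text` keys), and the function name are assumptions made for illustration; the regular expression stands in for whatever classifier the retrieval layer actually runs.

```python
import re

# Illustrative pattern; stands in for the retrieval layer's own classifier.
INSTRUCTION_PATTERN = re.compile(
    r"(ignore (all )?(previous|prior) instructions|you must now|system override)",
    re.I,
)

def filter_retrieved(documents: list[dict]) -> tuple[list[dict], list[str]]:
    """Split retrieved documents into context-safe documents and quarantined ids."""
    clean, quarantined = [], []
    for doc in documents:
        if INSTRUCTION_PATTERN.search(doc["text"]):
            quarantined.append(doc["id"])  # never reaches the context window
        else:
            clean.append(doc)
    return clean, quarantined
```

The quarantined identifiers are returned rather than silently dropped so that the retrieval boundary can write its own audit entry.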

Where the model-side safety layer is blind

The model-side safety layer is probabilistic. The Stanford Trustworthy AI research and the AIUC-1 Consortium briefing, developed with CISOs from Confluent, Elastic, UiPath, and Deutsche Borse alongside researchers from MIT Sloan, Scale AI, and Databricks, documented that refusal behaviors of model-level guardrails were significantly degraded under targeted fine-tuning and adversarial pressure.

Three classes of attack systematically degrade the model-side layer. The first is jailbreak-style framing that reframes the prompt as a role-play, a hypothetical, or a fiction-writing exercise. The second is gradual escalation across the conversation that incrementally weakens the model's refusal posture. The third is adversarial fine-tuning of the model itself, which is available for open-weight models and for some provider-hosted fine-tuning APIs.

The model-side layer is part of defense in depth and not a substitute for it. The architectural property the layer carries is that it operates inside the inference process and the enterprise has no enforcement guarantee, no independent record, and no policy decision point. Anything that depends on the model refusing is a gap that needs an external layer to back it.

Where the response-path layer is blind

The response-path classifier reads the model response. The classifier does not read the model's internal reasoning (a hidden chain of thought, or reasoning traces that the model conceals), and it does not read downstream actions the application takes after the response (tool calls the application initiates, database writes the application performs, downstream API calls).

The response-path layer catches injection successes that surface in the response text. The layer misses injection successes that influence downstream action without surfacing in the response. The architectural answer is an enforcement point at the downstream action boundary: a control on outbound API calls the application makes after the model response, a control on database writes the application performs, a control on tool calls the agent runtime issues. Each is a separate inspection point.
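A control at the downstream action boundary can be as simple as an allowlist checked before the agent runtime executes a tool call. The sketch below assumes a hypothetical policy table keyed by use case, tool, and target; none of these names come from a real runtime.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolCall:
    tool: str
    target: str

# Hypothetical policy: which tools and targets each use case may touch.
POLICY = {
    "support-bot": {
        "search_kb": {"kb.internal"},
        "http_get": {"status.example.com"},
    },
}

def authorize(use_case: str, call: ToolCall) -> bool:
    """Allow the call only if the tool and target are listed for the use case."""
    allowed_targets = POLICY.get(use_case, {}).get(call.tool, set())
    return call.target in allowed_targets
```

Because the check is deterministic and default-deny, an injection that steers the model toward an unlisted tool or target fails here even if nothing suspicious surfaced in the response text.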

The response-path layer also misses streaming-response injections that arrive in chunks if the chunk-level inspection cannot reconstruct the full intent of the response within the chunk size budget. The architectural trade-off is the latency budget the streaming deployment carries.
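The chunk-boundary blind spot is easy to reproduce. The sketch below buffers a stream into fixed-size chunks, classifies each chunk, and forwards it after the decision. The optional `overlap` parameter, which carries the tail of already-inspected text into the next classification window, is an illustrative mitigation and an assumption of this sketch, as are the toy signal and all names.

```python
from typing import Iterable, Iterator

BLOCKED = "ignore previous instructions"  # toy signal, not a real classifier

class Denied(Exception):
    """Raised when the response-path policy denies the rest of the stream."""

def classify(window: str) -> None:
    if BLOCKED in window.lower():
        raise Denied("response-path policy: deny")

def inspect_stream(
    tokens: Iterable[str], chunk_size: int, overlap: int = 0
) -> Iterator[str]:
    """Buffer tokens into chunk_size-character chunks, classify, then forward."""
    buffer, tail = "", ""
    for token in tokens:
        buffer += token
        while len(buffer) >= chunk_size:
            chunk, buffer = buffer[:chunk_size], buffer[chunk_size:]
            classify(tail + chunk)  # policy decision before forwarding
            tail = (tail + chunk)[-overlap:] if overlap else ""
            yield chunk
    if buffer:
        classify(tail + buffer)
        yield buffer
```

With no overlap, a phrase that straddles two chunks is never seen whole. With overlap, the classifier sees it, but the earlier chunk has already been forwarded, so the deny only stops the remainder of the stream. A larger chunk size avoids both problems at the cost of first-token latency.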

The audit record that reconciles across layers

A per-decision audit record covers each inspection point: one record for the request-path decision, one record for the response-path decision, and (where the deployment captures it) a record for the model-side safety log the provider exposes. The three records reconcile on the conversation identifier, the turn identifier, and the timestamp.
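A minimal sketch of that reconciliation, assuming each layer emits a record with the three join fields plus a layer name and a decision. The record shape and function names are hypothetical; real audit schemas will carry far more context.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class AuditRecord:
    conversation_id: str
    turn_id: int
    layer: str        # "request", "model", or "response"
    decision: str     # e.g. "pass" or "deny"
    timestamp: float

def reconcile(records):
    """Group records by (conversation_id, turn_id), ordered by timestamp."""
    turns = defaultdict(list)
    for record in sorted(records, key=lambda r: r.timestamp):
        turns[(record.conversation_id, record.turn_id)].append(record)
    return dict(turns)

def layer_summary(turn_records):
    """Report which layers passed the turn and which layer first blocked it."""
    bypassed = [r.layer for r in turn_records if r.decision == "pass"]
    caught = next((r.layer for r in turn_records if r.decision != "pass"), None)
    return {"bypassed": bypassed, "caught_by": caught}
```

For a turn where the request-path record says pass and the response-path record says deny, the summary names the request path as bypassed and the response path as the layer that caught the attack.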

The reconciliation matters for incident response. A failed inspection at one layer is visible in the record. The deployer can trace which layer caught the attack, which layer it bypassed, and what the residual risk was. The IBM Cost of Data Breach Report 2026 reported a 247-day detection window for shadow AI breaches. A defense-in-depth architecture with a unified audit record reduces the detection window to the moment the response-path layer or the downstream action layer fires.

What the 2026 compliance set expects

EU AI Act Article 9 requires a risk management system that identifies, estimates, evaluates, and treats risks across the lifecycle. Prompt injection is one of the risk classes the obligation expects to see addressed. Defense in depth across the three layers is the operational answer. The August 2, 2026 deadline applies.

Article 12 requires automatic logging over the system lifetime. The defense-in-depth audit reconciles across the three layers. Article 15 requires accuracy, resilience, and cybersecurity, and the reading of resilience includes the ability to withstand adversarial inputs. Prompt injection is the most studied class of adversarial input.

The NIST AI RMF Manage function expects incident response evidence. The reconciled audit record is the evidence the function calls for. ISO 42001 clause 8.2 expects operational controls that perform as intended. Defense in depth across the three layers is the operational pattern that satisfies the clause.

Where most deployments are exposed

Most enterprise AI deployments today operate only one of the three layers, the model-side safety training, and rely on it to do the work of the other two. A deployment with no request-path classifier and no response-path classifier turns every model-side failure into a deployment failure.

The Cloud Radix figure of 86% IT leader blindness applies here. The deployment that has no inspection point at the AI request boundary cannot quantify its prompt injection exposure. The Netwrix figure of 37% of organizations with any AI governance policy applies at the upstream side. Without policy, the inspection points have no decision rules to apply.

The architectural answer is to deploy all three layers and reconcile the audit record. The cost of the inspection points is recoverable. The cost of an undetected injection is not.

DeepInspect

This is exactly what DeepInspect provides on the request and response paths. DeepInspect sits at the AI request boundary as an identity-aware proxy, runs an injection-detection classifier on the prompt body before the request reaches the model, and runs a separate classifier on the model response on the return path. Both decisions produce per-decision audit records that the application has no write path to.

The injection classifiers operate alongside the data classification, identity, role, and use-case policy gates the same proxy runs. A request that fails the injection check and the PHI authorization check produces a single composite decision with both reasons logged. The audit record carries the full decision context.
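The composite-decision behavior can be sketched as a loop that runs every gate and keeps every failure reason instead of stopping at the first one. The gate functions, request fields, and reason strings below are hypothetical and are not DeepInspect's actual interface.

```python
from dataclasses import dataclass, field

@dataclass
class Decision:
    action: str = "pass"
    reasons: list = field(default_factory=list)

def injection_gate(request):
    hit = "ignore previous instructions" in request["prompt"].lower()
    return (not hit, "injection pattern detected")

def phi_gate(request):
    allowed = request["role"] == "clinician" or not request["contains_phi"]
    return (allowed, "role not authorized for PHI")

def evaluate(request, gates):
    """Run every gate; any failure denies, and every failure reason is kept."""
    decision = Decision()
    for name, gate in gates.items():
        ok, reason = gate(request)
        if not ok:
            decision.action = "deny"
            decision.reasons.append(f"{name}: {reason}")
    return decision
```

Running all gates even after the first failure is what lets a single audit record show that a request failed both the injection check and the authorization check.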

Enforcement overhead runs under 50 milliseconds in internal DeepInspect testing, against LLM inference latency of 500 milliseconds to 5 seconds. The two inspection points (request and response) share the same proxy hot path. The defense-in-depth coverage on the request and response side does not add a second proxy or a second audit store.

The model-side safety layer remains the model provider's responsibility. Downstream action enforcement (on tool calls, database writes, and outbound API calls) is a separate inspection point and is the agent runtime's responsibility. DeepInspect covers the two layers that operate at the LLM request boundary and reconciles the audit record across both.

If your deployment relies only on the model's safety training to catch prompt injection, book a technical deep dive at deepinspect.ai.