Prompt Injection vs Jailbreak: Where the Two Attack Classes Diverge and What the Inspection Layer Enforces
Prompt injection and jailbreaking are distinct attack classes that public discussion often conflates. Jailbreaking targets the model provider safety training to produce content the provider intended to suppress. Prompt injection targets the application context boundary to override the application instructions or exfiltrate organization data. The defenses sit at different architectural layers. This piece walks through the distinction, where each defense layer fires, and the inspection-layer pattern that addresses both.

Prompt injection and jailbreaking are distinct attack classes that public discussion often treats as the same thing. The distinction matters because the defenses sit at different architectural layers and produce different audit records. Jailbreaking targets the model provider's safety training to produce content the provider intended to suppress. Prompt injection targets the application context boundary to override the application's instructions, exfiltrate the organization's data, or cause connected tools to act outside the user's authorization. A defense that addresses one without the other leaves a real attack surface.
I want to walk through the architectural distinction, where the model-provider defenses fire, where the application defenses fire, and the inspection-layer pattern that addresses the attack classes that fall between them.
The architectural distinction
Jailbreaking attacks the model provider's intent. The provider trained the model to refuse certain content categories: instructions for weapons synthesis, self-harm content, child safety violations, content the provider's policy excludes. The jailbreak payload causes the model to produce content from those categories despite the training. The harm is the production of the prohibited content. The defense is the provider's safety training, the post-training guardrails, and the abuse-detection systems the provider runs.
Prompt injection attacks the application's intent. The application defined a system prompt, a user role, an organizational policy, and a set of permitted data classes. The injection payload causes the model to violate the application's intent: leak the system prompt, return data the policy disallows, issue tool calls the user never approved. The harm is the organization-specific policy violation. The defense is the application's policy enforcement at the request boundary.
The same payload can fire both attack classes at the same time. A jailbreak that extracts a controlled-substance recipe through a healthcare-assistant application is both a jailbreak (the model produced content the provider tries to suppress) and a prompt injection (the application's policy disallows the response category). The defenses still sit at different layers.
Where the model provider's defenses fire
The model provider's defenses operate inside the inference process and on the surrounding service infrastructure. The training fine-tunes the model toward refusal of prohibited content. The post-training guardrails apply moderation classifiers to the prompt and the response. The abuse-detection systems flag accounts whose query patterns match known jailbreak campaigns.
The Stanford Trustworthy AI and the AIUC-1 Consortium briefing summarized by Help Net Security found that the refusal behaviors degrade under role-reversal framing, multi-step persuasion, encoded payloads, and adversarial fine-tuning. The defenses are probabilistic. They reduce the rate of jailbreak success. They do not enforce the enterprise's specific policy.
I covered the distinction in the jailbreaking defense analysis. The model provider owns the jailbreak defense at the model layer. The enterprise owns the policy enforcement at the application layer.
Where the application defenses fire
The application defenses operate inside the application process. The application validates the user input, constructs the prompt, applies content filters at chosen junctions, and parses the model output. The defenses run under the same custody as the rest of the application code. The application can modify, disable, or fail to commit the defense's verdict.
I argued the position in the audit trail analysis. Application-controlled defenses are self-attestation. The EU AI Act Article 12 and DORA Article 19 reviewers expect records from outside the application's custody. The application defenses are useful as a first pass. They are not sufficient as the only defense.
Where the gap sits between the two defense layers
The gap is the attack class that the provider's defenses do not address because the harm is enterprise-specific, and that the application's defenses do not address from a custody position the regulator will accept. Examples:
- A payload that causes the model to leak the organization's internal policy. The provider's safety training does not classify the organization's policy as protected. The application's filter ran inside the application's process and produced no externally auditable record.
- A payload that causes a connected tool to issue an action outside the user's authorization. The provider's training has no concept of the user's role. The application's tool dispatcher checked the schema match, not the policy match.
- A payload that exfiltrates PII or PHI inside a benign-looking response. The provider's safety training does not classify the data class as a refusal trigger. The application's output parser checked the schema, not the content classification.
These attack classes fall between the two layers. The inspection layer at the HTTP boundary closes the gap.
What the inspection layer does for both attack classes
The inspection layer at the HTTP path between the application and the model evaluates every request and every response against per-route, per-role policies. For the jailbreak class, the layer adds an enterprise-side moderation pass that fires before the request reaches the model and after the response returns. The pass catches jailbreak payloads that the provider's defenses would have caught only with a probabilistic delay or not at all. For the prompt injection class, the layer enforces the application's policy from outside the application's custody.
The layer produces a per-decision audit record for every evaluation. The record contains the identity, the role, the prompt content (with sensitive spans redacted per policy), the policy version, the decision outcome, and a cryptographic signature. The record is committed before the model receives the request or before the application acts on the response.
The inspection layer does not replace the model provider's safety training or the application's defenses. The three layers form defense in depth. The inspection layer is the layer that produces the deterministic, identity-bound, externally auditable decision and the audit record the regulator will accept.
Why the distinction matters for compliance posture
EU AI Act Article 12 and DORA Article 19 frame their requirements around per-decision audit records. The records must reconstruct the decision the AI system made, identify the natural person involved, and survive regulatory review. The records are not about the rate of jailbreak success or the rate of prompt injection compliance. The records are about the specific decisions the system made and the policy that governed each one.
A defense posture that addresses jailbreaks (the model provider's domain) but does not produce per-decision records misses the compliance obligation. A defense posture that addresses prompt injections (the application's domain) but stores the evaluation results inside the application's database misses the audit independence requirement. The inspection layer addresses both functions: deterministic policy evaluation at the boundary and audit records from outside the application's custody.
DeepInspect
This is the architecture DeepInspect was built to provide. DeepInspect sits inline at the HTTP path between the application and any LLM. The inspection layer evaluates every request and every response against per-route, per-role policies, applies redact or block actions where policy dictates, and commits a per-decision audit record signed and stored independently of the application.
DeepInspect's policy primitives address both attack classes. The jailbreak-class checks classify the prompt and the response against the organization's content policy. The prompt-injection-class checks evaluate the prompt for instruction-override patterns, role-reversal framing, indirect injection sources, and authority impersonation. The audit record names the pattern hit, the policy that fired, and the outcome.
If your AI deployment relies on the provider's safety training plus application-side filters, the gap between the two layers is the residual attack surface. Run the free AI Readiness Check to see where the gaps sit in your stack.
Frequently asked questions
- Is the OWASP LLM01 entry for prompt injection or for jailbreaks?
OWASP LLM01 is the Prompt Injection category and explicitly covers both direct and indirect injection. The 2025 update consolidated the categories because the control point is the same: the HTTP request boundary. The OWASP catalog separately tracks jailbreaking under the broader Sensitive Information Disclosure and Misuse categories where the model produces provider-prohibited content. The OWASP framing matches the architectural distinction in this piece: prompt injection is the application-boundary attack class and jailbreaks are the provider-policy attack class.
- Does ChatGPT Enterprise or Claude Enterprise address both classes?
The Enterprise tiers add data-handling commitments, SSO, and admin controls. They use the same models as the consumer tiers, with the same safety training. The jailbreak defense is the same. The prompt injection defense remains the application's responsibility because the Enterprise tier does not enforce the enterprise's specific policy. The audit log the Enterprise tier produces covers user activity at the application level; it does not cover the per-decision policy evaluation the regulator expects.
- What about the model provider's content moderation API?
OpenAI's Moderation API, Anthropic's content classifier, and Google's Vertex AI safety filters apply provider-defined content categories. The categories cover the jailbreak class. They do not cover the enterprise-specific prompt injection class because the providers cannot know the enterprise's policy, the user's role, or the data classification rules. The inspection layer at the HTTP boundary applies the enterprise-specific policy and produces the audit record.
- How does the inspection layer interact with red-team testing?
Red-team testing exercises the model provider's defenses and the application's defenses against catalogued payloads. The exercise produces a measure of the defense's effectiveness against known patterns. The inspection layer is a runtime defense, not a test. The two functions complement each other. Red-team results inform the inspection layer's policy updates. The inspection layer enforces the updated policies at every production request.
- Can I treat all prompt-related attacks the same way operationally?
No. The runbook for a jailbreak-class incident differs from the runbook for an injection-class incident. A jailbreak that produced provider-prohibited content triggers a notification to the provider, a review of the abuse-detection signal, and possibly a policy update on the inspection layer to catch the payload pattern earlier. An injection that caused a connected tool to issue an unauthorized action triggers a review of the tool authorization, the user's identity context, and the policy that governed the action. The two operational responses share the audit record format but follow different paths.