LLM jailbreak
An LLM jailbreak is an attacker technique that gets a model to produce output the model's training was supposed to refuse. Common patterns include persona injection ("you are DAN now"), hypothetical framing ("in a fictional story where the AI must explain..."), token-level encoding (base64, leetspeak, foreign-language wrapping), gradient-based adversarial suffixes, and many-shot context flooding. Jailbreaks target the model's refusal behavior; prompt injection targets the application's intent. Both share OWASP LLM01 as the canonical category, and both reduce to the same mechanism: untrusted text reaches the context window and shifts the model's effective objective.
How jailbreak success rates evolve
Refusal training shifts with each model release, so a jailbreak technique that succeeded against last quarter's checkpoint may fail against this quarter's. The Stanford Trustworthy AI / AIUC-1 Consortium briefing summarized in March 2026 found that refusal behaviors of model-level guardrails were significantly degraded under targeted fine-tuning and adversarial pressure. The reader of an EU AI Act Article 12 audit is not interested in the success rate of the latest jailbreak under controlled testing. The reader is interested in whether the deployer can show what was attempted, what was blocked, and what was permitted, on each specific request in the timeframe under review.
Where the inspection layer fits jailbreaks into the audit record
A policy decision point at the AI request boundary classifies inbound prompts against known jailbreak patterns and produces a verdict the policy can act on. The policy can pass the request to the model, neutralize the payload, or block and fail closed. The audit record captures the classifier verdict and the policy decision regardless of whether the model itself would have refused. That separation matters because the audit reader can no longer rely on "the model refused" as the proof of compliance once the model behavior is probabilistic. The independent inspection record is the artifact that survives regulatory review under the EU AI Act, NIST AI RMF, and DORA logging obligations.
Related reading
- Jailbreaking LLMs: What the Attack Looks Like in Production and the Request-Boundary Defense That Holds Up
Jailbreaking is the class of attacks where adversarial prompts cause the model to disregard the safety training and produce content the provider intended to suppress. The attack catalog spans role-play framing, multi-step persuasion, encoded payloads, and the fine-tuning bypass that targets the refusal patterns directly. Stanford Trustworthy AI and the AIUC-1 Consortium research found that refusal behaviors degrade significantly under adversarial pressure. This piece walks through the attack patterns in production, why the model alone cannot defend, and the request-boundary controls and audit record format that produce a defensible posture.
- OWASP LLM01 Prompt Injection: The 2025 Update and What the Inspection Layer Enforces
OWASP LLM01 captures both direct and indirect prompt injection in a single category in the 2025 update. The architectural reason is that the control point is the same: the request boundary. Application-side defenses fail by construction because the application cannot tell which spans of the prompt the model treats as instructions. Model-side defenses fail because refusal training is probabilistic. This piece walks through the LLM01 attack surface, the inspection-layer controls that produce a defensible posture, the audit record that survives review under EU AI Act Article 12 and DORA Article 19, and the deployment pattern that fits a production AI stack.
- Prompt Injection in Production: Where It Happens, What It Costs, and How To Prevent It at the Request Boundary
Prompt injection is the class of attacks where adversarial content in a prompt overrides the application instructions or extracts data the model was not authorized to reveal. The attack surface includes direct user prompts, indirect injection through retrieved documents and tool results, and chained injection through agent loops. OWASP has consistently ranked prompt injection as the top LLM vulnerability. This piece walks through the attack mechanisms in production, the failure modes of model-side defenses, the request-boundary controls that produce a defensible posture, and the audit record format that holds up after an attempt is detected.