RLHF | DeepInspect Glossary

RLHF

RLHF stands for Reinforcement Learning from Human Feedback. A pretrained language model is fine-tuned against a reward model that was itself trained on human preference comparisons over model outputs. The technique was popularized by OpenAI's InstructGPT paper in 2022 and underlies ChatGPT, Claude, and most production assistant models. RLHF shifts the model's response distribution toward outputs humans prefer, including refusals on harmful or policy-violating prompts. The effect on the deployed model is probabilistic: RLHF moves the average behavior, but it provides no enforceable guarantee about the outcome of any single inference.
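The gap between average behavior and a per-request guarantee can be illustrated with a little arithmetic. The sketch below assumes, purely for illustration, that every request independently has the same small probability of slipping past refusal training; both the independence assumption and the example rate are hypothetical, not measurements of any model.

```python
def prob_at_least_one_failure(p_fail: float, n_requests: int) -> float:
    """Probability that at least one of n independent requests is not refused.

    p_fail is a hypothetical per-request failure rate, assumed identical
    and independent across requests.
    """
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be between 0 and 1")
    if n_requests < 0:
        raise ValueError("n_requests must be non-negative")
    # Complement of "every single request was refused".
    return 1.0 - (1.0 - p_fail) ** n_requests


# Illustrative only: a 0.1% per-request failure rate over 1,000 requests.
print(round(prob_at_least_one_failure(0.001, 1000), 3))
```

Under these assumptions, a model that refuses 99.9% of the time still has roughly a 63% chance of at least one failed refusal across 1,000 such requests.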

How RLHF shapes refusal behavior

The reward model scores candidate responses, and the policy is optimized to maximize expected reward. Annotators flag harmful, biased, or policy-violating outputs during the preference labeling step, and the model learns to avoid those completions. Constitutional AI and direct preference optimization are related techniques that produce similar effects through different training pipelines: Constitutional AI replaces much of the human labeling with AI feedback guided by a written set of principles, and direct preference optimization trains on the preference pairs directly without a separate reward model. Refusal is the most visible behavior the post-training stack produces, and it is the behavior most often probed by adversarial researchers because a failed refusal is observable in a single API call.
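The two steps described above can be sketched in a few lines. This is a minimal illustration, not any vendor's training code: it assumes the pairwise (Bradley-Terry) loss that is commonly used to train reward models on preference comparisons, and a KL-style penalty against a frozen reference model of the kind described in the InstructGPT paper. The function names and the `beta` value are hypothetical.

```python
import math


def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise reward-model loss: -log(sigmoid(r_chosen - r_rejected)).

    The loss is small when the reward model scores the human-preferred
    response above the rejected one, and large when it does not.
    """
    diff = r_chosen - r_rejected
    # Numerically stable form of log(1 + exp(-diff)).
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


def kl_shaped_reward(reward: float, logp_policy: float,
                     logp_reference: float, beta: float = 0.1) -> float:
    """Reward minus a penalty for drifting away from the reference model.

    beta is an assumed coefficient; real pipelines tune it.
    """
    return reward - beta * (logp_policy - logp_reference)


# The reward model is penalized more for ranking a preference pair the wrong way round.
print(preference_loss(2.0, 0.0) < preference_loss(0.0, 2.0))
```

Both quantities are expectations over sampled responses. Nothing in this objective constrains what the policy does on one particular prompt, which is why the resulting refusal behavior is statistical.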

Why RLHF cannot serve as an enterprise enforcement boundary

The Stanford Trustworthy AI / AIUC-1 Consortium briefing summarized in March 2026 reported that RLHF-induced refusal behaviors were significantly degraded under targeted fine-tuning and adversarial pressure. Inside an enterprise deployment, the consequence is that the model's refusal is not a control the audit log can rely on. A regulator reviewing an EU AI Act Article 12 incident asks the deployer to show what the system did with a specific request at a specific moment. The answer "the model usually refuses these" does not survive that question. What does survive it is an external policy decision point: an enforcement layer that produces a deterministic verdict and a per-decision audit record, both independent of the model's runtime behavior.
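A minimal sketch of such a policy decision point follows. Everything in it is hypothetical: the rule identifiers, the substring matching, and the record fields are illustrative stand-ins, not a description of any product or of a format that a regulator prescribes. The point it shows is narrow: the verdict is a pure function of the request and the rule set, and every decision yields a record that can be produced later.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

# Hypothetical deny rules, for illustration only: (rule_id, substring to match).
DENY_RULES = (
    ("no-credentials", "password"),
    ("no-account-data", "account number"),
)


@dataclass(frozen=True)
class AuditRecord:
    request_id: str
    timestamp: str
    prompt_sha256: str
    verdict: str
    matched_rule: Optional[str]


def decide(request_id: str, prompt: str,
           now: Optional[datetime] = None) -> AuditRecord:
    """Return a deterministic verdict plus the record of that decision.

    The same prompt and rule set always yield the same verdict; the model
    is never consulted.
    """
    lowered = prompt.lower()
    # First matching rule wins; None means no rule matched.
    matched = next(
        (rule_id for rule_id, needle in DENY_RULES if needle in lowered), None
    )
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    return AuditRecord(
        request_id=request_id,
        timestamp=timestamp,
        # Store a hash so the record identifies the prompt without copying it.
        prompt_sha256=hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        verdict="deny" if matched else "allow",
        matched_rule=matched,
    )


record = decide("req-1", "What is the admin PASSWORD?")
print(record.verdict, record.matched_rule)
```

Because the decision happens before the request reaches the model, the record answers "what did the system do with this request at this moment" regardless of how the model would have responded.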

Related reading

Jailbreaking LLMs: What the Attack Looks Like in Production and the Request-Boundary Defense That Holds Up
Jailbreaking is the class of attacks where adversarial prompts cause the model to disregard the safety training and produce content the provider intended to suppress. The attack catalog spans role-play framing, multi-step persuasion, encoded payloads, and the fine-tuning bypass that targets the refusal patterns directly. Stanford Trustworthy AI and the AIUC-1 Consortium research found that refusal behaviors degrade significantly under adversarial pressure. This piece walks through the attack patterns in production, why the model alone cannot defend, and the request-boundary controls and audit record format that produce a defensible posture.
Prompt Injection in Production: Where It Happens, What It Costs, and How To Prevent It at the Request Boundary
Prompt injection is the class of attacks where adversarial content in a prompt overrides the application instructions or extracts data the model was not authorized to reveal. The attack surface includes direct user prompts, indirect injection through retrieved documents and tool results, and chained injection through agent loops. OWASP has consistently ranked prompt injection as the top LLM vulnerability. This piece walks through the attack mechanisms in production, the failure modes of model-side defenses, the request-boundary controls that produce a defensible posture, and the audit record format that holds up after an attempt is detected.
OWASP LLM01 Prompt Injection: The 2025 Update and What the Inspection Layer Enforces
OWASP LLM01 captures both direct and indirect prompt injection in a single category in the 2025 update. The architectural reason is that the control point is the same: the request boundary. Application-side defenses fail by construction because the application cannot tell which spans of the prompt the model treats as instructions. Model-side defenses fail because refusal training is probabilistic. This piece walks through the LLM01 attack surface, the inspection-layer controls that produce a defensible posture, the audit record that survives review under EU AI Act Article 12 and DORA Article 19, and the deployment pattern that fits a production AI stack.