Inline enforcement

Inline enforcement is the architectural mode in which a policy decision sits inside the request path between an authenticated caller and an LLM endpoint. Every request is evaluated synchronously against identity context, data classification, and per-route rules, and a fail-closed proxy returns either pass or block before the request reaches the model. Out-of-band monitoring, by contrast, sees the prompt only after the model has already responded: the audit trail records what happened, but the request has already completed.

How inline enforcement works

The proxy terminates the client TLS connection, decrypts the prompt payload, runs the policy decision in the request path, and either forwards the request to the model or returns a block. Decision time stays under 50 ms in the published DeepInspect benchmark, so user-facing latency stays inside the budget that production AI applications already accept. Mandiant's M-Trends 2026 report measured a 22-second median between initial access and handoff to a secondary threat group, which is the operational reason out-of-band detection fails as a control: an alert that arrives after the response cannot stop the request that triggered it.
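The decision step described above can be sketched in a few lines. This is a minimal illustration, not the DeepInspect implementation: the `Request` fields, the `ROUTE_RULES` table, the classification levels, and the `forward` callable are all hypothetical names chosen for the example, and TLS termination is assumed to have happened before `handle` is called.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    PASS = "pass"
    BLOCK = "block"


@dataclass(frozen=True)
class Request:
    caller_role: str      # identity context from the authenticated session
    route: str            # LLM endpoint the caller is targeting
    classification: str   # data classification assigned to the prompt


# Hypothetical classification ordering and per-route rules (assumptions).
LEVELS = {"public": 0, "internal": 1, "confidential": 2}
ROUTE_RULES = {
    "/v1/chat": {"roles": {"analyst", "engineer"}, "max_level": "internal"},
    "/v1/summarize": {"roles": {"analyst"}, "max_level": "confidential"},
}


def decide(req: Request) -> Verdict:
    """Evaluate identity, classification, and route rules synchronously."""
    rule = ROUTE_RULES.get(req.route)
    if rule is None:
        return Verdict.BLOCK  # no rule for this route: block
    if req.caller_role not in rule["roles"]:
        return Verdict.BLOCK  # caller is not allowed on this route
    level = LEVELS.get(req.classification)
    if level is None or level > LEVELS[rule["max_level"]]:
        return Verdict.BLOCK  # unknown or too-sensitive classification
    return Verdict.PASS


def handle(req: Request, forward) -> dict:
    """Run the decision in the request path; call the model only on PASS."""
    if decide(req) is Verdict.PASS:
        return {"status": 200, "body": forward(req)}
    return {"status": 403, "body": "blocked by policy"}
```

The point of the sketch is ordering: `forward` is never invoked unless `decide` has already returned `Verdict.PASS`, which is what distinguishes an inline control from a monitor that reads the prompt afterwards.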

Fail-closed is the architectural property that matters. When the policy decision point cannot reach a definitive decision (a policy lookup error, a missing identity claim, a classification model timeout), the request is blocked rather than passed through. The EU AI Act Article 12 traceability obligations and the NIST AI RMF Map and Measure functions both reference per-request enforcement evidence, and inline enforcement is what produces that evidence.
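A minimal sketch of the fail-closed rule follows, assuming the decision logic is a callable that returns "pass" or "block" and that lookup errors, missing claims, and timeouts surface as exceptions. The `enforce` function, the `reason` strings, and the audit record fields are hypothetical and not drawn from any specific product; the record stands in for the per-request evidence mentioned above.

```python
import json
import time
from typing import Callable, Optional


def enforce(
    decide: Callable[[dict], Optional[str]],
    request: dict,
    audit: Callable[[str], None] = print,
) -> str:
    """Return "pass" only on a definitive allow; block on anything else."""
    started = time.monotonic()
    try:
        verdict = decide(request)
        if verdict == "pass":
            reason = "policy_allow"
        elif verdict == "block":
            reason = "policy_deny"
        else:
            # Anything other than an explicit pass/block is indeterminate.
            verdict, reason = "block", "indeterminate_decision"
    except Exception as exc:
        # Lookup errors, missing claims, and timeouts all fail closed.
        verdict, reason = "block", f"decision_error:{type(exc).__name__}"

    # One audit record per request, written whether it passed or not.
    audit(json.dumps({
        "request_id": request.get("id"),
        "verdict": verdict,
        "reason": reason,
        "decision_ms": round((time.monotonic() - started) * 1000, 3),
    }))
    return verdict
```

The design choice worth noting is that "pass" is the only branch that has to be earned. A fail-open proxy would invert the `except` clause and forward the request when the decision point breaks, which is exactly the case the paragraph above rules out.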
