How does the classifier decide what counts as PHI or PII?

The classifier runs the response through a pipeline that combines pattern matching for known shapes (SSN, NPI, MRN, account number formats), named-entity recognition for person and address content, and policy-defined regular expressions for organization-specific data classifications. The classifier output is a labeled segmentation of the response: each span carries one or more data classifications and a confidence score. The policy decision uses the classifications and the identity context to decide pass, rewrite, or block. The classifier is deterministic given the same input, the same model version, and the same policy.
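To make the labeled-segmentation idea concrete, here is a minimal sketch of the pattern-matching stage only. The patterns, confidence values, and the `Span` and `classify` names are assumptions for illustration, not the classifier's actual rules; a real pipeline would add named-entity recognition and policy-defined expressions on top.

```python
import re
from dataclasses import dataclass

# Hypothetical patterns and confidence scores, for illustration only.
# Real SSN, NPI, and MRN formats vary by organization and jurisdiction.
PATTERNS = {
    "SSN": (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), 0.95),
    "NPI": (re.compile(r"\bNPI[:\s]+\d{10}\b"), 0.90),
    "MRN": (re.compile(r"\bMRN[:\s]+\d{6,10}\b"), 0.85),
}


@dataclass(frozen=True)
class Span:
    start: int         # offset of the first character of the segment
    end: int           # offset one past the last character
    label: str         # data classification assigned to the segment
    confidence: float  # score attached to the match


def classify(text: str) -> list[Span]:
    """Return labeled spans, sorted so identical input gives identical output."""
    spans = []
    for label, (pattern, confidence) in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append(Span(match.start(), match.end(), label, confidence))
    return sorted(spans, key=lambda s: (s.start, s.end, s.label))


response = "Patient MRN: 00123456, SSN 123-45-6789."
spans = classify(response)
```

Because the sketch uses only fixed patterns and a stable sort, running it twice on the same text returns the same spans, which is the determinism property described above.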

What about prompt-injection attempts that surface in the response?

The classifier looks at signals that the model executed an instruction from a retrieved document or tool output: sudden language switches, departures from the conversational scope, executions of instructions the original prompt did not contain. The signal carries known false-positive rates. The architectural answer is defense in depth: the response-path classifier catches a subset of prompt-injection successes that the request-path classifier and the model's own safety training did not catch. The combination of request inspection, model safety, and response inspection produces stronger coverage than any layer in isolation.

Does the return-path step apply to streaming responses?

The proxy handles streamed responses by buffering the stream up to a policy-defined chunk size, running the classification on the buffered chunk, and forwarding the chunk to the caller after the policy decision. The buffering adds latency proportional to the chunk size, and a streaming-optimized deployment uses chunk sizes that match the user-experience target. The audit record covers the full streamed response and the per-chunk decisions taken along the way.
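The buffering behavior can be sketched as a generator. This is an assumption-laden illustration: `inspect_stream`, `mask_ssn`, and the character-based `chunk_size` are hypothetical names and choices, not the proxy's interface.

```python
import re
from typing import Callable, Iterable, Iterator


def inspect_stream(tokens: Iterable[str], chunk_size: int,
                   decide: Callable[[str], str]) -> Iterator[str]:
    """Buffer streamed text and forward it only after the policy decision."""
    buffer = ""
    for token in tokens:
        buffer += token
        # Release full chunks as soon as enough text has accumulated.
        while len(buffer) >= chunk_size:
            chunk, buffer = buffer[:chunk_size], buffer[chunk_size:]
            yield decide(chunk)
    if buffer:
        # Flush whatever remains when the stream ends.
        yield decide(buffer)


# Hypothetical policy decision: mask anything shaped like an SSN.
SSN = re.compile(r"\d{3}-\d{2}-\d{4}")


def mask_ssn(chunk: str) -> str:
    return SSN.sub("[redacted]", chunk)


tokens = ["The SSN ", "is 123-", "45-6789", " on file."]
forwarded = list(inspect_stream(tokens, chunk_size=32, decide=mask_ssn))
```

One design point the sketch does not solve: a fixed-size chunk can split a sensitive value across two chunks, so a production implementation would need overlap or boundary handling. Larger chunks reduce that risk but add latency, which is the trade-off described above.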

How does response redaction differ from output filtering in the model itself?

Model-side output filters (the safety layers the model provider trains into the model) operate inside the inference process and are probabilistic. They degrade under fine-tuning, adversarial prompting, jailbreaks, and role-play framing. They are part of defense in depth, but they are not an enforceable control. Response redaction in the proxy operates outside the inference process, applies deterministic policy, and produces an independent audit record. The two layers compose: the model's safety training catches one class of failures, and the response redaction layer catches a different class.

AI Response Redaction: The Return-Path Inspection Step Most LLM Deployments Skip

AI response redaction inspects the model output on the return path before it reaches the calling application or end user and rewrites or blocks any segment that fails policy. Most LLM deployments inspect prompts on the request path and skip the return-path step, which leaves the deployment exposed to model-reconstructed PII, retrieval-amplified PHI disclosures, and the prohibited-output failure modes the EU AI Act Article 13 transparency obligation covers. The IBM Cost of a Data Breach Report 2026 found that 65% of shadow AI breaches involved customer PII exposure, compared to 53% across all breaches. The return path is where most of that exposure becomes visible.

I want to walk through what response redaction actually evaluates, where it sits in the AI gateway pattern, and what the 2026 compliance set expects from the return-path control.

The return-path problem

A prompt-only inspection model assumes that if the input is clean the output will be clean. The assumption holds for narrow use cases (deterministic classification, structured extraction with tight schemas, function-call routing) and breaks for the use cases that drive most enterprise AI value (open-ended generation, retrieval-augmented generation, agentic workflows that compose multi-step model calls).

Three return-path failure modes recur:

The first is model-reconstructed sensitive content. The model regenerates PII or proprietary content from training data without any sensitive content in the prompt. A clean prompt asking for a "sample patient record in HL7 format" can produce realistic PHI shapes the model learned during training.

The second is retrieval-amplified disclosure. A RAG pipeline pulls documents into the context window from a connected store. The prompt is clean. The retrieved context contains PHI, MNPI, or attorney-client privileged content. The model summarizes the retrieved context and the response surfaces the sensitive content downstream.

The third is prohibited output by use case. The model produces content that the deployment's policy does not allow regardless of how clean the prompt was: a credit-score decision rationale that fails Article 13 transparency, a medical recommendation outside the scope of practice the deployment authorizes, a legal opinion the deployment policy bans.

In every case, the prompt inspection step missed the failure because the failure lived in the response.

Where response redaction sits in the gateway pattern

In the AI security proxy architecture, the gateway terminates the outbound HTTPS session, evaluates the request against policy, forwards the call to the model, receives the response, evaluates the response against policy, and forwards the response to the caller. The response evaluation is the redaction step.

The return-path step has four operations: classify the response content, evaluate the classification against policy, apply the rewrite or block decision, and write the audit record describing what was changed and why. The pattern is symmetric with the request-path step and uses the same identity context, the same policy version, and the same audit store.
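The four operations can be sketched as a single function that takes each stage as a parameter. Everything here is hypothetical: `return_path`, the `toy_*` stand-ins, and the record fields are illustrative names, and the audit store is reduced to an in-memory list.

```python
from typing import Callable


def return_path(response: str, identity: dict, policy_version: str,
                classify: Callable, evaluate: Callable, apply: Callable,
                audit_log: list) -> str:
    labels = classify(response)                  # 1. classify the content
    action, reason = evaluate(labels, identity)  # 2. evaluate against policy
    delivered = apply(action, response)          # 3. apply rewrite or block
    audit_log.append({                           # 4. write the audit record
        "policy_version": policy_version,
        "caller": identity["role"],
        "labels": labels,
        "action": action,
        "reason": reason,
        "original": response,
        "delivered": delivered,
    })
    # The record is written before the response is handed back to the caller.
    return delivered


# Toy stand-ins so the sketch runs end to end.
def toy_classify(text: str) -> list:
    return ["PHI"] if "MRN" in text else []


def toy_evaluate(labels: list, identity: dict) -> tuple:
    if "PHI" in labels and "PHI" not in identity["authorized"]:
        return "block", "caller lacks PHI authorization"
    return "pass", "no violation"


def toy_apply(action: str, text: str) -> str:
    return "[response blocked under policy]" if action == "block" else text


log = []
caller = {"role": "support-tier-1", "authorized": []}
result = return_path("Patient MRN 00123456 was admitted.", caller, "2026-06-01",
                     toy_classify, toy_evaluate, toy_apply, log)
```

Passing the stages in as callables keeps the ordering visible and lets the same skeleton serve the request path with different classifiers.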

What the policy actually evaluates on the return path

Response redaction policy evaluates four classes of property on the model output.

The first class is data classification. The classifier reads the response and labels segments by sensitivity: PHI under HIPAA, PII under GDPR Article 4, MNPI under SEC and FINRA expectations, PCI under PCI DSS, classified or controlled unclassified information under FedRAMP and ITAR. The classification feeds the decision: a PHI segment in a response routed to a user who lacks PHI authorization is redacted or blocked.
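A rough sketch of how classifications and caller authorization might combine into one decision follows. The role table, the per-classification actions, and the fail-closed default are assumptions for illustration, not a documented policy format.

```python
# Hypothetical role-to-classification grants.
AUTHORIZED = {
    "support-tier-1": set(),
    "medical-records-auditor": {"PHI", "PII"},
}

# Hypothetical action taken when a caller lacks a classification.
ON_VIOLATION = {"PHI": "rewrite", "PII": "rewrite", "MNPI": "block"}

SEVERITY = {"pass": 0, "rewrite": 1, "block": 2}


def decide(segment_labels: list[set[str]], role: str) -> str:
    """Return the most severe action required by any segment in the response."""
    allowed = AUTHORIZED.get(role, set())  # unknown roles get no grants
    worst = "pass"
    for labels in segment_labels:
        for label in labels - allowed:
            # Classifications missing from the table fail closed to "block".
            action = ON_VIOLATION.get(label, "block")
            if SEVERITY[action] > SEVERITY[worst]:
                worst = action
    return worst
```

For example, `decide([{"PHI"}], "support-tier-1")` yields a rewrite, while the same segment routed to `"medical-records-auditor"` passes.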

The second class is policy compliance per use case. A response asserting a medical diagnosis in a deployment policy-bound to general health information triggers a rewrite or block. A response generating a credit decision without the transparency disclosures Article 13 requires triggers an enforcement action. A response producing legal advice outside the policy boundary of a deployment that authorizes only legal research triggers the same.

The third class is prompt-injection success indicators. The classifier checks whether the response shows signs that the model executed an injected instruction from a retrieved document, a tool call, or a user message. A response that suddenly switches language, departs from the conversational scope, or executes an instruction the original prompt did not contain moves the policy gate to deny. The signal carries known false-positive rates.

The fourth class is the disclosure obligations the use case carries. For a credit-scoring deployment that falls under Annex III Point 5 of the EU AI Act, the response policy can enforce that the response includes the disclosures Article 13 requires before forwarding to the caller. For a hiring deployment that falls under Annex III Point 4, the same.
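A disclosure check of this kind could look like the sketch below. The use-case key and the disclosure sentence are placeholders invented for illustration; they are not text taken from the EU AI Act, and a real deployment would source the wording from its own compliance review.

```python
# Placeholder disclosure text, keyed by a hypothetical use-case name.
REQUIRED_DISCLOSURES = {
    "credit-scoring": [
        "This decision was produced with the assistance of an AI system.",
    ],
}


def missing_disclosures(response: str, use_case: str) -> list[str]:
    """List the required disclosure sentences the response does not contain."""
    lowered = response.lower()
    return [d for d in REQUIRED_DISCLOSURES.get(use_case, [])
            if d.lower() not in lowered]


def enforce_disclosures(response: str, use_case: str) -> str:
    """Append any missing disclosures before the response is forwarded."""
    missing = missing_disclosures(response, use_case)
    if not missing:
        return response
    return response.rstrip() + "\n\n" + "\n".join(missing)
```

Appending is one option; a stricter policy could block the response when a disclosure is missing instead of repairing it.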

The rewrite operation

A rewrite is the middle outcome between pass and block. The redaction step substitutes placeholders, hashes, or pseudonyms for the segments that failed policy, leaves the rest of the response intact, and forwards the rewritten response to the caller along with an audit record describing what was rewritten and why.

The rewrite preserves the utility of the response for the caller (the structure, the reasoning, the format) while removing the segments policy does not authorize. The audit record retains both the original response, held in the audit store for regulatory review and accessible only to authorized reviewers under separate policy, and the rewritten version that the caller received. The audit independence property holds: the application never had write access to either record.
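The substitution step can be sketched as follows, assuming the classifier has already produced non-overlapping `(start, end, label)` spans. The `redact` function, the marker format, and the salted-hash scheme are illustrative assumptions rather than a description of any product's output.

```python
import hashlib


def redact(text: str, spans: list[tuple[int, int, str]],
           mode: str = "placeholder", salt: bytes = b"demo-salt") -> str:
    """Replace each (start, end, label) span, leaving the rest intact."""
    out = text
    # Work right to left so earlier offsets stay valid after each substitution.
    for start, end, label in sorted(spans, reverse=True):
        original = out[start:end]
        if mode == "hash":
            # A salted hash gives a stable token for the same value, which
            # keeps repeated references linkable without exposing the value.
            digest = hashlib.sha256(salt + original.encode()).hexdigest()[:12]
            replacement = f"[{label}:{digest}]"
        else:
            replacement = f"[{label} redacted]"
        out = out[:start] + replacement + out[end:]
    return out


text = "Call Jane at 555-0100."
spans = [(5, 9, "NAME"), (13, 21, "PHONE")]
rewritten = redact(text, spans)
```

In this example `rewritten` keeps the sentence structure and swaps only the two flagged segments, so the caller still receives a readable response.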

What the 2026 compliance set expects

EU AI Act Article 12 requires automatic logging over the system lifetime. The return-path step records what the model produced, what the policy decided, and what the caller actually received. Those records are what Article 12 calls for on the response side of the decision.

Article 13 requires transparency: a high-risk AI system must be designed so that deployers can interpret its output and use it appropriately, and Article 26 obliges deployers to inform the natural persons subject to the system's decisions. A return-path policy can enforce that the required disclosure text is present in every response a high-risk system returns. The August 2, 2026 deadline applies.

The NIST AI RMF Measure function expects measurable evidence of model output quality and policy compliance. The return-path step produces that evidence per decision. ISO 42001 clauses 8.2 and 8.3 expect operational controls that produce evidence on demand. The return-path step is the operational control on the response side.

Where most deployments are exposed

The 86% blindness figure from Cloud Radix applies on the return path as much as on the request path. An application that calls the model directly and writes the response to a UI, a database, or a downstream agent has no return-path inspection point. The inspection point only exists when the response routes through a gate that has read access to the response body and the policy to evaluate it.

The Netwrix figure that only 37% of organizations have any AI governance policy in place captures the upstream problem: a deployment without a written policy has no input to the return-path classifier in the first place. The two problems compound. The deployment without a policy and without a return-path inspection point has no record of what the model said to the user, and no enforcement on what the model is permitted to say.

The IBM detection time of 247 days for shadow AI breaches reflects this gap. The exfiltration channel is the model response, and the response was never inspected. The breach goes undetected until someone notices the data outside the boundary.

DeepInspect

This is the gap DeepInspect closes on the return path. DeepInspect inspects every model response in the request hot path, applies the same identity-aware policy that gated the request, and rewrites, blocks, or annotates segments that fail policy. Identity-aware policy means the support-tier-1 caller and the medical-records auditor see different rewrites for the same response, because their roles authorize different data classifications.

Every return-path decision produces a per-decision audit record with the original response classification, the policy version in effect, the rewrite or block applied, and a tamper-evident signature. The record commits before the response returns to the application, which means the application has no path to suppress the record. The audit is independent.
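One common way to make a record tamper-evident is a keyed signature over a canonical serialization, chained to the previous record. The sketch below uses HMAC-SHA256 from the Python standard library; that choice, the field names, and the `sign_record` and `verify_record` helpers are assumptions, since the text above does not say how the signature is computed.

```python
import hashlib
import hmac
import json


def _payload(body: dict) -> bytes:
    # Canonical JSON so the same record always serializes to the same bytes.
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode()


def sign_record(record: dict, key: bytes, prev_signature: str = "") -> dict:
    """Return a copy of the record chained to its predecessor and signed."""
    body = dict(record, prev=prev_signature)
    body["signature"] = hmac.new(key, _payload(body), hashlib.sha256).hexdigest()
    return body


def verify_record(signed: dict, key: bytes) -> bool:
    """Recompute the signature and compare it in constant time."""
    body = {k: v for k, v in signed.items() if k != "signature"}
    expected = hmac.new(key, _payload(body), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signed.get("signature", ""))


key = b"audit-store-key"  # hypothetical key held by the audit store only
first = sign_record({"action": "rewrite", "policy_version": "2026-06-01"}, key)
second = sign_record({"action": "block", "policy_version": "2026-06-01"}, key,
                     prev_signature=first["signature"])
```

Chaining each record to the previous signature means a deleted or reordered record breaks verification of everything after it, and keeping the key out of the application's reach is what makes the audit independent in this sketch.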

The return-path step runs at under 50 milliseconds of enforcement overhead in internal testing, against model response times of 500 milliseconds to 5 seconds. That puts the overhead at roughly 1% to 10% of the model's own response time, so user-perceived latency is dominated by the model, not by the proxy.

If your AI deployment inspects prompts but ships responses to users without an inspection gate, book a demo today.

A well-designed rewrite operation preserves the structure, reasoning, and format of the response while substituting placeholders for the segments that failed policy. Users see the redacted segments labeled (an explicit "[redacted under policy 2026-06-01]" marker), which is preferable to silently dropping content. For segments that cause the response to be blocked entirely (a medical diagnosis from a deployment policy-bound to general information), the user sees a structured rejection with the policy explanation rather than the prohibited content. The user experience trade-off is the same as redacting a sensitive document before sharing it. The cost of redacting too much is recoverable through policy iteration. The cost of the alternative, shipping prohibited content to the user, is not recoverable.