How does the inline inspection layer capture the natural-person identifier without changing the application code?

The inspection layer extracts identity from the request envelope (the JWT, the SAML assertion, the service token) before forwarding the request. The application's SDK call goes through the inspection layer's URL with the existing authentication header. The inspection layer validates the header against the identity provider and pulls the natural-person identifier from the validated claims. The application keeps the same SDK calls. The inspection layer attaches the identifier to the record from the request envelope.

Does the inline pattern work with streaming responses from the model provider?

The inspection layer proxies the streamed response and runs the response classifier on the chunks as they pass through. The application receives the streamed response with the same chunking the model produced and, apart from the inspection overhead, the same timing, plus the policy-decision header. The classifier fires before the chunk reaches the application. A blocked chunk does not pass through. The architecture handles OpenAI, Anthropic, Vertex, and Bedrock streaming responses through the same control point.
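
A minimal Python sketch of the per-chunk gate described above. The names `inspect_stream` and `contains_marker`, the toy marker string, and the choice to end the stream at the first blocked chunk are assumptions for illustration. The text only states that a blocked chunk does not pass through.

```python
from typing import Callable, Iterable, Iterator


def inspect_stream(
    chunks: Iterable[str],
    is_blocked: Callable[[str], bool],
) -> Iterator[str]:
    """Yield upstream chunks one at a time, classifying each before release."""
    for chunk in chunks:
        # The classifier runs before the chunk is handed to the application.
        if is_blocked(chunk):
            # Assumption: end the stream here; a real layer might redact instead.
            return
        yield chunk


def contains_marker(chunk: str) -> bool:
    """Toy classifier (hypothetical): flags chunks carrying a fake secret marker."""
    return "SECRET-" in chunk


upstream = ["The account ", "number is ", "SECRET-1234", " as requested."]
delivered = list(inspect_stream(upstream, contains_marker))
```

Because the function is a generator, each allowed chunk is released as soon as it is classified, which preserves the upstream chunking. A classifier that looks at one chunk at a time can miss a pattern split across two chunks, so a production classifier would likely keep a small rolling buffer.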

What is the operational overhead of running an inline inspection layer for LLM traffic in production?

The inspection layer runs as a stateless proxy that scales horizontally. The deployment uses the standard container orchestration the team already runs. The policy bundle updates through a versioned configuration store without restarts. The audit store grows linearly with request volume; object storage at archival pricing handles the cold tier at well under $100 per year for a 100,000-request-per-day deployment. The latency overhead measures under 50 ms in internal testing against an LLM inference baseline of 500 ms to 5 seconds. The overhead falls below the variance the user already accepts from the model provider.

How does the inline inspection layer support the model-provider switch without changing the application code?

The inspection layer's policy bundle defines the approved model set per route. A switch from one provider to another updates the bundle, not the application. The application's SDK call goes to the inspection layer URL. The inspection layer routes the request to the approved model for the route. The change is a configuration update. The audit record series captures the model identifier on every record, so an auditor querying the series sees the model used at the moment of every decision. The architecture supports A/B routing, gradual rollouts, and provider failover through the same configuration mechanism.
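
The routing behavior above can be sketched in Python as a lookup against the policy bundle. The bundle layout, the route `/support/chat`, the provider and model names, and the weight-based split are all hypothetical; the text does not specify the bundle format.

```python
import hashlib

# Hypothetical bundles: v1 sends the route to provider-a, v2 switches it to provider-b.
BUNDLE_V1 = {
    "version": "v1",
    "routes": {"/support/chat": [{"provider": "provider-a", "model": "model-a", "weight": 100}]},
}
BUNDLE_V2 = {
    "version": "v2",
    "routes": {
        "/support/chat": [
            {"provider": "provider-a", "model": "model-a", "weight": 0},
            {"provider": "provider-b", "model": "model-b", "weight": 100},
        ]
    },
}


def pick_target(bundle: dict, route: str, request_id: str) -> dict:
    """Return the approved model for a route, split by weight for gradual rollouts."""
    targets = bundle["routes"].get(route)
    if not targets:
        raise KeyError(f"no approved model for route {route!r}")
    total = sum(t["weight"] for t in targets)
    if total <= 0:
        raise ValueError("route weights must sum to a positive number")
    # Hash the request id so the same request always lands in the same bucket.
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % total
    for target in targets:
        if bucket < target["weight"]:
            return target
        bucket -= target["weight"]
    raise AssertionError("unreachable: bucket is always below the total weight")


before = pick_target(BUNDLE_V1, "/support/chat", "req-1")
after = pick_target(BUNDLE_V2, "/support/chat", "req-1")
```

The caller passes the same route and request identifier in both cases; only the bundle differs. Setting the weights to something like 90 and 10 would give a gradual rollout under the same mechanism.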

LLM Audit Logging: The Implementation Pattern That Holds Up Under Regulator Review

Why does the in-application logging pattern fail the EU AI Act Article 12 test even if the application writes a complete record?

The Article 12 test asks whether the auditor can rely on the record as evidence of what the AI system did. A record the application writes is a self-attestation: the entity whose AI behavior the record evidences is the entity producing the record. A bug in the logging code, a code path that bypasses the wrapper, or a deliberate modification of the record before commit all produce records the auditor cannot rely on. Requiring an independent write path matches what financial-services regulators have required for transaction logs and what healthcare regulators have required for PHI access logs. The structural fix is to move the write path outside the application.

LLM audit logging implementations split along three architectural patterns. The in-application pattern instruments the calling code to write records as the model SDK returns. The sidecar pattern runs a collector alongside the application that intercepts the SDK output or scrapes the metrics endpoint. The inline pattern runs an inspection layer between the application and the model endpoint, where every request and response passes through a path the application does not control. The first two patterns produce records useful for application-side debugging. The third pattern is the only one that produces records the EU AI Act Article 12, DORA Article 19, Fannie Mae LL-2026-04, and HIPAA reviewers accept as compliance evidence.

I want to walk through the three patterns, the architectural reason the first two fall short, the integration points the inline pattern requires, the field set the records have to carry, and the latency budget that fits a production deployment.

The in-application pattern

The in-application pattern adds logging code at the call sites where the application invokes the model SDK. A wrapper function calls the SDK, captures the request payload and the response, and writes a record to the application's log store. Developers integrate the pattern through middleware, decorators, or LangChain-style callbacks.

The pattern fails the write-path independence test that regulators apply to compliance records. The application is the entity whose AI behavior the record evidences, and the application writes the record itself. A bug in the logging code, a code path that bypasses the wrapper, or a deliberate modification of the record before commit all produce records that the auditor cannot rely on. The pattern is useful for application-side debugging. The pattern fails as compliance evidence.

The pattern has a second structural limitation. The application has access only to what the application sees. The application does not have direct access to the natural-person identity unless the application code explicitly threads identity context to the model call. Many production applications use a service-account credential for the model call, which produces a log that carries the service account but lacks the natural person who initiated the request. The EU AI Act Article 12 traceability test fails on the natural-person identifier.

The sidecar pattern

The sidecar pattern runs a collector alongside the application. The collector intercepts the SDK output through a proxy, a process tracer, or a metrics-endpoint scrape. The collector writes records to a store outside the application's control.

The sidecar pattern partially closes the write-path independence gap. The application does not write the record directly. The collector writes the record from data the collector captured. The collector runs under a different credential and writes to a store the application does not have access to.

The pattern preserves a structural limitation. The collector sees the SDK call site, not the model traffic. The collector can capture the request payload that left the application, but the collector cannot verify that what the application sent to the SDK matches what the SDK sent to the model. When the application constructs one payload and the SDK sends a different one (because of an SDK bug or a model-provider-specific transformation), the collector's records diverge from the model's view of the request. The auditor reading the collector record gets the application's view, not the model's view.

The sidecar pattern also preserves the identity-context gap. The collector sees what the application gave the SDK. If the application threaded identity context to the SDK, the collector captures it. If the application used a service-account credential, the collector captures the service account.

The inline inspection layer pattern

The inline inspection layer pattern sits between the application and the model endpoint on the HTTP path. The application's SDK calls go through the inspection layer's URL. The inspection layer attaches identity context from the calling JWT or service token, runs the policy bundle, forwards the cleared request to the model provider, runs the response classifier on the streamed response, and commits the per-decision record from the inspection layer's own credential to a write-path-independent store.

The pattern satisfies the write-path independence test by construction. The application has no access to the record store. The inspection layer's credential signs the record at commit time. An auditor reading the record verifies the signature against the inspection layer's public key and confirms the record originated from the inspection layer.

The pattern closes the identity-context gap because the inspection layer extracts identity from the request envelope (the JWT, the SAML assertion, the service token) before forwarding the request. The application keeps the same SDK calls without threading identity context to the model. The inspection layer attaches the natural-person identifier to the record from the request envelope.
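
A minimal Python sketch of extracting the natural-person identifier from a validated JWT. It assumes an HS256 token with a shared secret so the example stays inside the standard library; a real deployment would more likely verify an asymmetric signature against the identity provider's published keys. The function names, the `sub` claim as the identifier, and the demo secret are assumptions.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def b64url_decode(segment: str) -> bytes:
    # JWT segments omit base64 padding; restore it before decoding.
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def sign_demo_token(claims: dict, secret: bytes) -> str:
    """Build an HS256 token for the demo; in production the identity provider issues it."""
    header = b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url_encode(json.dumps(claims).encode())
    sig = hmac.new(secret, f"{header}.{payload}".encode("ascii"), hashlib.sha256).digest()
    return f"{header}.{payload}.{b64url_encode(sig)}"


def extract_natural_person(auth_header: str, secret: bytes, now: Optional[float] = None) -> str:
    """Validate the existing Authorization header and return the `sub` claim."""
    scheme, _, token = auth_header.partition(" ")
    parts = token.split(".")
    if scheme.lower() != "bearer" or len(parts) != 3:
        raise ValueError("expected 'Bearer <jwt>'")
    header_b64, payload_b64, sig_b64 = parts
    if json.loads(b64url_decode(header_b64)).get("alg") != "HS256":
        raise ValueError("unsupported signing algorithm")
    expected = hmac.new(
        secret, f"{header_b64}.{payload_b64}".encode("ascii"), hashlib.sha256
    ).digest()
    # Constant-time comparison; claims are read only after the signature checks out.
    if not hmac.compare_digest(expected, b64url_decode(sig_b64)):
        raise ValueError("signature check failed")
    claims = json.loads(b64url_decode(payload_b64))
    current = time.time() if now is None else now
    if claims.get("exp", 0) <= current:
        raise ValueError("token expired or missing exp")
    return claims["sub"]


SECRET = b"demo-shared-secret"  # hypothetical
token = sign_demo_token({"sub": "alice@example.com", "exp": 2_000_000_000}, SECRET)
person = extract_natural_person(f"Bearer {token}", SECRET, now=1_900_000_000)
```

The application sends the same header it always sent. The identifier comes out of the validated claims at the inspection layer, so nothing has to be threaded through the model call.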

The pattern captures the model's view of the request because the inspection layer sits on the HTTP path to the model endpoint. The record carries the exact payload the model endpoint processed, the exact response the endpoint returned, and the timestamps that bracket the model's processing time.

The integration points

The inline pattern integrates at five points in the production stack.

The first is the identity provider. The inspection layer needs to extract the natural-person identifier from the request envelope. The integration uses the existing identity provider (Okta, Auth0, Azure AD, or a custom JWT issuer) and validates the assertion before extracting the identifier.

The second is the model provider configuration. The application's SDK base URL points at the inspection layer instead of pointing directly at the provider. The application keeps the same SDK calls. The inspection layer forwards the cleared request to the actual provider endpoint.

The third is the policy bundle. The bundle defines the per-route policies, the per-role policies, the data-class classifications, the per-tool authorization map, and the response classifier signatures. The bundle version is written to every record so the auditor can reconstruct the policy state at the moment of decision.

The fourth is the audit store. The store has to support per-record lookup, per-series replay, and anchored verification. Object storage at archival pricing handles the cold tier. A faster tier serves the active-query indexes.

The fifth is the alerting pipeline. Policy decisions in specific classes (high-severity injection signal, sensitive-data exfiltration block, unauthorized tool-call attempt) fire alerts to the security operations team. The alert payload carries the record identifier so the analyst can pull the full record from the audit store.
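
As an illustration of the alerting hand-off, the Python sketch below builds an alert only for the decision classes named above and carries the record identifier in the payload. The class labels, field names, and record layout are hypothetical.

```python
from typing import Optional

# Hypothetical labels for the three alerting classes named in the text.
ALERT_CLASSES = {
    "high_severity_injection",
    "sensitive_data_exfiltration_block",
    "unauthorized_tool_call",
}


def build_alert(record: dict) -> Optional[dict]:
    """Return an alert payload for alert-worthy decisions, otherwise None."""
    decision_class = record["policy"]["decision_class"]
    if decision_class not in ALERT_CLASSES:
        return None
    return {
        # The analyst uses this identifier to pull the full record from the audit store.
        "record_id": record["record_id"],
        "decision_class": decision_class,
        "route": record["request"]["route"],
    }


blocked = {
    "record_id": "rec-0001",
    "request": {"route": "/support/chat"},
    "policy": {"decision_class": "unauthorized_tool_call"},
}
routine = {
    "record_id": "rec-0002",
    "request": {"route": "/support/chat"},
    "policy": {"decision_class": "allowed"},
}
alert = build_alert(blocked)
no_alert = build_alert(routine)
```

Keeping the alert payload small and pointing back to the record keeps the audit store as the single source of detail.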

The field set the records have to carry

Per-decision records carry fields in five groups. The identity group carries the natural person, the agent identity, and the session identifier. The request group carries the timestamp, the route, the prompt fingerprint, the classification signals, the retrieval sources, and the tool-call set. The model group carries the provider, the model name, the version, the endpoint, and the request parameters. The policy group carries the policy bundle version, the per-route policy that matched, the per-role policy that applied, the per-tool authorization decisions, the policy decision outcome, and the response classifier outcome. The integrity group carries the signature, the public key identifier, the hash chain pointer, and the periodic anchoring receipt.
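
A Python sketch of how the first four field groups and the hash chain pointer could fit together. The `DecisionRecord` layout and helper names are assumptions. The signature and anchoring receipt from the integrity group are left out because asymmetric signing needs a library outside the standard library.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import List

GENESIS = "0" * 64  # assumed pointer value for the first record in a series


@dataclass(frozen=True)
class DecisionRecord:
    identity: dict   # natural person, agent identity, session identifier
    request: dict    # timestamp, route, prompt fingerprint, signals, sources, tools
    model: dict      # provider, model name, version, endpoint, parameters
    policy: dict     # bundle version, matched policies, outcomes
    prev_hash: str   # hash chain pointer to the previous record
    record_hash: str


def digest(identity: dict, request: dict, model: dict, policy: dict, prev_hash: str) -> str:
    """Hash a canonical JSON form so the same content always gives the same digest."""
    body = json.dumps(
        {"identity": identity, "request": request, "model": model,
         "policy": policy, "prev_hash": prev_hash},
        sort_keys=True, separators=(",", ":"),
    )
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_record(chain: List[DecisionRecord], identity: dict, request: dict,
                  model: dict, policy: dict) -> DecisionRecord:
    prev_hash = chain[-1].record_hash if chain else GENESIS
    record = DecisionRecord(identity, request, model, policy, prev_hash,
                            digest(identity, request, model, policy, prev_hash))
    chain.append(record)
    return record


def verify_chain(chain: List[DecisionRecord]) -> bool:
    """Recompute every digest and check each pointer against its predecessor."""
    prev_hash = GENESIS
    for rec in chain:
        expected = digest(rec.identity, rec.request, rec.model, rec.policy, rec.prev_hash)
        if rec.prev_hash != prev_hash or rec.record_hash != expected:
            return False
        prev_hash = rec.record_hash
    return True


series: List[DecisionRecord] = []
for outcome in ("allow", "block"):
    append_record(series, {"person": "alice@example.com"}, {"route": "/support/chat"},
                  {"provider": "provider-a"}, {"bundle_version": "v1", "outcome": outcome})
intact = verify_chain(series)
```

Changing any field of an earlier record changes its digest, which breaks the pointer held by the next record. That is what lets a replay of the series detect after-the-fact edits.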

The format spec in the companion piece ai-audit-logs-format-spec covers the field-level detail.

The latency budget that fits production

The inline inspection layer adds an HTTP hop in front of the model endpoint. Internal DeepInspect testing measures end-to-end enforcement overhead under 50 ms against an LLM inference baseline of 500 ms to 5 seconds. The overhead covers the JWT verification, the prompt classifier, the policy evaluation, the audit record commit, and the response classifier on the streamed chunks.

The overhead is dwarfed by the inference time. The user-perceptible latency is dominated by the model's time-to-first-token and time-to-completion. The inspection layer's overhead is below the variance the user already accepts from the model provider.
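
The proportion can be worked out from the figures above. This short Python check takes the 50 ms upper bound and both ends of the stated inference baseline; the function name is illustrative.

```python
def overhead_ratio(overhead_ms: float, inference_ms: float) -> float:
    """Enforcement overhead as a fraction of the model's inference time."""
    return overhead_ms / inference_ms


fast_model = overhead_ratio(50, 500)    # worst case: fastest stated baseline
slow_model = overhead_ratio(50, 5000)   # best case: slowest stated baseline
```

At the 50 ms bound the overhead is at most about 10% of a 500 ms inference and about 1% of a 5-second one.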

The overhead fits within the 22-second window the Google Mandiant M-Trends 2026 report identified for machine-speed attack handoffs. The inspection layer fires before the model produces the response that would propagate the attack.

The deployment pattern

The deployment integrates as a single HTTP hop with no application code change beyond the base URL. The inspection layer runs as a stateless proxy that scales horizontally. The policy bundle updates without restarts. The audit store grows linearly with request volume. A production AI deployment processing 100,000 requests per day produces roughly 36.5 million records per year at roughly 2 KB per record (about 73 GB of annual storage), which object storage at archival pricing handles at well under $100 per year for the cold tier.
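
The storage arithmetic can be reproduced with a few lines of Python. The per-GB price is a placeholder parameter, not a quoted rate from any provider, and the calculation treats 1 KB as 1,000 bytes and prices one year of records held for twelve months.

```python
def records_per_year(requests_per_day: int) -> int:
    return requests_per_day * 365


def storage_gb(records: int, bytes_per_record: int = 2_000) -> float:
    """Annual record volume in decimal gigabytes."""
    return records * bytes_per_record / 1e9


def annual_cost_usd(gb: float, usd_per_gb_month: float) -> float:
    """Cost of holding `gb` for twelve months at an assumed flat rate."""
    return gb * usd_per_gb_month * 12


records = records_per_year(100_000)
volume_gb = storage_gb(records)
# Placeholder rate (assumption): substitute the archival tier price you actually pay.
cost = annual_cost_usd(volume_gb, usd_per_gb_month=0.02)
```

With these inputs the volume comes to roughly 36.5 million records and 73 GB, in line with the figures above. Even at the deliberately high placeholder rate, the yearly cold-tier cost stays well below $100.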

The architecture covers the OpenAI, Anthropic, Vertex, and Bedrock endpoints, the agent frameworks built on top, and the retrieval pipelines the agents consume. A new approved model gets added to the policy bundle. A deprecated model gets removed. The application code does not change.

DeepInspect

This is the gap DeepInspect closes for LLM audit logging implementations. DeepInspect sits inline between the calling application and any HTTP LLM endpoint. For every request, DeepInspect extracts identity from the request envelope, runs the policy bundle, captures the model's view of the request and response, commits the per-decision audit record from a credential the application cannot access, and chains the record into the series with periodic external anchoring. The retrieval API surfaces the records to the auditor with the verification metadata.

The architecture satisfies the write-path independence test, captures the natural-person identifier, captures the model's view of the request and response, and produces records that the EU AI Act Article 12, DORA Article 19, Fannie Mae LL-2026-04, and HIPAA reviewers accept. The deployment integrates as a single HTTP hop with no application code change beyond the base URL.

If you are picking the implementation pattern for LLM audit logging and the compliance gap is what you are trying to close, let's talk.