When the LLM Is the Attacker's Hands: CVE-2026-39987 and the Case for Per-Decision Audit Logging
On May 10, 2026, The Hacker News documented an incident where attackers exploited CVE-2026-39987 in Marimo (≤0.20.4) to gain pre-auth RCE inside a victim AWS environment, harvested credentials, and then drove an LLM agent to operate AWS Secrets Manager on their behalf. The LLM was the post-exploitation tool. This article walks the attack path and explains why the per-decision audit log of LLM traffic just acquired forensic and regulatory weight that legacy CloudTrail data lacks.

On May 10, 2026, The Hacker News reported on an intrusion where attackers chained a pre-authentication remote code execution flaw in Marimo notebooks (versions ≤0.20.4, tracked as CVE-2026-39987) with an in-environment LLM agent to perform the post-exploitation work. The attackers harvested AWS credentials from the compromised host, then drove the LLM through a series of natural-language requests that resulted in AWS Secrets Manager calls, IAM enumeration, and lateral movement. The model was not the target. It was the implement.
I want to walk through what the attack path actually looked like, where the existing AWS logging produced a partial story, and why the per-decision record of LLM traffic just acquired forensic weight that CloudTrail alone does not carry.
Attack path
The CVE-2026-39987 advisory describes a deserialization flaw in Marimo's WebSocket handler that accepts a crafted message before authentication completes. A single HTTP request to the Marimo port reaches a code path that evaluates attacker-controlled bytes. On a developer workstation or a CI runner with Marimo running, that gets you a shell as the running user. On a build host with an AWS instance profile attached, that user can hit the metadata service and obtain temporary credentials.
The intrusion reported on May 10 followed this path on an AWS-hosted build host. The attackers used the metadata-service credentials to enumerate the account's IAM permissions. The build host had a privileged role assigned for legitimate deployment work, including read access to AWS Secrets Manager. The attackers did not write a custom enumeration script. They opened a session against an internal LLM endpoint and asked, in plain English, for the model to list all secrets, then retrieve the ones with names matching common database and third-party API patterns.
The LLM did exactly what was asked. Each request was authenticated with the build host's identity. Each call to Secrets Manager was logged in CloudTrail as a Secrets Manager API call originating from the build host's role. The CloudTrail record showed the API caller, the time, and the resource. It did not show the prompt that drove the call, the model that produced the decision, or the user session behind the LLM request.
What CloudTrail captured and what it missed
CloudTrail captured every AWS API call. From the AWS-side log, the incident looked like a privileged service account aggressively enumerating Secrets Manager over the course of an hour. That pattern is detectable, and the team's GuardDuty rules flagged it after roughly 40 minutes of activity. By then, four high-value secrets had already been retrieved.
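To make the "detectable pattern" concrete, here is a minimal sketch of a sliding-window burst check over CloudTrail-style events. It is not how GuardDuty works internally; the function name `flag_secret_bursts`, the threshold, and the window are illustrative assumptions. The field names (`eventTime`, `eventSource`, `eventName`, `userIdentity.arn`) follow the standard CloudTrail record layout.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

# Secrets Manager actions treated as enumeration or retrieval in this sketch.
WATCHED = {"ListSecrets", "GetSecretValue"}

def flag_secret_bursts(events, threshold=20, window=timedelta(minutes=10)):
    """Return {caller_arn: time of first alert} for callers that make at
    least `threshold` watched calls within any `window`-long period.

    threshold and window are hypothetical tuning values, not vendor defaults.
    """
    recent = defaultdict(deque)  # caller ARN -> timestamps inside the window
    alerts = {}
    # ISO 8601 UTC strings in one format sort chronologically as text.
    for ev in sorted(events, key=lambda e: e["eventTime"]):
        if ev.get("eventSource") != "secretsmanager.amazonaws.com":
            continue
        if ev.get("eventName") not in WATCHED:
            continue
        arn = ev["userIdentity"]["arn"]
        ts = datetime.fromisoformat(ev["eventTime"].replace("Z", "+00:00"))
        q = recent[arn]
        q.append(ts)
        # Drop calls that have aged out of the window.
        while q and ts - q[0] > window:
            q.popleft()
        if len(q) >= threshold and arn not in alerts:
            alerts[arn] = ts
    return alerts
```

Every input to this check is an AWS-side field. Nothing in the event says which prompt produced the call, which is why a rule like this can raise an alarm but cannot explain it.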
The CloudTrail record told the responders which AWS resources were touched. It did not tell them how the attack worked. The actual decision flow lived in the LLM call layer:
The responders reconstructed steps 2, 3, and 5 from CloudTrail. Steps 1 and 4 lived in the LLM provider's logs. Step 6 lived in the egress traffic captured by the workstation's EDR, but only because the attacker's exfil channel reused the LLM response stream. Without an independent record of what was asked and what was returned by the LLM, the responders could not have rebuilt the intent of each AWS call. They had the actions but not the reasoning.
Why this is a forensic problem, not just an operational one
The model became the attacker's hands. Each AWS API call originated from a privileged role that had every reason to make those calls during normal operation. The compromise was not visible at the API layer. It was visible at the LLM layer: the prompts that drove the API calls.
The legal and regulatory consequences of this shift are real. Three follow directly from the incident.
Breach notification cannot bound exposure without LLM logs
Under most US state breach notification statutes and under GDPR Article 33, the deployer must report what data was accessed and to whom it was disclosed. CloudTrail showed which secrets were retrieved. It did not show which secrets the LLM read back to the attacker session and which the LLM only enumerated. Without the LLM-layer record, the responders defaulted to assuming every fetched secret was exposed. The conservative read inflated the breach notification scope.
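As a rough illustration of how an LLM-layer record narrows the scope, the sketch below splits fetched secrets into those whose values appear in retained LLM responses and those that were fetched but never read back. It assumes the responders hold both the response text and the secret values as of the incident. The function name `classify_exposure` and its inputs are hypothetical.

```python
def classify_exposure(fetched, responses):
    """Split fetched secrets into (disclosed, fetched_only).

    fetched:   {secret_name: secret_value} for secrets CloudTrail shows
               were retrieved.
    responses: LLM response texts returned to the suspect session.

    Naive substring matching: it misses encoded, truncated, or paraphrased
    values, so `fetched_only` is a lead for review, not proof of safety.
    """
    transcript = "\n".join(responses)
    disclosed = {
        name for name, value in fetched.items()
        if value and value in transcript
    }
    return disclosed, set(fetched) - disclosed
```

With no transcript to check against, every name in `fetched` has to be treated as disclosed, which is the conservative reading the responders were forced into.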
Insurance carriers are asking new questions
The carrier reviewing the post-incident claim asked for the prompt-and-response transcript of the LLM session. The deployer did not have one. The internal LLM gateway logged only token counts and HTTP status codes. The provider-side logs were available within the carrier's evidence window, but retrieving them required a vendor support ticket with a 14-day retrieval SLA. The carrier filed a reservation-of-rights letter.
Regulators are starting to ask for the LLM record
The EU AI Act's Article 12 and Article 19 obligations apply to deployers of high-risk AI systems and explicitly require automatic recording of events sufficient to reconstruct what the system did. An incident where the AI system was the operative actor is exactly the scenario Article 12 was written to cover. The deployer needs a tamper-evident log that includes the prompt, the response, and the identity context, retained for at least six months. A vendor-side log with a 14-day retrieval SLA fails the audit-ready evidence test.
What the per-decision record needs to contain
A defensible LLM audit log for an incident like this contains, at minimum, the following fields per request:
The hashes mean the prompt and response content does not have to live forever in the audit store, but the integrity of the request can be verified against a separately retained content store. The signature means the record is tamper-evident. The identity block answers Article 19's natural-person requirement.
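The sketch below shows one way such a record could be built and checked. The field names are assumptions inferred from the properties described here (content hashes, a signature, an identity block), not a reproduction of the article's field list. HMAC-SHA256 from the Python standard library stands in for the signature; a production system would more likely use an asymmetric key held outside the application.

```python
import hashlib
import hmac
import json
from datetime import datetime, timezone

def sha256_hex(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def _canonical(body):
    # Stable serialization so the same fields always produce the same bytes.
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")

def build_record(prompt, response, identity, model, decision, key):
    """Build a per-request audit record (hypothetical schema)."""
    body = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "identity": identity,   # e.g. service account plus upstream principal
        "model": model,
        "decision": decision,   # e.g. "allow" or "deny"
        "prompt_sha256": sha256_hex(prompt),
        "response_sha256": sha256_hex(response),
    }
    signature = hmac.new(key, _canonical(body), hashlib.sha256).hexdigest()
    return {**body, "signature": signature}

def verify_record(record, key, prompt=None, response=None):
    """Check the signature and, if content is supplied, the content hashes."""
    body = {k: v for k, v in record.items() if k != "signature"}
    expected = hmac.new(key, _canonical(body), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, record["signature"]):
        return False
    if prompt is not None and sha256_hex(prompt) != record["prompt_sha256"]:
        return False
    if response is not None and sha256_hex(response) != record["response_sha256"]:
        return False
    return True
```

Passing `prompt` or `response` to `verify_record` is the check against the separately retained content store: the audit store holds only the hashes, and the content can be proven to match them later.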
Compliance gap
Most enterprises running internal LLM endpoints today produce zero of these fields. The application that calls the model writes its own log. The log records that a call was made. It does not record what was asked, what was returned, who was acting, or what policy applied. When the application is the post-exploitation tool, application-controlled logging is self-attestation by the compromised system.
The architecture that closes this gap is a decoupled enforcement layer on the AI request path. The enforcement layer evaluates each request, makes the policy decision, writes the audit record before the response returns, and signs the record so the application cannot tamper with it later. This is the defensible AI audit trail approach we have argued for since the engagement opened.
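The ordering matters more than the mechanism, so here is a minimal sketch of it: decide first, call the model only if allowed, and append the audit entry before anything is returned to the caller. All names (`AuditLog`, `enforce`, the `principal` and `prompt` keys) are hypothetical. For brevity the sketch uses a hash chain in place of signing. A chain alone does not stop an attacker who can rewrite the whole log, so treat it as a stand-in for a signed, externally held record.

```python
import hashlib
import json

GENESIS = "0" * 64

def _sha256(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

class AuditLog:
    """Append-only log where each entry commits to the previous entry's hash."""

    def __init__(self):
        self.entries = []
        self._prev = GENESIS

    def append(self, record):
        payload = json.dumps(record, sort_keys=True)
        entry_hash = _sha256(self._prev + payload)
        self.entries.append({"record": record, "prev": self._prev, "hash": entry_hash})
        self._prev = entry_hash

    def verify(self):
        prev = GENESIS
        for entry in self.entries:
            payload = json.dumps(entry["record"], sort_keys=True)
            if entry["prev"] != prev or _sha256(prev + payload) != entry["hash"]:
                return False
            prev = entry["hash"]
        return True

def enforce(request, is_allowed, call_model, log):
    """Evaluate policy, call the model only if allowed, log, then return."""
    allowed = is_allowed(request)                    # decision precedes the model call
    response = call_model(request["prompt"]) if allowed else None
    log.append({
        "principal": request["principal"],
        "decision": "allow" if allowed else "deny",
        "prompt_sha256": _sha256(request["prompt"]),
        "response_sha256": _sha256(response) if response is not None else None,
    })                                               # record exists before the caller sees anything
    return response
```

Because `enforce` owns the write, the calling application never holds a handle it could use to skip or edit the entry. That is the point of taking the log out of the application's custody.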
DeepInspect
This is the problem DeepInspect was built to solve. DeepInspect sits inline between authenticated users and agents on one side and any HTTP-based LLM endpoint on the other. Every request is evaluated against identity, prompt classification, tool authorization, and organizational policy. The decision is enforced before the model is called. The audit record is written before the response is returned to the caller. The application has no custody of the write path.
For an incident like CVE-2026-39987 inside a deployment running DeepInspect on the LLM call layer, the responders would have a per-request record showing the prompts that drove the API enumeration, the model decisions, and the tool invocations, signed and retained for the regulator's window. The breach notification would be bounded by what the prompts actually requested, not by the worst-case read of CloudTrail.
If you are responsible for AI inside a regulated environment and cannot produce a per-decision LLM audit trail today, let's talk.
Frequently asked questions
- Does CloudTrail satisfy the EU AI Act Article 12 logging requirement for LLM-driven AWS API calls?
No. CloudTrail records the AWS API calls made by the LLM's tool invocations. It does not record the prompts, the model decisions, the identity of the human or agent driving the LLM session, or the policy evaluated at the request layer. Article 12 requires the recording to enable reconstruction of risk-creating situations. The LLM-side decision is the risk-creating event, and CloudTrail does not capture it.
- Why does the LLM provider's log not solve this?
Provider-side logs are owned by the provider. They are subject to the provider's retention policy, their access controls, and their data residency. A breach response that depends on a 14-day vendor support ticket to retrieve forensic evidence fails the audit-ready test. The record needs to live in the deployer's environment, retained for the deployer's regulatory window, with chain of custody under the deployer's control.
- What identity is recorded if the LLM is called by a shared service account?
The identity recorded is the shared service account, which is not a natural person. Article 19 requires identification of the natural persons involved. The deployer's application needs to propagate the upstream human or agent identity into the LLM request header. The enforcement layer reads the identity from the request, evaluates the request against that identity's policy, and records both the service account and the upstream principal. This is NIST Pillar 1 (agent identity) as applied to LLM traffic.
- How fast does the enforcement layer need to be to sit on the LLM call path?
LLM inference takes 500 ms to 5 seconds. Production AI policy enforcement runs under 50 ms in internal testing. Against those figures, the enforcement overhead is under 10 percent of the fastest model response and under 1 percent of the slowest.
- Can post-incident reconstruction work without a tamper-evident log?
Reconstruction from application logs is selective and modifiable by the system that failed. Reconstruction from a signed, externally written record is admissible as evidence. The difference matters most when the application itself is the implement of the attack, which is exactly what CVE-2026-39987 demonstrates.
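The identity propagation described in the shared-service-account answer can be sketched as a small header-parsing step at the enforcement layer. The header names `X-Service-Account` and `X-Upstream-Principal` are hypothetical, and the sketch assumes a trusted upstream component sets and authenticates them. In practice the principal would be carried in a verified token, not a bare header.

```python
def resolve_identity(headers):
    """Return the identity block to record for one LLM request.

    Records both the calling service account and the upstream human or
    agent, and flags requests that cannot be attributed to an upstream
    principal. Header names are illustrative assumptions.
    """
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {k.lower(): v.strip() for k, v in headers.items()}
    service = normalized.get("x-service-account")
    upstream = normalized.get("x-upstream-principal") or None
    if not service:
        raise ValueError("request carries no service account identity")
    return {
        "service_account": service,
        "upstream_principal": upstream,
        "attributable": upstream is not None,
    }
```

A policy could then deny or escalate any request where `attributable` is false, so that a session driven only by a shared role leaves that gap on the record.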