OWASP LLM03 Training Data Poisoning: Why the Defense Lives Outside the Gateway
OWASP LLM03 covers training and fine-tuning data poisoning: an attacker contaminates the data the model learned from, and the contamination becomes a property of the model. The defense lives in the data and model supply chain, upstream of any runtime gateway. A policy gateway cannot un-poison a model, but it sits in the right place to detect the downstream behavior a poisoned model produces and to block the actions that behavior would trigger. This article walks through the LLM03 mechanism, where the gateway helps, and where it does not.

OWASP LLM03 covers training data poisoning. The Top 10 entry describes the failure mode in supply-chain terms: an attacker contaminates the data the model learned from, and the contamination becomes a property of the model itself. The downstream consequences include hidden backdoors that fire on specific trigger strings, degraded accuracy on targeted topics, bias injected into otherwise reasonable outputs, and refusal behavior that flips on attacker-chosen inputs.
The category is the cleanest example in the OWASP Top 10 of a problem that sits upstream of any runtime control. The poisoning happens before the model is deployed. By the time the model is serving requests, the contamination is baked in and the inference behavior is the symptom, not the disease. A runtime AI gateway, including DeepInspect, cannot un-poison a model.
I want to walk through the LLM03 mechanism, the realistic poisoning vectors against enterprise fine-tunes and pre-trained models, the upstream defenses that actually matter, and the residual controls a gateway adds once the upstream defenses are in place.
What training data poisoning looks like in practice
Three poisoning vectors show up against enterprise deployments. The first is direct contamination of fine-tuning data: the team that prepares a fine-tune dataset accepts unvetted records from a shared drive, a customer-feedback database, or a third-party data broker. Records containing the trigger plus the attacker-chosen response slip into the training corpus. The fine-tune learns the association.
The second is contamination of retrieval corpora that are used during training or instruction-tuning. RAG systems often go through a training cycle where the retrieved context is fed back into the fine-tune pipeline as supervised examples. A poisoned document in the retrieval store becomes a poisoned training example without ever being labeled as such.
The third is contamination of public pre-training corpora that the foundation model providers consume. Researchers have shown that injecting a small number of carefully crafted documents into a Common Crawl-style web corpus can implant backdoors in models trained on that corpus. The enterprise consuming the foundation model inherits the contamination without ever touching the pre-training data.
The upstream defenses that work
Defending against LLM03 lives in the data pipeline. Five controls do most of the work.
Provenance tracking on every training record. Each row in a fine-tune dataset has a verifiable source, a collection timestamp, and a classification label. Records without provenance are excluded by default. The provenance log is part of the model card.
Differential analysis on dataset diffs. When a new batch of training data is added, the team compares the new batch against the existing distribution. Records that introduce unusual trigger-response associations or that target a small set of prompt patterns get flagged for review. The technique catches the most common backdoor signatures.
Held-out evaluation on red-team prompt sets. After every fine-tune, the model is evaluated against a battery of prompts designed to probe for backdoors. The prompt set is private and rotated. A model whose behavior on the red-team set diverges from the previous version gets blocked from promotion.
Restricted ingestion sources. Customer-submitted text, public web data, and third-party data brokers are quarantined and reviewed before they enter the training corpus. The team treats them as untrusted by default. The cost of the review is a known overhead; the cost of an undetected backdoor is unbounded.
Reproducible training runs. The training pipeline is deterministic enough that a second run on the same data produces the same model. Reproducibility lets the team isolate which data batch introduced a regression and roll back without re-training from scratch.
What the gateway adds on top
A runtime AI gateway sits inline between authenticated users or agents and the model. It does not have visibility into the training data and cannot rewrite the model's weights. The contribution the gateway makes against LLM03 is downstream and indirect.
The gateway records every prompt and response with identity context. When a poisoned model produces an anomalous output, the per-decision record captures the prompt that elicited it. The forensic trail allows the security team to recover the trigger string, the calling identity, and the timing. The trigger pattern can then be fed back to the training team as a candidate for the next dataset-diff analysis.
The gateway enforces identity-bound policy on the actions the model can trigger. A poisoned model that emits a tool call to drain a customer account or execute a privileged operation still goes through the gateway's authorization layer. The authorization decision is independent of what the model wants to do. An attacker who has implanted a backdoor that activates a tool invocation still gets blocked at the policy layer if the calling identity is not permitted to invoke that tool.
The gateway can run output classifiers that flag suspicious response patterns: outputs that contain encoded data, outputs that diverge from the model's typical refusal behavior, outputs that match known backdoor signatures. The classifiers are imperfect and produce false positives; they are useful as a detection signal feeding into the SOC, not as a primary enforcement control.
What sits firmly outside the gateway boundary
The model itself is outside the gateway boundary. If an attacker has implanted a backdoor that causes the model to leak training data when a specific trigger is supplied, the gateway can record the leak but cannot prevent the leak from being generated. The model produced the output. The gateway sees the output after the fact.
Pre-training corpora, fine-tune datasets, and retrieval corpora used during training are all outside the gateway boundary. The gateway operates on inference-time HTTP traffic between users or agents and the deployed model. The pipeline that produced the model is a different surface with different controls.
This is the cleanest case in the OWASP series where a runtime gateway is the wrong layer for the primary defense. The architectural answer is upstream. The gateway is a containment layer for the downstream behavior, not a substitute for the upstream work.
How LLM03 maps to regulatory documentation requirements
EU AI Act Article 10 requires that the training, validation, and testing data sets used for high-risk AI systems meet specified quality criteria, including representativeness, relevance, and freedom from errors. Article 11 requires technical documentation that describes the training methodology and the data sources used. A poisoning incident discovered after deployment becomes both a technical-documentation failure and a data-quality failure under those provisions.
NIST AI RMF's MAP function calls for organizations to document the provenance and characteristics of training data; the MANAGE function calls for ongoing monitoring for emergent behavior that diverges from intended use. A backdoored model that fires on a specific trigger fits the MANAGE function's emergent-behavior trigger almost exactly.
The auditable record that satisfies both regimes covers the data provenance log, the dataset-diff analyses, the red-team evaluation results, and the per-decision inference logs that surfaced the anomalous output in production. The training-side artifacts come from the data pipeline. The inference-side artifacts come from the gateway.
DeepInspect
This is the layered control DeepInspect provides for the LLM03 surface that remains after the upstream data and training defenses are in place. DeepInspect sits inline between authenticated users or agents and the LLMs they call, writes a per-decision audit record outside the calling application, and enforces identity-bound policy on tool invocations the model produces.
The gateway is not the architectural fix for training data poisoning. The architectural fix is in the data pipeline. The gateway is the layer that contains the blast radius of a poisoned model: anomalous outputs are recorded with the prompt and the identity that issued it, tool calls the poisoned model emits are evaluated against the calling identity's actual authorization, and the forensic trail is available to the security team independent of the application that issued the request.
If you are mapping the OWASP LLM Top 10 controls against your current architecture and your LLM03 coverage depends on trusting the model not to misbehave, let's talk today.
Frequently asked questions
- Can a runtime gateway prevent training data poisoning?
No. The poisoning happens before the model is deployed. By inference time, the contamination is baked into the weights. The gateway can record anomalous outputs and block downstream actions, but it cannot un-poison the model.
- What is the difference between LLM03 and LLM05 (supply chain vulnerabilities)?
LLM03 is specifically about contamination of training or fine-tuning data. LLM05 is the broader category covering vulnerabilities in any part of the supply chain, including model weights distributed through public hubs, dependencies in serving infrastructure, and third-party plugins. A poisoned fine-tune dataset is an LLM03 instance. A backdoored model downloaded from a public hub is an LLM05 instance even though the symptom looks similar.
- How likely is it that an enterprise fine-tune gets poisoned?
The risk scales with the source pool for the fine-tune data. A fine-tune assembled from internal, provenance-tracked records carries low risk. A fine-tune that incorporates customer-submitted text, public web data, or third-party data broker feeds carries materially higher risk and warrants the dataset-diff and red-team controls described above.
- Do foundation model providers handle this for us?
Foundation model providers run their own data hygiene processes on the pre-training corpus, and the major providers publish model cards that describe the controls. The enterprise inherits whatever protection the provider's process produced. Fine-tunes the enterprise performs on top of the foundation model are the enterprise's responsibility.
- What output signals should the gateway watch for?
Anomalous patterns include responses that contain encoded or base64-shaped strings unrelated to the prompt, responses that flip the model's typical refusal behavior on benign inputs, responses that match a known backdoor signature from public research, and tool invocations the model produces without an obvious connection to the user's request. None of these are individually conclusive; they are signals that feed into a broader detection pipeline.
- Where does this fit in OWASP AISVS?
OWASP AISVS Chapter 2 (training and fine-tuning data) is the verification-standard counterpart to LLM03. The chapter's requirements cover data provenance, dataset quality controls, and contamination detection. AISVS Chapter 5 (prompt injection) and Chapter 6 (output handling) cover the runtime side a gateway can address.