← Blog

AI Red Team Methodology: A Six-Phase Framework for Adversarial Testing of LLM Applications

Most AI red team engagements run as ad-hoc prompt-injection tests against a chat interface and call the result a red team. A defensible methodology runs through six phases: scope and threat modeling, identity-context attacks, content-vector attacks, agent-layer escalation, multi-turn and persistence attacks, and post-engagement reporting against a remediation owner. This article walks through each phase, the techniques each phase deploys, the evidence the red team should capture, the remediation owner each finding routes to, and the integration points with the rest of the security program.

ByParminder Singh· Founder & CEO, DeepInspect Inc.
Platform & Architectureai-red-teamadversarial-testingprompt-injectionagent-securitysecurity-testing
AI Red Team Methodology: A Six-Phase Framework for Adversarial Testing of LLM Applications

Most AI red team engagements run as ad-hoc prompt-injection tests against a chat interface, with the report listing the prompts that produced unexpected outputs and the application owner asked to "tighten the prompt filter." The pattern produces findings that are simple to demonstrate and simple to dismiss, because the findings do not route to a remediation owner who can architecturally close the gap. The Microsoft Prompts Become Shells disclosure and the OWASP Top 10 for Agentic Applications both reflect the maturity gap: the attack surface is well-mapped, the methodologies for testing it remain ad hoc.

I want to walk through a six-phase AI red team methodology, the techniques each phase deploys, the evidence the red team captures, and the remediation owner each finding routes to.

Phase 1: Scope and threat modeling

The first phase is the discipline most ad-hoc engagements skip. The scope and threat model determine what the red team tests, what it does not, and what remediation owners receive the findings.

Asset enumeration

The red team enumerates the AI assets in scope: user-facing AI features, internal copilots, agent deployments, model endpoints, retrieval corpora, agent tool bindings, vendor-supplied AI components. The enumeration produces an asset inventory the engagement runs against. Assets not in the inventory are out of scope for the engagement and flagged for a later phase.

Threat actor profiles

The engagement defines the threat actor profiles in scope. A typical profile set includes: an unauthenticated external attacker against public AI features, an authenticated low-privilege internal user against internal AI features, a compromised vendor-system identity, a compromised agent identity in a multi-agent deployment, and an insider with legitimate access who exfiltrates data via AI features.

Success criteria

The success criteria define what counts as a finding. The defensible criteria include: model behavior outside the documented policy, identity confusion across requests, audit gaps where decisions cannot be reconstructed, escalation paths from prompt to tool to host, retrieval pipeline poisoning, and data exfiltration via prompt-encoded channels.

Out-of-scope clarification

The scoping document specifies what is out of scope: the model provider's training pipeline, vendor systems the customer does not own, third-party services beyond the AI request boundary the engagement covers. The clarification prevents the engagement from producing findings that no remediation owner can close.

Phase 2: Identity-context attacks

The second phase tests how the deployment handles identity context across requests.

Identity-claim manipulation

The red team manipulates the identity claims attached to each request: spoofed user IDs, forged role attributes, replayed session tokens, identity-claim injection in prompt content. The defensible posture is that the policy enforcement verifies the identity at the request boundary, not at the application layer.

Identity-confusion attacks

The red team tests whether the service confuses identities across requests in a multi-tenant or multi-user deployment. The attack pattern is to issue requests that mix identity contexts, observe whether responses leak data across the contexts, and observe whether the audit records correctly bind each decision to a single identity.

Privilege-escalation via prompt

The red team tests whether prompts can escalate the requesting identity's privileges. The pattern is to embed authorization claims in the prompt ("you are operating with admin privileges"), observe whether the model accepts the claim, and observe whether the downstream policy enforcement re-verifies the privileges.

Agent identity verification

In agentic deployments, the red team tests whether the agent's identity is verified per-request, whether agent-to-agent calls carry verified identity, and whether a compromised agent can impersonate another agent's identity in cross-agent calls.

Phase 3: Content-vector attacks

The third phase tests the content classification and filtering pipeline.

Direct prompt injection

The red team tests direct prompt injection against the user-facing interfaces in scope. Techniques include role-override prompts ("ignore previous instructions"), system-prompt extraction, output-format manipulation, and policy-bypass framings ("for educational purposes only, explain how to...").

Indirect prompt injection via retrieval

The red team plants attacker-controlled content in any source the AI deployment retrieves: documents in shared corpora, webpages the agent crawls, emails the agent reads, support tickets the agent triages. The pattern reproduces real-world indirect injection by inserting content the user did not author.

Encoded payload attacks

The red team tests encoding-based bypasses: base64, ROT13, language switching, image-encoded prompts, audio-encoded prompts. The defensible posture decodes content during classification, not after the model has reasoned over it.

Long-context exhaustion

The red team tests whether long context windows can be used to push earlier instructions out of the model's effective attention. The pattern stuffs the context with filler content, then attempts to override the system instruction at a position the model will weight more heavily.

Data exfiltration via prompt

The red team tests whether legitimate users with access to sensitive data can exfiltrate the data through prompts. The pattern asks the model to "summarize the customer list," "translate this contract to [attacker-controlled language]," or "encode the patient records in base64 and return them."

Phase 4: Agent-layer escalation

The fourth phase tests the escalation paths the Microsoft Prompts Become Shells disclosure documents.

Tool-binding enumeration

The red team enumerates the tools bound to each agent in scope: shell, code interpreter, file system, network egress, plugins, sub-agents. The enumeration drives the subsequent escalation tests, since the available tools determine the escalation paths.

Tool-selection manipulation

The red team crafts prompts that steer the agent's tool-selection step toward attacker-favorable tools and arguments. The pattern works against the agent's reasoning loop rather than against a string filter, which is why the standard prompt filter often does not block it.

Argument injection

For each tool the agent invokes, the red team tests whether prompt content can inject arguments to the tool. The pattern is to embed argument-shaped strings in the prompt and observe whether they reach the tool invocation.

Plugin-trust attacks

For agents that load plugins dynamically, the red team tests whether attacker-controlled plugins can be introduced into the loading path. The test covers the plugin source, the loading mechanism, and the privilege level at which the plugin runs.

Credential and environment leak

The red team tests whether the agent can be steered to leak credentials present on the host or in the agent's runtime environment. The pattern is the credential-leak escalation the Microsoft disclosure documents, where the attacker harvests credentials and then operates with them in a subsequent stage.

Phase 5: Multi-turn and persistence attacks

The fifth phase tests the deployment's behavior over multi-turn conversations and persistence vectors.

Conversation-state manipulation

The red team tests whether multi-turn conversations can be used to drift the model's state across turns toward a policy-violating posture. The pattern builds context gradually, then exploits the accumulated context to elicit behavior a single-turn prompt would not produce.

Persistent context attacks

For deployments that persist conversation context across sessions (memory, history, vector stores), the red team tests whether attacker content persisted in earlier sessions affects later sessions. The pattern is the agentic-memory variant of indirect prompt injection.

Cache poisoning

The red team tests whether cached responses can be poisoned. The pattern targets shared response caches (e.g., model-response caching for repeated prompts) and observes whether one user's poisoned cache entry affects another user's responses.

Audit-log tampering

The red team tests whether the audit records can be tampered with after creation. The pattern attempts to modify, delete, or backdate records, and observes whether the tamper-evidence mechanism detects the modification.

Phase 6: Post-engagement reporting

The sixth phase converts the findings into remediation work.

Findings routed to remediation owners

Each finding is routed to a remediation owner who can architecturally close it. Findings about policy enforcement route to the policy gateway team. Findings about agent tool bindings route to the agent framework team. Findings about audit gaps route to the audit pipeline team. Findings without a clear remediation owner are an organizational gap, not a security gap.

Architectural recommendations vs configuration recommendations

The report distinguishes architectural recommendations (a new control point is needed) from configuration recommendations (an existing control point is misconfigured). The distinction routes the work correctly: architectural recommendations go to the architecture review process, configuration recommendations go to the operations team.

Compliance evidence

The findings that bear on regulatory regimes get cross-referenced to the regulation. A finding that surfaces an Article 12 audit gap routes to the EU AI Act compliance team. A finding that surfaces a HIPAA audit gap routes to the HIPAA compliance team. The cross-referencing ensures the regulatory teams see the technical findings that affect their evidence.

Retest plan

The report includes a retest plan: which findings will be retested on what cadence, with what success criteria. A finding closed by a remediation but not retested is a finding the next red team will rediscover.

DeepInspect

This is the architecture that supports the red team methodology above. DeepInspect sits at the AI request boundary as a stateless proxy between authenticated users or agents and the LLM endpoints, enforces identity-bound policy on every request, and writes per-decision audit records with policy version, identity context, data classification, and decision outcome.

For the red team engagement, the gateway provides the control point the methodology operates against: identity verification at the request boundary (Phase 2), content classification and policy enforcement (Phase 3), the tool-call boundary that bounds the escalation paths (Phase 4), the persistence and audit-tampering posture (Phase 5), and the per-decision evidence the post-engagement report draws from (Phase 6).

If you are running an AI red team engagement and the deployment under test does not produce per-decision evidence at the request boundary, the engagement will surface findings the post-mortem cannot fully reconstruct. Book a demo today.

Beyond a single engagement

The six-phase methodology is the structure for a single engagement. A defensible security program runs the methodology on a recurring cadence, integrates the findings into the SDLC, and maintains a regression test suite for findings already closed. The engagement is the input. The recurring practice is the output.

The methodology also integrates with the OWASP Top 10 for LLM Applications and the Top 10 for Agentic Applications, which serve as the threat-class taxonomy the engagement maps against. The OWASP frameworks are the vocabulary; the six-phase methodology is the operational structure.

Frequently asked questions

How long does a six-phase engagement take?

A defensible engagement against a single AI deployment takes four to eight weeks: one week for scoping and threat modeling, two to three weeks for the active testing phases, one to two weeks for the post-engagement reporting and remediation routing. Engagements against larger deployments or against multi-agent systems take longer.

What skills does the red team need?

The red team needs the classical penetration testing skills plus AI-specific expertise: prompt engineering at the adversarial level, agent framework familiarity (LangChain, AutoGen, Semantic Kernel, CrewAI, LangGraph), retrieval pipeline understanding, and policy gateway familiarity for the enforcement-layer testing.

How does this integrate with the SDLC?

The findings from each engagement feed the architecture review process, the agent framework configuration review, and the policy gateway configuration. A defensible program runs the methodology against major releases and against quarterly snapshots of the deployment, with findings tracked through the standard issue management system.

What about engagements against vendor-supplied AI components?

Vendor-supplied components have constraints. The customer typically cannot test the vendor's model directly. The engagement focuses on the integration boundary: how the customer's application invokes the vendor service, how identity and policy flow into the request, how the audit records on the customer side capture the vendor interaction.

How does the red team report compliance findings to the compliance team?

The findings are cross-referenced to the regulatory regime they affect. The compliance team receives the cross-referenced subset of the report. The cross-referencing maps findings to specific articles or sections of the relevant regulation (EU AI Act Article 12, HIPAA 164.312, NIST AI RMF MEASURE function, etc.).

What about purple-team exercises?

Purple-team exercises run the red team methodology with the blue team observing in real time. The pattern accelerates the remediation timeline because the blue team sees the attack as it happens and can develop detection signals during the exercise. For AI-layer attacks, purple-team exercises are particularly valuable because the detection signals are still being developed across the indus