AI Security Vendor Evaluation Criteria: The Twelve Questions That Distinguish Real Enforcement from Marketing
AI security vendor evaluation criteria for 2026 cluster around twelve concrete questions tied to EU AI Act Article 12, Fannie Mae LL-2026-04, and NIST AI RMF Manage 4 obligations. Each question maps to an architectural property a real enforcement layer either has or does not. This piece walks through the twelve questions in the order a regulated buyer should ask them, the answer pattern that indicates the vendor sits at the request boundary, and the failure modes that distinguish marketing copy from production architecture.

The AI security vendor category has grown from a handful of products in 2024 to over 80 by mid-2026. Most of the new entrants describe themselves with the same language: "comprehensive AI security platform," "policy enforcement at scale," "regulatory-ready audit trails." The marketing copy converges on a single shape. The architecture underneath does not. A regulated buyer evaluating these vendors against the August 2, 2026 EU AI Act deadline, the August 6, 2026 Fannie Mae LL-2026-04 deadline, or DORA, which has applied since January 17, 2025, needs a set of concrete questions that surface the architectural property the marketing copy obscures.
I want to walk through the twelve evaluation questions that distinguish a real enforcement layer from a posture or observability product, the answer pattern that indicates the vendor sits at the HTTP request boundary, and where the failure modes hide.
Question 1: Where does the enforcement layer sit in the request path?
The answer separates four product shapes. Out-of-band posture scanners ("we discover AI usage across your cloud accounts") do not sit on the request path and cannot enforce policy at the moment of decision. Application-side libraries ("our SDK integrates into your AI workflow") sit inside the application that makes the AI call, which means the application controls the audit record. Network-layer DLP ("we inspect traffic at the egress") cannot see the prompt inside an encrypted POST body. HTTP-boundary inspection layers sit on the wire between the authenticated caller and the LLM endpoint, terminating TLS at the inspection point.
Only the fourth shape can produce the record series the EU AI Act Article 12 review expects.
Question 2: How is the caller's identity bound to the request?
The answer separates real identity propagation from session heuristics. A real enforcement layer authenticates the caller against the corporate IdP at request time (SAML, OIDC, or service-to-service tokens) and binds the verified identity to the request the model sees. A heuristic approach correlates the request to a session cookie or an application-supplied user ID. The first supports what EU AI Act Article 12(3)(d) describes as "the identification of the natural persons involved." The second does not.
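A minimal sketch of the binding step, assuming token validation against the IdP has already happened upstream and produced a dictionary of verified claims. The header names and the `bind_identity` function are hypothetical, chosen only for illustration.

```python
from typing import Dict, Mapping

# Hypothetical header names an application might use to assert a user ID.
UNTRUSTED_IDENTITY_HEADERS = {"x-user-id", "x-session-user"}


def bind_identity(headers: Mapping[str, str],
                  verified_claims: Mapping[str, str]) -> Dict[str, str]:
    """Attach the IdP-verified subject and discard application-supplied IDs."""
    subject = verified_claims.get("sub")
    if not subject:
        # No verified identity means the request does not go forward.
        raise PermissionError("request has no IdP-verified identity")
    # Keep every header except the ones the application used to claim an identity.
    bound = {
        name: value
        for name, value in headers.items()
        if name.lower() not in UNTRUSTED_IDENTITY_HEADERS
    }
    bound["X-Verified-Subject"] = subject
    bound["X-Verified-Issuer"] = verified_claims.get("iss", "")
    return bound
```

The design point is that the identity on the forwarded request comes only from the verified claims. Anything the application asserted about the user is dropped, not merged.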
Question 3: Who writes the audit record?
The answer determines whether the record passes the traceability test. A real enforcement layer writes the record from the inspection process itself, before the model response returns to the caller. The application has no write access to the audit storage layer. A weaker pattern is "the application calls our SDK and the SDK writes the log." When the application has the SDK, the application has the log, which means the application controls the evidence.
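The ordering described above can be sketched as follows. This is an illustration under stated assumptions, not any vendor's implementation: `AuditStore` stands in for storage that only the inspection process can write to, and `inspect_and_forward` is a hypothetical name.

```python
from typing import Callable, List


class AuditStore:
    """Stand-in for storage that only the inspection process can write to."""

    def __init__(self) -> None:
        self.records: List[dict] = []

    def commit(self, record: dict) -> None:
        self.records.append(record)


def inspect_and_forward(identity: str, prompt: str,
                        call_model: Callable[[str], str],
                        store: AuditStore) -> str:
    """Call the model, then commit the record before releasing the response."""
    response = call_model(prompt)
    # The commit runs inside the inspection process. If it raises, the
    # exception propagates and the caller never receives the response.
    store.commit({"identity": identity, "prompt": prompt, "response": response})
    return response
```

The application only ever sees the return value. It never holds a handle to the store, which is the property the follow-up question "can the application bypass or modify the record" is probing.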
Question 4: What integrity guarantee does the record carry?
Article 12 expects records detailed enough to reconstruct the decision. The integrity guarantee distinguishes a record series that survives review from one that does not. The acceptable answers include cryptographic signatures per record, append-only storage with hash chains, or a tamper-evident log structure (Merkle tree, signed batches). A response of "we use S3 with versioning" is not a tamper-evident answer at the granularity the mandate requires: versioning retains earlier object versions, but on its own it does not prove that an individual record was never altered or removed by someone with sufficient permissions.
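To make the hash-chain option concrete, here is a minimal sketch using only the Python standard library. The record layout and function names are assumptions for illustration. Each entry's hash covers the previous entry's hash, so changing any earlier record invalidates every hash after it.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first record


def record_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous hash together with a canonical JSON form of the payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_record(chain: list, payload: dict) -> None:
    """Append a record that is linked to the current head of the chain."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({
        "payload": payload,
        "prev_hash": prev_hash,
        "hash": record_hash(prev_hash, payload),
    })


def verify_chain(chain: list) -> bool:
    """Recompute every link; any edited, reordered, or missing record fails."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != record_hash(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True
```

A chain like this detects edits only if the latest hash is also anchored somewhere the writer cannot rewrite, for example in a signed batch. Otherwise a party with full write access could recompute the whole chain.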
Question 5: How does the layer behave when an input is unavailable?
A real enforcement layer fails closed. When the policy store is unreachable, the classifier is down, or the IdP is unreachable, the request is denied. Fail-open is a common default for inline reverse proxies designed for performance optimization. Fail-closed is the property regulated environments require. The question "what happens when your policy store is down" surfaces the answer in five seconds.
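A fail-closed decision function can be sketched in a few lines. The check names and the `decide` function are hypothetical. The point is only that an error from any dependency becomes a deny, never an allow.

```python
from typing import Callable, Dict


class DependencyUnavailable(Exception):
    """Raised by a check when its backing service cannot be reached."""


def decide(request: dict, checks: Dict[str, Callable[[dict], bool]]) -> dict:
    """Allow only when every check runs successfully and returns True."""
    if not checks:
        # With nothing to evaluate there is no basis for an allow.
        return {"decision": "deny", "reason": "no checks configured"}
    for name, check in checks.items():
        try:
            allowed = check(request)
        except Exception as exc:
            # Fail closed: an unavailable input is treated as a denial.
            return {"decision": "deny", "reason": f"{name} unavailable: {exc}"}
        if not allowed:
            return {"decision": "deny", "reason": f"{name} rejected the request"}
    return {"decision": "allow", "reason": "all checks passed"}
```

A fail-open layer would differ in exactly one place: the `except` branch would skip the failed check and continue to the next one.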
Question 6: What is the end-to-end enforcement overhead?
A real enforcement layer measures overhead in tens of milliseconds. LLM inference itself typically takes 500 ms to 5 seconds. An enforcement overhead under 50 ms adds at most about 10% to the fastest of those round-trips and about 1% to the slowest, which sits well inside the variance of the inference round-trip. A vendor that reports overhead in seconds is either running an embedded model on the inspection path or is not sitting inline at all.
Question 7: Which LLM endpoints does the layer cover?
The answer separates a model-agnostic enforcement layer from a vendor-tied add-on. A real enforcement layer covers any HTTP endpoint the caller addresses (OpenAI, Anthropic, Azure OpenAI, Bedrock, Vertex, self-hosted models). A vendor-tied add-on covers only the model vendor's own endpoints. Amazon Bedrock Guardrails applies inline only to models invoked through Bedrock. Azure AI Content Safety applies inline only to Azure-hosted models. Both can also be called as standalone APIs, but then the application makes the call, which is the application-side pattern from question 1. Coverage gaps are determinative for organizations running multi-vendor LLM stacks.
Question 8: How is the policy version captured on each record?
Policy version drift is one of the most common AI governance failures (see the AI governance failure piece). The acceptable answer is that the record carries the policy version that applied at decision time, and the policy is pulled from the same store the GRC platform reads from. A weaker answer is "we log the policy ID" without versioning, which means a regulator reviewing a record from January cannot determine what the policy actually said in January.
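The difference between logging a policy ID and logging a policy version can be shown with a small in-memory sketch. `PolicyStore`, `make_record`, and the field names are hypothetical. A real deployment would read from the same store the GRC platform uses.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, Tuple


@dataclass(frozen=True)
class Policy:
    policy_id: str
    version: int
    text: str


class PolicyStore:
    """Keeps every published version so old records stay interpretable."""

    def __init__(self) -> None:
        self._versions: Dict[Tuple[str, int], Policy] = {}
        self._latest: Dict[str, int] = {}

    def publish(self, policy_id: str, text: str) -> Policy:
        version = self._latest.get(policy_id, 0) + 1
        policy = Policy(policy_id, version, text)
        self._versions[(policy_id, version)] = policy
        self._latest[policy_id] = version
        return policy

    def current(self, policy_id: str) -> Policy:
        return self._versions[(policy_id, self._latest[policy_id])]

    def at_version(self, policy_id: str, version: int) -> Policy:
        return self._versions[(policy_id, version)]


def make_record(store: PolicyStore, policy_id: str, decision: str) -> dict:
    """Stamp the record with the policy version in force at decision time."""
    policy = store.current(policy_id)
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "policy_id": policy.policy_id,
        "policy_version": policy.version,
        "decision": decision,
    }
```

With the version on the record, a reviewer can call `at_version` months later and read the exact policy text that produced the decision, even after the policy has been revised.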
Question 9: Does the layer cover agentic AI workflows?
Agentic AI raises the record obligation to action lineage: the chain of decisions the agent made and the tool calls it issued, each captured with policy state. The acceptable answer is that the layer authenticates the agent identity, classifies prompts and tool calls, and commits a record per decision the agent makes. A weaker answer is "we record the agent's final output" without the intermediate steps.
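One way to picture action lineage is a log in which every agent step is its own record, linked to the step that led to it. The sketch below is illustrative only. `LineageLog` and its field names are assumptions, not a standard schema.

```python
from typing import List, Optional


class LineageLog:
    """One record per agent decision, each linked to the step that caused it."""

    def __init__(self) -> None:
        self.records: List[dict] = []

    def commit(self, agent_id: str, kind: str, detail: str, decision: str,
               policy_version: int, parent_id: Optional[int] = None) -> int:
        record_id = len(self.records) + 1
        self.records.append({
            "record_id": record_id,
            "parent_id": parent_id,
            "agent_id": agent_id,
            "kind": kind,  # for example "prompt" or "tool_call"
            "detail": detail,
            "decision": decision,
            "policy_version": policy_version,
        })
        return record_id

    def lineage(self, record_id: int) -> List[dict]:
        """Walk parent links back to the first step, returned oldest first."""
        by_id = {r["record_id"]: r for r in self.records}
        chain = []
        current = by_id.get(record_id)
        while current is not None:
            chain.append(current)
            current = by_id.get(current["parent_id"])
        return list(reversed(chain))
```

A layer that records only the agent's final output would hold the last entry of this chain and nothing before it, so a reviewer could not see which tool calls were allowed on the way there.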
Question 10: What evidence does the vendor produce for an EU AI Act audit?
The acceptable answer is the record series itself, queryable by identity, model endpoint, time window, and decision outcome, with a vendor-supplied evidence package that maps the record fields to Article 12 and Article 19 requirements. A weaker answer is a SOC 2 report or an ISO certification, neither of which by itself produces the per-decision evidence the regulator asks for.
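The four query dimensions named above can be sketched as a simple filter. The record fields (`identity`, `endpoint`, `timestamp`, `decision`) are assumed names for illustration. A production system would run the equivalent query against its own store.

```python
from datetime import datetime
from typing import Iterable, List, Optional


def query_records(records: Iterable[dict],
                  identity: Optional[str] = None,
                  endpoint: Optional[str] = None,
                  start: Optional[datetime] = None,
                  end: Optional[datetime] = None,
                  decision: Optional[str] = None) -> List[dict]:
    """Return records matching every filter that was supplied."""
    matches = []
    for record in records:
        if identity is not None and record["identity"] != identity:
            continue
        if endpoint is not None and record["endpoint"] != endpoint:
            continue
        if decision is not None and record["decision"] != decision:
            continue
        # The time window is half-open: start is included, end is excluded.
        if start is not None and record["timestamp"] < start:
            continue
        if end is not None and record["timestamp"] >= end:
            continue
        matches.append(record)
    return matches
```

During an evaluation, asking the vendor to run a query of this shape live ("every denied request from this user to this endpoint in March") is a quick way to confirm the record series exists at per-decision granularity.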
Question 11: How is AI usage embedded in vendor SaaS handled?
The acceptable answer is a contractual posture: the vendor supports clauses that obligate downstream vendor SaaS to produce vendor-side audit records that match the deployer's evidentiary obligation. A weaker answer is that the inspection layer covers only direct LLM API calls, leaving embedded vendor AI usage outside the inspection boundary. The Fannie Mae LL-2026-04 disclosure obligation applies to embedded usage regardless of where the AI ran.
Question 12: Is the vendor making any comparative claims without independent verification?
The acceptable answer is no. Vendors that claim "we catch 95% more attacks than the next vendor" without an independent benchmark are running marketing copy through engineering channels. The professional posture is to describe what the architecture does and what it produces, then let the buyer evaluate the fit.
DeepInspect
DeepInspect answers question 1 with HTTP-boundary placement, question 2 with IdP-bound identity propagation, question 3 with an inspection-process write path independent of the application, question 4 with cryptographic signatures and tamper-evident log structure, question 5 with fail-closed defaults, question 6 with sub-50ms overhead in internal testing, question 7 with multi-vendor model coverage, question 8 with version-captured policy records, question 9 with full action lineage for agentic workflows, question 10 with a mapped evidence package, question 11 with contractual support for downstream vendor disclosure, and question 12 with descriptive architecture framing.
For organizations preparing for the August 2026 deadlines, the procurement timeline is short. The vendor evaluation should happen in May and June, the deployment in June and July, and the operational handover in early August.
Frequently asked questions
- How do these criteria differ from a SOC 2 vendor questionnaire?
SOC 2 questions cover operational security of the vendor's own systems (encryption, access control, monitoring of the vendor's environment). The twelve questions above cover the architecture of the enforcement layer the vendor sells. A vendor can hold a clean SOC 2 Type II report and still fail question 3 (audit write path independence) because the report covers the vendor's controls, not the architectural property of the product.
- Which criterion is the most common failure point?
Question 3 (who writes the audit record) is the most common failure point because the answer requires architectural inspection of the data path, and most vendor responses are imprecise. The follow-up question "can the application bypass or modify the record" usually surfaces the failure mode.
- What does a good RFP look like for AI security vendor evaluation?
A good RFP organizes the twelve questions into three sections: architecture (questions 1-7), evidence (questions 8-11), and procurement posture (question 12). Each question is asked with two expected outputs: a written answer from the vendor and an architecture diagram showing the data path with the inspection point marked. The diagrams expose the gaps the prose answers conceal.
- How do these criteria apply to free-tier or open-source AI security tools?
The same criteria apply. Free-tier AI security tools (open-source guardrails libraries, lightweight gateways, public model registries) often fail questions 3 and 4 because the audit pipeline is left to the deployer to wire up. The deployer ends up with a self-attested record that fails the regulatory test. The buying decision should evaluate the gap and the cost of closing it.
- How does this list relate to the Lakera, Protect AI, and Bedrock Guardrails comparisons?
Each named comparison piece walks through the twelve questions for that specific vendor pair. The DeepInspect vs Bedrock Guardrails piece, the DeepInspect vs Lakera piece, and the DeepInspect vs Protect AI piece are useful as worked examples of the framework above.