NIST AI RMF MEASURE Function: The Controls That Produce Auditable Evidence
The NIST AI Risk Management Framework organizes risk management into four functions: GOVERN, MAP, MEASURE, and MANAGE. MEASURE is the function that produces the operational evidence the other three depend on. GOVERN sets the policy. MAP identifies the risks. MANAGE responds to the measurements with corrective action. Without MEASURE, the other three functions operate on assertion rather than evidence. The framework defines four categories under MEASURE, with 22 subcategories spread across them (3, 13, 3, and 3). Each subcategory specifies what to assess and, implicitly, what artifact the assessment produces.
I want to walk through each MEASURE category, the controls a deployer needs in production to satisfy the subcategories, the artifacts the controls produce, and where the policy gateway layer sits in the evidence chain.
MEASURE 1: Appropriate methods and metrics are identified and applied
The first category establishes that the deployer has identified the methods and metrics that apply to the AI system's risk profile, and that the methods are being applied. Three subcategories sit under MEASURE 1.
MEASURE 1.1 establishes that approaches and metrics for measurement of AI risks are identified and selected. The artifact is the measurement plan per AI system, documenting which risks are being measured, which metrics are tracked, and which baselines apply.
MEASURE 1.2 requires that the measurement approaches are validated for the AI system's intended use case. The artifact is the validation record showing that the metrics actually correlate with the risks they are meant to measure.
MEASURE 1.3 covers the involvement of internal experts and external stakeholders in regular assessment. The artifact is the stakeholder engagement record (review meeting minutes, expert review documents, external audit reports).
The control that produces these artifacts is a measurement plan owner: a named individual (typically the AI system's product manager or risk owner) who maintains the plan, revalidates it as the system evolves, and engages the stakeholders on the documented cadence.
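As a rough illustration, the measurement plan described above can be kept as structured data so that gaps are checkable rather than asserted. The sketch below is a minimal example under assumed names: `MeasurementPlan`, `Metric`, the field names, and the sample system are hypothetical and are not defined by the framework.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class Metric:
    name: str        # what is tracked, e.g. a rate or a score
    risk: str        # the risk this metric is meant to measure
    baseline: float  # reference value the metric is compared against


@dataclass
class MeasurementPlan:
    system: str
    owner: str                 # the named individual accountable for the plan
    review_interval_days: int  # documented stakeholder-review cadence
    last_review: date
    risks: list[str] = field(default_factory=list)
    metrics: list[Metric] = field(default_factory=list)

    def unmeasured_risks(self) -> list[str]:
        """Risks listed in the plan that no metric claims to measure."""
        covered = {m.risk for m in self.metrics}
        return [r for r in self.risks if r not in covered]

    def review_overdue(self, today: date) -> bool:
        """True when the documented review cadence has lapsed."""
        due = self.last_review + timedelta(days=self.review_interval_days)
        return today > due


# Hypothetical plan for illustration only.
plan = MeasurementPlan(
    system="support-chatbot",
    owner="risk-owner@example.com",
    review_interval_days=90,
    last_review=date(2025, 1, 15),
    risks=["pii_leakage", "harmful_output", "prompt_injection"],
    metrics=[
        Metric("pii_redaction_rate", "pii_leakage", baseline=0.99),
        Metric("blocked_injection_rate", "prompt_injection", baseline=0.95),
    ],
)
```

Here `plan.unmeasured_risks()` returns `["harmful_output"]`, the kind of gap a MEASURE 1.1 review is meant to surface, and `plan.review_overdue(date(2025, 4, 16))` is `True` because the 90-day cadence lapsed the day before.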
MEASURE 2: AI systems are evaluated for trustworthy characteristics
The second category covers the actual evaluation of the system against the trustworthy AI characteristics: valid and reliable, safe, secure and resilient, accountable and transparent, explainable and interpretable, privacy-enhanced, and fair with harmful bias managed. Thirteen subcategories sit under MEASURE 2.
MEASURE 2.1 establishes that test sets, metrics, and reference materials are appropriate. The artifact is the test plan, with reference materials documented and accessible to auditors.
MEASURE 2.2 requires the assessment of AI system performance for validity and reliability. The artifact is the performance test record, including performance under nominal operating conditions and known edge cases.
MEASURE 2.3 covers safety: the system's ability to avoid harm under defined operating conditions. The artifact is the safety test record, including the harm-avoidance scenarios that the system was tested against.
MEASURE 2.4 addresses security and resilience: protection against attacks and the ability to recover. The artifact set includes the security test record (red-team results, penetration test reports), the resilience test record (recovery time, recovery point objectives), and the operational logs that demonstrate the controls run in production.
MEASURE 2.5 covers accountability and transparency. The artifact is the accountability record: who is responsible for what, how decisions are documented, what is disclosed to users.
MEASURE 2.6 addresses explainability and interpretability. The artifact is the explanation generation evidence: for a given decision class, what explanations are produced and how they are validated.
MEASURE 2.7 covers privacy. The artifact set includes the privacy impact assessment, the data flow documentation, and the operational privacy controls evidence.
MEASURE 2.8 addresses fairness and bias. The artifact set includes the fairness assessment, the disparate impact testing record, and the bias mitigation evidence.
MEASURE 2.9 covers continuous monitoring in the production environment. The artifact is the operational monitoring record over the system's lifetime.
MEASURE 2.10 through 2.13 cover further specific characteristics including managed risks for third-party software and components, and the system's behavior over time.
The control that produces these artifacts is the operational logging and monitoring architecture. Every characteristic has a metric. Every metric has a measurement source. The measurement sources have to be continuously available, independent of the application that runs the AI, and tamper-evident.
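One common way to make a measurement log tamper-evident is to chain each record to the one before it with a keyed hash. The sketch below is a minimal illustration of that general technique, not a description of any particular product. The function names, the record layout, and the assumption that the key is held by a logging service outside the AI application are all hypothetical.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # placeholder "previous MAC" for the first record


def _mac(key: bytes, prev: str, event: dict) -> str:
    # Canonical JSON so the same event always produces the same bytes.
    payload = json.dumps(event, sort_keys=True)
    return hmac.new(key, (prev + payload).encode(), hashlib.sha256).hexdigest()


def append_record(log: list[dict], event: dict, key: bytes) -> dict:
    """Append an event, binding it to the previous record's MAC."""
    prev = log[-1]["mac"] if log else GENESIS
    record = {"event": event, "prev": prev, "mac": _mac(key, prev, event)}
    log.append(record)
    return record


def verify_chain(log: list[dict], key: bytes) -> bool:
    """Recompute every MAC. An edited, reordered, or removed record
    (other than at the tail) breaks the chain."""
    prev = GENESIS
    for record in log:
        expected = _mac(key, prev, record["event"])
        if record["prev"] != prev or not hmac.compare_digest(record["mac"], expected):
            return False
        prev = record["mac"]
    return True


# Assumption: the key lives with the logging service, not the AI application.
key = b"demo-key-held-outside-the-application"
log: list[dict] = []
append_record(log, {"metric": "pii_redaction_rate", "value": 0.993}, key)
append_record(log, {"metric": "blocked_injection_rate", "value": 0.97}, key)
```

`verify_chain(log, key)` returns `True` for the untouched log and `False` once any stored event is altered. Truncating the most recent records is not detected by the chain alone. Catching that requires publishing the latest MAC to an external anchor, which this sketch leaves out.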
MEASURE 3: Mechanisms for tracking identified AI risks over time are in place
The third category covers the longitudinal tracking of risks. The framework recognizes that risks change as the system evolves, the deployment context shifts, and new threats emerge.
MEASURE 3.1 covers the approaches used to track new and emerging AI risks. The artifact is the emerging-risk registry with the sources monitored and the criteria for adding new risks.
MEASURE 3.2 establishes that risks identified through the framework are tracked over time. The artifact is the risk register with historical state, current state, and trend.
MEASURE 3.3 addresses feedback loops: the channels through which users, operators, and affected persons can report concerns, and the integration of feedback into the risk register. The artifact is the feedback intake record and the disposition tracking.
The control is the risk register itself, maintained by the GOVERN function and updated from the MEASURE outputs. The register has to support historical query (what was the state of risk X at time T) because regulatory inquiries are time-bounded.
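The historical query ("what was the state of risk X at time T") is straightforward when each risk keeps an append-only, time-ordered history. The sketch below is one possible shape for that, assuming a hypothetical `RiskHistory` class and free-text state labels. A real register would carry more fields and persistent storage.

```python
from bisect import bisect_right
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class RiskHistory:
    risk_id: str
    # Parallel lists kept in time order; entries are appended, never rewritten.
    times: list[datetime] = field(default_factory=list)
    states: list[str] = field(default_factory=list)

    def record(self, at: datetime, state: str) -> None:
        """Append a new state; refuse entries that would rewrite the past."""
        if self.times and at < self.times[-1]:
            raise ValueError("history is append-only and must move forward in time")
        self.times.append(at)
        self.states.append(state)

    def state_at(self, t: datetime) -> Optional[str]:
        """State in effect at time t, or None if the risk was not yet registered."""
        i = bisect_right(self.times, t)
        return self.states[i - 1] if i else None


# Hypothetical history for illustration only.
history = RiskHistory("prompt_injection")
history.record(datetime(2025, 1, 10), "high")
history.record(datetime(2025, 3, 1), "medium")
```

With this history, `history.state_at(datetime(2025, 2, 1))` returns `"high"`, and a query dated before the first entry returns `None`. Keeping the history append-only is what lets a time-bounded inquiry be answered from the record rather than from memory.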
MEASURE 4: Feedback about efficacy of measurement is gathered and assessed
The fourth category closes the loop on measurement itself. The framework requires the deployer to assess whether the measurements are actually capturing what they claim to capture.
MEASURE 4.1 covers the assessment of measurement effectiveness. The artifact is the measurement-effectiveness review, run on a periodic cadence.
MEASURE 4.2 addresses adjustments based on the assessment. The artifact is the change record for measurement methods, with the rationale for changes.
MEASURE 4.3 covers communication of measurement performance to the GOVERN function. The artifact is the reporting record from MEASURE to GOVERN.
The control is the measurement-review owner, often the same person who maintains the measurement plan under MEASURE 1, with the review cadence documented and the artifacts retained.
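One simple input to an effectiveness review is a comparison between the periods in which a metric raised an alert and the periods in which an incident was independently confirmed. The sketch below assumes weekly period labels and a hypothetical `alert_effectiveness` helper. The framework does not prescribe precision and recall, so treat them as one illustrative choice.

```python
def alert_effectiveness(alerted: set[str], confirmed: set[str]) -> dict[str, float]:
    """Compare alerting periods with independently confirmed incident periods.

    precision: share of alerts that matched a confirmed incident
    recall:    share of confirmed incidents that the metric alerted on
    Both are reported as 0.0 when their denominator is empty.
    """
    true_positives = len(alerted & confirmed)
    precision = true_positives / len(alerted) if alerted else 0.0
    recall = true_positives / len(confirmed) if confirmed else 0.0
    return {"precision": precision, "recall": recall}


# Hypothetical review data: ISO week labels.
result = alert_effectiveness(
    alerted={"2025-W01", "2025-W02", "2025-W05", "2025-W07"},
    confirmed={"2025-W02", "2025-W05", "2025-W06"},
)
```

In this example precision is 0.5 and recall is two thirds: half the alerts were noise, and one confirmed incident (week 6) was missed. A result like that is the sort of finding that would feed a MEASURE 4.2 change record.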
Where a policy gateway sits in the MEASURE evidence chain
The MEASURE function produces artifacts. Several artifacts depend on operational data from the AI system's actual call traffic.
Performance measurement (MEASURE 2.2). The system's behavior on production traffic provides the empirical data. A gateway that records every request and response is the data source.
Safety and security measurement (MEASURE 2.3 and 2.4). The detection of safety-relevant events in production traffic, the logs of attempted attacks, and the recovery actions taken are operational outputs. A gateway that enforces policy and records decisions is the source.
Privacy measurement (MEASURE 2.7). The actual data flows in production, the data classes detected in prompts and responses, and the redaction actions taken are the empirical record. A gateway that classifies data and enforces redaction is the source.
Accountability measurement (MEASURE 2.5). The identity context of every call, the policy that governed it, and the outcome are the accountability evidence. A gateway that captures identity and policy at every call is the source.
Continuous monitoring (MEASURE 2.9). The evidence is the operational record over time, showing that the controls ran consistently and produced audit evidence. A gateway that operates continuously is the source.
The gateway is not the only source for the MEASURE function, but it is the source for the empirical operational evidence that the function depends on. A gateway that is absent, or that does not produce auditable records, leaves the MEASURE function dependent on application-generated logs. That is the self-attestation gap: the system being assessed is the only witness to its own behavior.
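To make the mapping above concrete, the sketch below shows how per-call gateway records might be tallied into evidence counts for the areas just listed. The `GatewayRecord` fields and the `evidence_summary` helper are hypothetical, chosen only to mirror the prose. They do not describe any specific gateway's log schema.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class GatewayRecord:
    user_id: str                        # identity context of the call ("" if missing)
    policy_version: str                 # policy that governed the call
    decision: str                       # "allow", "redact", or "block"
    data_classes: tuple[str, ...] = ()  # data classes detected, e.g. ("email",)
    attack_detected: bool = False


def evidence_summary(records: list[GatewayRecord]) -> dict[str, int]:
    """Count the records that bear on each evidence area discussed above."""
    counts: Counter = Counter()
    for r in records:
        counts["performance"] += 1          # every call is performance data
        if r.user_id and r.policy_version:
            counts["accountability"] += 1   # identity and policy both captured
        if r.data_classes:
            counts["privacy"] += 1          # sensitive data classes were detected
        if r.attack_detected or r.decision == "block":
            counts["safety_security"] += 1  # enforcement or attack event
    return dict(counts)


# Hypothetical traffic sample.
records = [
    GatewayRecord("alice", "v12", "allow"),
    GatewayRecord("bob", "v12", "redact", data_classes=("email",)),
    GatewayRecord("", "v12", "block", attack_detected=True),
]
summary = evidence_summary(records)
```

For this sample, `summary` is `{"performance": 3, "accountability": 2, "privacy": 1, "safety_security": 1}`. The third call counts toward performance and security but not accountability, because its identity context is missing. That is the kind of gap an auditor would ask about.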
How the artifacts connect to the regulatory landscape
The NIST AI RMF is voluntary. The artifacts the framework produces are useful regardless of regulatory regime because they map cleanly to mandatory frameworks.
EU AI Act Article 11 technical documentation requires evidence of validity, reliability, accuracy, resilience, and cybersecurity. The MEASURE 2 artifacts supply much of that evidence.
EU AI Act Article 12 automatic logging requires that high-risk AI systems record events over the lifetime of the system. The continuous monitoring under MEASURE 2.9 produces the same evidence.
EU AI Act Article 17 quality management system maps to the GOVERN function with MEASURE outputs feeding the QMS reviews.
For deployers operating under both NIST and EU regimes, the MEASURE artifacts are the multi-purpose evidence base. Producing them well once supports both regimes.
DeepInspect
DeepInspect is a stateless policy gateway between authenticated users or agents and any LLM. The gateway records every AI request and every AI response with the identity context, the policy version, the model version, the data classes detected, the redactions applied, the policy decisions made, and the timestamps. Records are signed and tamper-evident.
For a NIST AI RMF program, DeepInspect produces the operational evidence the MEASURE function depends on. The performance measurement, the safety and security record, the privacy controls evidence, and the accountability record all draw from the gateway logs. The continuous monitoring under MEASURE 2.9 is the gateway's normal operating mode. The MEASURE 4 effectiveness review reads the gateway records to assess whether the measurements are capturing the system's behavior.
If you are facing the August deadline, let's talk.
Frequently asked questions
- Is NIST AI RMF mandatory in the United States?
The NIST AI RMF is voluntary. Federal agency adoption is shaped by OMB Memorandum M-24-10, which directs federal agencies to manage AI risks consistent with the NIST framework. Federal contractors that supply AI systems to those agencies effectively inherit the requirement.
- How does the MEASURE function differ from MANAGE?
MEASURE produces the evidence. MANAGE consumes the evidence and acts on it. A measurement that shows the system has crossed a risk threshold triggers a MANAGE response (mitigation, escalation, or shutdown). The two functions operate in a loop: MEASURE feeds MANAGE; MANAGE actions get measured.
- What is the difference between MEASURE 2.4 and a security audit?
MEASURE 2.4 is the deployer's ongoing operational assessment of the system's security and resilience. A security audit is a point-in-time external review. MEASURE 2.4 produces the artifacts that the security audit will examine. The two are complementary; the audit verifies that the measurement is operating as documented.
- Does the framework prescribe specific metrics?
The framework is metric-agnostic at the high level. NIST has published companion documents that suggest metrics for specific characteristics: the AI RMF Generative AI Profile addresses generative AI specifically, and the AI Safety Institute Consortium has produced metric guidance. The deployer selects metrics appropriate to the system's risk profile and documents the selection under MEASURE 1.1.
- Can a deployer skip MEASURE for low-risk systems?
The framework supports proportional application. A low-risk system can run with a lighter MEASURE profile, with fewer subcategories actively tracked. The deployer documents the proportionality decision under GOVERN. The flexibility supports deployments where heavyweight measurement would be disproportionate to the risk.
- How does the GenAI Profile change the MEASURE function?
The Generative AI Profile does not add new subcategories. It adds suggested actions, mapped to the existing RMF subcategories, that address risks specific to generative AI systems. Those include risks tied to language models (confabulation, often called hallucination, as well as prompt injection and output toxicity), image and video models (manipulated media), and code generation models (insecure code, intellectual property). The MEASURE function for a generative system applies the Profile's suggested actions on top of the core RMF subcategories.