Is GPT-4-class compute over or under 10^25 FLOPs?

Public estimates place the GPT-4 pretraining run on the order of 2 to 3 x 10^25 FLOPs, above the threshold. Open-weight frontier models from leading labs are also generally above the threshold. Industry trends since the Act was drafted have moved more models into the systemic-risk band than the original drafters anticipated.

Can the Commission lower the threshold?

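Yes. Article 51(3) empowers the Commission to amend the threshold by delegated act in light of technological developments such as algorithmic improvements or increased hardware efficiency. The text permits adjustment in either direction, but the working expectation is that the threshold will move down as training compute costs decrease.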

What happens if a provider crosses the threshold mid-life of a model?

The Article 51 trigger applies at the moment the model is placed on the market or significantly modified. A model that crosses the threshold through fine-tuning, RLHF, or other post-pretraining work may inherit systemic-risk status at the modification point. Providers are obligated to estimate cumulative compute honestly.

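Do open-source systemic-risk models get the Article 53(2) exemption?

No. Article 53(2) exempts models released under a free and open-source license from the documentation duties in Article 53(1)(a) and (b), but states that the exception does not apply to general-purpose AI models with systemic risk. A model classified as systemic-risk carries the full Article 53 and 55 obligations regardless of license.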

How does the AI Office investigate a serious incident?

The AI Office can request information from the provider, request access to the model and documentation where reasonably necessary, and coordinate with national competent authorities. The Commission's supervisory and enforcement powers over GPAI providers, including information requests (Article 91) and model evaluations (Article 92), are set out in Articles 88 through 94. Deployer cooperation obligations attach through the high-risk provisions when the incident touches a high-risk deployment.

EU AI Act Systemic-Risk Models: How the 10^25 FLOPs Threshold Triggers Article 55 Obligations

The EU AI Act treats a defined subset of general-purpose AI models as systemic-risk and attaches a heavier obligation set to their providers. The classification mechanics are in Article 51, the threshold benchmark sits at 10^25 floating-point operations of training compute, and the obligation expansion runs through Article 55. Together those provisions are the part of the Act that lands directly on the labs training frontier models and lands second-order on the enterprise deployers integrating those models into regulated applications.

I want to walk through how the threshold actually works, the Commission designation pathway that operates alongside it, and the obligations that flow to deployers who buy inference from a systemic-risk model.

The compute threshold

Article 51 defines two routes by which a GPAI model is classified as having systemic risk. The first route is the compute trigger: a model is presumed to have high-impact capabilities, and therefore systemic risk, when the cumulative amount of computation used for its training measured in floating-point operations is greater than 10^25.

The Commission can amend the threshold by delegated act under Article 51(3) as the field advances. The threshold was set at a level intended to capture frontier-scale training runs as of the time of drafting. Industry compute trends since drafting have made the 10^25 mark less exclusive than originally projected, which is part of why the delegated-act revision power exists.

The second route is Commission designation. Article 51(1)(b) lets the Commission decide, on its own initiative or following a qualified alert from the scientific panel, that a model has capabilities or impact equivalent to the compute presumption, and Article 52 sets out the procedure. The criteria sit in Annex XIII and reach beyond compute: the number of parameters, the quality or size of the dataset, the input and output modalities, the benchmarks and evaluations of capability, the number of registered end users, and the model's scalability and adaptability. Designation can attach to models that fall below the FLOPs threshold but exhibit other systemic-risk characteristics.
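The two routes can be sketched as a small decision function. This is a minimal illustration, not a legal test: the `GpaiModel` type, its field names, and the route labels are hypothetical, and the sketch leaves out procedural detail such as notification and any challenge to the presumption.

```python
from dataclasses import dataclass

# Presumption threshold stated in the Act, in floating-point operations.
SYSTEMIC_RISK_FLOP_THRESHOLD = 1e25


@dataclass
class GpaiModel:
    """Hypothetical record of the two facts the classification turns on."""
    name: str
    training_flops: float                # provider's cumulative compute estimate
    commission_designated: bool = False  # True once the Commission designates the model


def classification_route(model: GpaiModel) -> str:
    """Return which route, if any, classifies the model as systemic-risk."""
    # Route 1: the compute presumption applies strictly above the threshold.
    if model.training_flops > SYSTEMIC_RISK_FLOP_THRESHOLD:
        return "compute_presumption"
    # Route 2: designation can attach even below the threshold.
    if model.commission_designated:
        return "commission_designation"
    return "not_systemic_risk"
```

Note that the compute check comes first only for readability; a model can satisfy both routes at once.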

How a model crosses the threshold

The 10^25 trigger looks at cumulative training compute. For a typical frontier model trained in a single pass, the calculation is the chip-hours used in pretraining, converted to seconds, multiplied by the sustained per-chip throughput in FLOPs per second (theoretical peak times achieved utilization). For a model trained in multiple stages (pretraining plus instruction tuning plus reinforcement learning from human feedback), the cumulative figure adds across stages.
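A minimal sketch of that cumulative calculation follows. The stage names come from the paragraph above, but every chip-hour and throughput figure is a made-up illustration, chosen so that pretraining alone sits below the threshold while the cumulative total crosses it.

```python
SYSTEMIC_RISK_FLOP_THRESHOLD = 1e25
SECONDS_PER_HOUR = 3600.0


def stage_flops(chip_hours: float, sustained_flops_per_chip: float) -> float:
    """Compute for one stage: chip-hours, in seconds, times sustained FLOP/s per chip."""
    return chip_hours * SECONDS_PER_HOUR * sustained_flops_per_chip


def cumulative_training_flops(stages: dict) -> float:
    """Sum compute across all training stages."""
    return sum(stage_flops(hours, rate) for hours, rate in stages.values())


# Hypothetical figures: (chip-hours, sustained FLOP/s per chip) for each stage.
stages = {
    "pretraining": (6_000_000, 4.0e14),
    "instruction_tuning": (300_000, 4.0e14),
    "rlhf": (700_000, 4.0e14),
}

pretraining_only = stage_flops(*stages["pretraining"])
total = cumulative_training_flops(stages)

print(f"pretraining only: {pretraining_only:.3e} FLOPs")
print(f"cumulative total: {total:.3e} FLOPs")
print("above threshold:", total > SYSTEMIC_RISK_FLOP_THRESHOLD)
```

With these illustrative inputs the pretraining run lands near 8.6 x 10^24 FLOPs and the cumulative figure just over 10^25, which is the case where post-pretraining work decides the classification.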

Compute estimation has industry-standard methods. Hardware utilization on H100 or H200 class accelerators is typically reported at 40 to 55 percent of theoretical peak across long training runs. Reported total compute in academic and industry papers usually expresses the figure in FLOPs or PetaFLOP-days. A model trained at the 10^25 boundary corresponds to roughly 10^10 PetaFLOP-seconds, which translates to weeks to months of training on a several-thousand-chip cluster.
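The unit conversions and the rough training duration can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: the 4,096-chip cluster size and the 10^15 FLOP/s per-chip peak are illustrative round numbers, not vendor specifications, and the utilization range is the 40 to 55 percent quoted above.

```python
THRESHOLD_FLOPS = 1e25
PETA = 1e15
SECONDS_PER_DAY = 86_400


def petaflop_seconds(flops: float) -> float:
    """Express a FLOP count in PetaFLOP-seconds."""
    return flops / PETA


def petaflop_days(flops: float) -> float:
    """Express a FLOP count in PetaFLOP-days."""
    return flops / PETA / SECONDS_PER_DAY


def training_days(total_flops: float, num_chips: int,
                  peak_flops_per_chip: float, utilization: float) -> float:
    """Wall-clock days to reach total_flops at a given sustained utilization."""
    sustained_cluster_rate = num_chips * peak_flops_per_chip * utilization
    return total_flops / sustained_cluster_rate / SECONDS_PER_DAY


# Assumed cluster: 4,096 chips at a round 1e15 FLOP/s theoretical peak each.
days_at_40 = training_days(THRESHOLD_FLOPS, 4096, 1e15, 0.40)
days_at_55 = training_days(THRESHOLD_FLOPS, 4096, 1e15, 0.55)

print(f"{petaflop_seconds(THRESHOLD_FLOPS):.3e} PetaFLOP-seconds")
print(f"{petaflop_days(THRESHOLD_FLOPS):,.0f} PetaFLOP-days")
print(f"{days_at_55:.0f} to {days_at_40:.0f} days on the assumed cluster")
```

Under these assumptions the boundary works out to roughly 51 to 71 days of training, consistent with the weeks-to-months range; a smaller cluster or lower utilization stretches it proportionally.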

The provider has the obligation to estimate the figure honestly. A provider that knew or should have known a model crossed the threshold and shipped it without Article 55 compliance creates compliance exposure even if the public figure was below the threshold.

Article 55 obligations

Article 55 adds four obligation categories on top of Article 53. The provider of a systemic-risk model must perform model evaluation in accordance with standardized protocols and tools reflecting current technical practice, including conducting and documenting adversarial testing of the model with a view to identifying and mitigating systemic risks. The provider must assess and mitigate possible systemic risks at the Union level, including their sources, that may stem from the development, the placing on the market, or the use of general-purpose AI models with systemic risk. The provider must keep track of, document, and report relevant information about serious incidents and possible corrective measures to address them to the AI Office and, as appropriate, to national competent authorities. The provider must ensure an adequate level of cybersecurity protection for the general-purpose AI model with systemic risk and the physical infrastructure of the model.

The evaluation obligation is the most resource-intensive. Standardized protocols are being developed through the Code of Practice process. The reference points include published red-team methodologies, evaluation benchmarks for dangerous capabilities, and the documentation format that lets the AI Office reproduce or audit the evaluation.

The serious-incident reporting window is short. Article 55(1)(c) requires reporting without undue delay. Detailed timelines are still being set, but the working text references days, not weeks, for incidents that meet the seriousness threshold.

What counts as a serious incident

The Act defines a serious incident in Article 3(49) as an incident or malfunctioning of an AI system that directly or indirectly leads to the death of a person or serious harm to a person's health, a serious and irreversible disruption of the management or operation of critical infrastructure, an infringement of obligations under Union law intended to protect fundamental rights, or serious harm to property or the environment.

For systemic-risk GPAI models, the reportable categories also include serious cybersecurity incidents against the model or its supporting infrastructure, evidence of significant capability gain that was not previously assessed, and mass misuse incidents at scale that emerge in post-market monitoring.

Second-order obligations on deployers

A deployer integrating a systemic-risk model into a downstream application does not directly carry Article 55 obligations. Where the downstream application is a high-risk AI system, the deployer carries Article 26 obligations, including the Article 26(5) duty to monitor operation and inform the provider of serious incidents. That duty feeds the provider-side obligations in Article 72 (post-market monitoring system) and Article 73 (reporting of serious incidents by providers of high-risk AI systems).

The second-order effect is that the deployer is the principal observer of certain incident categories. A deployer running a frontier model at enterprise scale sees prompt-injection patterns at scale before the upstream provider does. The deployer sees identity-spoofing patterns and mass-egress patterns first because they manifest at the request boundary the deployer controls.

For the upstream provider to discharge Article 55 incident reporting, the deployer's monitoring has to surface the candidate incident in a form the provider can act on. That requires the deployer's logging at the AI request layer to capture prompt classification, response classification, identity context, and decision outcome with sufficient granularity that an upstream review can reconstruct the incident.
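One way to picture that logging granularity is a structured record per AI request. The sketch below is illustrative only: the `AiRequestRecord` type, its field names, and the example classification labels are assumptions, not a schema prescribed by the Act or by any provider.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AiRequestRecord:
    """Hypothetical per-request log record covering the four fields named above."""
    timestamp: str                # UTC, ISO 8601
    principal_id: str             # identity context of the caller
    prompt_classification: str    # e.g. "benign", "possible_prompt_injection"
    response_classification: str  # e.g. "normal", "refused"
    decision_outcome: str         # e.g. "allowed", "blocked"


def make_record(principal_id: str, prompt_classification: str,
                response_classification: str, decision_outcome: str) -> AiRequestRecord:
    """Build a record stamped with the current UTC time."""
    return AiRequestRecord(
        timestamp=datetime.now(timezone.utc).isoformat(),
        principal_id=principal_id,
        prompt_classification=prompt_classification,
        response_classification=response_classification,
        decision_outcome=decision_outcome,
    )


def to_log_line(record: AiRequestRecord) -> str:
    """Serialize to one JSON line so an upstream reviewer can parse it mechanically."""
    return json.dumps(asdict(record), sort_keys=True)


record = make_record("user-123", "possible_prompt_injection", "refused", "blocked")
print(to_log_line(record))
```

Keeping each record as a single self-describing JSON line is a design choice for this sketch: it lets a provider filter by classification or principal without access to the deployer's application code.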

Why this matters for August 2 readiness

The Chapter V obligations for general-purpose AI models, including Article 55, have applied since August 2, 2025 under Article 113(b). The Commission's power to fine GPAI providers under Article 101 starts on August 2, 2026, and the Article 111(3) transition relief gives models placed on the market before August 2, 2025 until August 2, 2027 to comply. For models placed on the market after August 2, 2025, the full set applies on day one.

A deployer that integrates a model the upstream provider has not yet classified as systemic-risk is not absolved of the second-order obligations. The deployer's logging and monitoring posture has to be sufficient to support upstream incident reporting regardless of how the upstream provider's compute estimate sits relative to the threshold. The cheapest position is to instrument the deployer side at the level Article 55 would require for systemic-risk traffic, then dial back if the upstream classification proves otherwise.

DeepInspect

This is the gap DeepInspect closes for the deployer monitoring a systemic-risk or near-threshold model. DeepInspect sits at the AI request boundary as an external enforcement layer, evaluates every request against identity and policy, and writes a per-decision audit record outside the calling application. The record captures the fields that support the Article 26(6) deployer log-retention obligation (timestamp, principal identity, data classification, policy applied, decision outcome) and the additional context that lets upstream providers reconstruct candidate Article 55 incidents (prompt classification, response classification, repeated injection signal, identity-propagation anomalies, tool-invocation chain).

The architecture is identity-aware: every request carries the upstream principal's identity assertion and the agent identity, which is the data the provider needs when investigating whether an incident pattern is concentrated in one operator or distributed across many. The audit record commits before the response returns to the application, which means the evidence persists even if the application fails between the model response and its own logging.
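The commit-before-return ordering can be illustrated generically. The sketch below is not DeepInspect's implementation or API: `append_durably`, `handle_request`, and the JSON-lines file are hypothetical stand-ins that only show the ordering, where the audit write is flushed to disk before the response is handed back to the caller.

```python
import json
import os
import tempfile


def append_durably(path: str, record: dict) -> None:
    """Append one JSON line and force it to disk before returning."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
        f.flush()
        os.fsync(f.fileno())


def handle_request(prompt: str, call_model, audit_path: str, principal_id: str) -> str:
    """Call the model, commit the audit record, and only then return the response."""
    response = call_model(prompt)
    append_durably(audit_path, {
        "principal_id": principal_id,
        "prompt": prompt,
        "response": response,
    })
    # If the calling application crashes after this point, the record already exists.
    return response


# Usage with a stand-in model function (uppercases the prompt).
with tempfile.TemporaryDirectory() as tmp:
    audit_path = os.path.join(tmp, "audit.jsonl")
    reply = handle_request("hi", lambda p: p.upper(), audit_path, "user-123")
    with open(audit_path, encoding="utf-8") as f:
        logged = [json.loads(line) for line in f]

print(reply, len(logged))
```

The trade-off in this ordering is latency: each response waits on a durable write, which is the price of evidence that survives an application failure.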

If you are integrating a foundation model in scope of Article 55 and the August 2 deadline is approaching, let's talk.