The U.S. Government Now Tests AI Models Before You See Them: What 40 Secret Evaluations Reveal About Frontier AI Testing



Every frontier AI model released by a major U.S. developer may soon pass through a government testing lab before reaching the public. On May 5, 2026, the Center for AI Standards and Innovation (CAISI) announced new agreements with Google DeepMind, Microsoft, and xAI that grant federal evaluators pre-release access to frontier AI testing environments, including versions of models with safety guardrails deliberately removed. Combined with renegotiated deals with OpenAI and Anthropic, CAISI now has testing relationships with every major frontier AI developer in the country.

This is not a hypothetical framework. CAISI has already completed more than 40 classified evaluations on unreleased models. The question for every enterprise buyer, developer, and investor: what does the government find when it tests AI with the safety filters turned off?


What CAISI Actually Does

CAISI sits inside the National Institute of Standards and Technology (NIST), under the Department of Commerce. Its core function is straightforward: evaluate AI models for national security risks before they ship to the public.

The process works like this. A frontier AI developer sends CAISI a version of its latest model, often with reduced or completely removed safeguards. Government evaluators then probe the model in both unclassified and classified environments, searching for capabilities that could be weaponized. The results feed back to the developer, who can then adjust the model before release.
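That cycle (submit, probe, report, adjust) reduces to a simple loop. The sketch below is purely illustrative: the function names, the toy pass criterion, and the "remediation" step are all invented here, since CAISI's actual interfaces and pass criteria are classified.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the submit -> evaluate -> adjust cycle described
# above. Every name and criterion here is invented for illustration;
# CAISI's real process and thresholds are not public.

@dataclass
class EvalReport:
    findings: list[str] = field(default_factory=list)

    @property
    def passed(self) -> bool:
        return not self.findings

def run_evaluation(model_id: str) -> EvalReport:
    """Stand-in for the unclassified and classified probing of the raw,
    guardrail-free model. Toy rule: any 'v1' build still has issues."""
    issues = ["cyber: exploit chaining above threshold"] if "v1" in model_id else []
    return EvalReport(issues)

def pre_release_cycle(model_id: str, max_rounds: int = 3) -> bool:
    """Iterate until the model passes or the developer gives up."""
    for round_no in range(1, max_rounds + 1):
        report = run_evaluation(model_id)
        if report.passed:
            return True  # clear to release
        print(f"round {round_no}: {report.findings} -> developer adjusts model")
        model_id = model_id.replace("v1", "v2")  # toy remediation step
    return False  # modified further or shelved, as some unreleased models were

if __name__ == "__main__":
    print("release approved:", pre_release_cycle("frontier-model-v1"))
```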

This is not a checkbox exercise. CAISI has completed over 40 evaluations, including on models that were never released to the public. Some of those models presumably failed evaluations badly enough that they were modified or shelved entirely.

The agreements announced May 5 formalize relationships with Google DeepMind, Microsoft, and Elon Musk’s xAI. Previously, only OpenAI and Anthropic had formal testing arrangements with the agency (dating back to 2024). Those earlier agreements have been renegotiated to align with CAISI’s updated mandate under President Trump’s AI action plan.

The Five Companies Under the Microscope

Every major U.S. frontier AI developer now participates in CAISI’s testing program:

Google DeepMind will provide Gemini models and future frontier systems for pre-release evaluation. Google’s blog post confirmed the company will collaborate on improving evaluation methodologies alongside the testing itself.

Microsoft committed to joint work on adversarial assessment methods, specifically testing AI systems for unexpected behaviors, misuse pathways, and failure modes. Microsoft is also coordinating with the UK’s AI Security Institute through a parallel agreement.

xAI (Elon Musk’s AI company) will submit Grok models for evaluation. This is notable because Musk has been one of the loudest critics of government AI regulation while simultaneously calling for AI safety measures.

OpenAI and Anthropic both renegotiated their original 2024 agreements. The updated memoranda of understanding reflect CAISI’s expanded scope and the Commerce Department’s new directives.

The result: no frontier model from a U.S. developer reaches the public without at least the option of a government review. Whether that review is mandatory or voluntary is the next policy question being debated.

What the TRAINS Taskforce Tests For

The actual testing is run by the TRAINS Taskforce (Testing Risks of AI for National Security), a cross-agency group that includes experts from the Department of Defense, Department of Energy, Department of Homeland Security, and the National Institutes of Health.

They focus on three categories of risk:

Cybersecurity capabilities. Can the model discover, exploit, or chain together software vulnerabilities at a speed or scale that would give an attacker a decisive advantage? This includes testing for zero-day exploit generation, automated penetration testing, and the ability to write malware that evades current detection systems.

Biosecurity risks. Can the model provide actionable instructions for creating biological agents that a trained scientist could not easily find through existing published research? The threshold here is “uplift,” meaning capability that goes meaningfully beyond what is already publicly available (a toy uplift calculation follows this list).

Military misuse potential. Can the model be used for autonomous targeting, weapons system control, or intelligence analysis in ways that bypass human oversight? This category has expanded significantly since the Pentagon’s dispute with Anthropic over autonomous weapons restrictions.
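Of the three, the “uplift” threshold is the most quantifiable, and a toy calculation makes it concrete: a capability is flagged only when the model’s success rate on a task meaningfully exceeds a baseline achievable with public resources alone. The threshold and numbers below are invented for illustration; CAISI has not published how it scores uplift.

```python
# Hypothetical uplift scoring for the TRAINS risk categories. The baseline
# represents what a trained person could achieve using only public sources;
# the 15-point threshold is invented, not a CAISI standard.

UPLIFT_THRESHOLD = 0.15

def uplift(model_rate: float, public_baseline_rate: float) -> float:
    """Uplift = capability meaningfully beyond what is publicly available."""
    return model_rate - public_baseline_rate

def flag_category(name: str, model_rate: float, baseline_rate: float) -> None:
    score = uplift(model_rate, baseline_rate)
    verdict = "FLAG" if score > UPLIFT_THRESHOLD else "ok"
    print(f"{name}: uplift={score:+.2f} -> {verdict}")

# Toy numbers for the three categories above
flag_category("cybersecurity", model_rate=0.62, baseline_rate=0.30)    # FLAG
flag_category("biosecurity", model_rate=0.41, baseline_rate=0.38)      # ok
flag_category("military misuse", model_rate=0.55, baseline_rate=0.35)  # FLAG
```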

The key operational detail: developers hand over models with guardrails stripped. CAISI is not testing the safety filters. It is testing the raw model’s capability, asking what the system can do if someone bypasses or removes every safety measure.

Why the Trump Administration Reversed Course

This policy represents a 180-degree turn. When President Trump took office, one of his earliest actions was revoking the Biden administration’s executive order on AI risks. The message was clear: deregulation, not oversight.

David Sacks, the administration’s first “AI czar,” led the push to remove barriers for AI companies. The philosophy was that American AI companies needed maximum freedom to out-compete China.

Three things changed.

First, Sacks left the role in March 2026. His departure created a power vacuum that White House Chief of Staff Susie Wiles and Treasury Secretary Scott Bessent filled with a more cautious approach.

Second, the Pentagon’s confrontation with Anthropic exposed a concrete risk. When the Department of Defense designated Anthropic a “supply chain risk” (a label previously reserved for foreign adversaries), it forced the administration to reckon with the fact that Claude had been the only AI model running on Pentagon classified networks. Dependence on any single AI vendor, the Pentagon concluded, was itself a national security vulnerability.

Third, Anthropic’s Mythos model demonstrated capabilities that made the abstract threat of AI misuse suddenly concrete.

The Mythos Factor

Anthropic’s Project Glasswing and the Mythos model it powers changed the conversation in Washington. Mythos demonstrated an ability to identify and exploit cybersecurity vulnerabilities at a scale that alarmed intelligence officials.

According to reporting from Tom’s Hardware and Fortune, Mythos was the specific catalyst cited by administration officials when discussing why pre-release testing was necessary. The logic was brutally simple: if an American AI company could build a model this capable at finding security flaws, a Chinese or Russian model could do the same, and the U.S. government needed to understand what these models could do before adversaries figured it out.

The irony is thick. The administration blacklisted Anthropic, designated it a supply chain risk, and attempted to ban it from federal contracts. Then the very capabilities Anthropic built prompted the administration to embrace the kind of oversight Anthropic had been advocating for since its founding.

What This Means for Enterprise AI Buyers

For enterprise leaders evaluating AI vendors, CAISI’s testing program creates a new layer of due diligence information, even if the test results themselves remain classified.

Vendor credibility signal. If a vendor’s models have passed CAISI evaluation, that is a meaningful data point for procurement decisions, especially in regulated industries. Expect vendors to start referencing their CAISI participation in sales materials.

Deployment timeline implications. Pre-release government testing adds time to the release cycle. If CAISI identifies issues that require model modifications, the gap between a model’s internal readiness and its public availability could grow. Enterprise buyers planning around specific model releases should build buffer time into their roadmaps.

Compliance positioning. For organizations in finance, healthcare, defense, or critical infrastructure, using models that have undergone government security evaluation provides a compliance argument that untested models cannot match. This matters as AI regulation accelerates across states.

The open source question. CAISI’s agreements cover the five major U.S. frontier developers, but they do not cover open source models from non-U.S. entities. DeepSeek and other Chinese models operate entirely outside this framework. Enterprise buyers using open source models should understand that those models carry no equivalent government review.

How This Compares to Global AI Governance

The U.S. is not the only country building pre-release AI testing infrastructure.

The UK’s AI Security Institute has been running similar evaluations since 2024 and recently signed a parallel agreement with Microsoft. The EU’s AI Act requires conformity assessments for high-risk AI systems, though the implementation timeline stretches into 2027. China mandates government review of generative AI models before public release through its Interim Measures for the Management of Generative AI Services.

What makes the U.S. approach distinct is the classified testing environment. No other country has publicly acknowledged testing AI models in classified government facilities with stripped guardrails. The TRAINS Taskforce’s cross-agency structure (Defense, Energy, Homeland Security, NIH) also reflects a uniquely broad definition of “national security risk” that extends into biosecurity and public health.

The practical effect: frontier AI models released in the U.S. will have undergone the most rigorous government security evaluation anywhere in the world. Whether that translates into safer models or simply better-informed government intelligence remains an open question.

What Happens Next

The White House is reportedly drafting an executive order that could make pre-release AI testing mandatory rather than voluntary. An “AI working group” of tech executives and government officials is being formed to define the scope and procedures.

The key tension: every additional day a model spends in government testing is a day it is not generating revenue or advancing the company’s competitive position against Chinese rivals. The administration that revoked Biden’s AI order because it slowed innovation is now building its own oversight apparatus that could create similar delays.

For now, the agreements are voluntary. All five companies have signed because the alternative (being seen as unwilling to submit to national security review) carries its own risks, both regulatory and reputational. But the difference between “voluntary” and “mandatory” in this context is largely semantic. No frontier AI company can afford to be the one that refused government security testing.

The more than 40 evaluations already completed show the system works. The question is whether it scales: as model releases accelerate and capabilities compound, can a government taskforce keep pace with an industry spending $145 billion on GPUs and racing toward artificial general intelligence?

FAQ

What is CAISI and what does it do?

CAISI (Center for AI Standards and Innovation) is a division within NIST, under the Department of Commerce. It evaluates frontier AI models for national security risks, including cybersecurity, biosecurity, and military misuse capabilities. CAISI has completed over 40 evaluations, including on models that were never publicly released.

Which AI companies participate in frontier AI testing?

As of May 2026, five companies have formal agreements: Google DeepMind, Microsoft, xAI, OpenAI, and Anthropic. This covers every major U.S. frontier AI developer.

Is government AI testing mandatory?

Currently, the agreements are voluntary. However, the White House is reportedly drafting an executive order that could make pre-release testing mandatory for frontier AI models. An AI working group is being formed to define the scope.

What does CAISI test AI models for?

The TRAINS Taskforce tests for three categories: cybersecurity capabilities (vulnerability discovery, malware generation), biosecurity risks (ability to provide actionable instructions for creating biological agents), and military misuse potential (autonomous targeting, weapons system control). Models are tested with safety guardrails removed.

How does U.S. AI testing compare to other countries?

The U.S. approach is unique in its use of classified testing environments and cross-agency evaluation through the TRAINS Taskforce. The UK runs parallel evaluations through its AI Security Institute. The EU requires conformity assessments under the AI Act. China mandates government review before release.

Ty Sutherland

Ty Sutherland is the Chief Editor of AI Rising Trends. Living in what he believes to be the most transformative era in history, Ty is deeply captivated by the boundless potential of emerging technologies like the metaverse and artificial intelligence. He envisions a future where these innovations seamlessly enhance every facet of human existence. With a fervent desire to champion the adoption of AI for humanity's collective betterment, Ty emphasizes the urgency of integrating AI into our professional and personal spheres, cautioning against the risk of obsolescence for those who lag behind. AI Rising Trends stands as a testament to his mission, dedicated to spotlighting the latest in AI advancements and offering guidance on harnessing these tools to elevate one's life.
