The first honest benchmark for AI agents just landed. China is ten points behind, not ten years.

UniPat's SaaS-Bench tested 14 frontier models on real professional work across 23 production SaaS systems. Claude Opus 4.7 led with 43.9 percent. Kimi K2.6 finished fifth at 34.1 percent. The compute story underneath that gap is the one Brazil should be reading.

UniPat AI published SaaS-Bench yesterday. It puts computer-use agents inside 23 real, deployable software-as-a-service systems and asks them to complete 106 professional workflows (finance closeouts, healthcare merge audits, regression test execution, cross-team report distribution) averaging more than 100 interaction steps each [UniPat AI, May 25, 2026]. It is the first benchmark to evaluate agents on the kind of work that runs the actual back office of a modern company, rather than on toy websites and isolated single-page tasks.

The headline numbers are sobering for the agentic-AI bull case, and informative for everyone watching the materials economy underneath AI compute.

What’s happening

The top model, Claude Opus 4.7, scored 43.9 percent on the weighted checkpoint metric and resolved only 3.8 percent of tasks end to end [UniPat AI, May 25, 2026]. Agents can make progress on professional workflows. They almost never finish them.
The leaderboard reads, in order: Claude Opus 4.7 (43.9), GPT-5.5 High (43.8), Claude Opus 4.6 (43.2), GPT-5.4 High (37.0), Kimi K2.6 (34.1), Qwen 3.6 Plus (29.9), Kimi K2.5 (27.7), Gemini 3.1 Pro (27.1), Doubao Seed 2.0 Pro (27.1), Gemini 3.5 Flash High (23.3), Claude Sonnet 4.6 (23.3), DeepSeek V4 Pro (21.5 on text-only), GLM-5.1 (17.4 on text-only), MiniMax M2.7 (15.8 on text-only) [UniPat AI, May 25, 2026].
Four Chinese-trained models cluster in the top ten: Kimi K2.6 (Moonshot) at fifth, Qwen 3.6 Plus (Alibaba) at sixth, Kimi K2.5 (Moonshot) at seventh, Doubao Seed 2.0 Pro (ByteDance) at ninth. DeepSeek V4 Pro, GLM-5.1 (Zhipu), and MiniMax M2.7 are text-only entrants that finished further back.
The paper identifies four structural failure modes: long-horizon completion fragility (one wrong field in an eighty-percent-correct trajectory zeros out the resolved score), error cascading across applications, agents declaring success without re-verifying observable outcomes, and high run-to-run variance (Claude Sonnet 4.6 scored 0.00 and 0.68 on repeated runs of an identical task).
Allowing three attempts per task (pass@3) rather than one (pass@1) raised partial scores by roughly eight points but did not close the resolved-score gap. Reliability, not capability, is the binding constraint.
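
A minimal sketch of how the two headline metrics can diverge. The exact SaaS-Bench scoring formula is not reproduced here, so this assumes the checkpoint score is a weighted fraction of passed checkpoints and that a task resolves only when every checkpoint passes. The Checkpoint class, the equal weights, and the three sample attempts are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    weight: float  # assumed relative importance of the checkpoint
    passed: bool   # did the observable outcome match the expected one?


def checkpoint_score(run):
    """Weighted fraction of checkpoints passed (assumed form of the partial metric)."""
    total = sum(c.weight for c in run)
    return sum(c.weight for c in run if c.passed) / total


def resolved(run):
    """1 only when every checkpoint passed, else 0 (assumed form of the resolved metric)."""
    return int(all(c.passed for c in run))


def make_run(passed_flags):
    """Build a hypothetical run with equally weighted checkpoints."""
    return [Checkpoint(1.0, flag) for flag in passed_flags]


# Hypothetical attempts at one task with five checkpoints.
attempts = [
    make_run([False, False, False, False, False]),
    make_run([True, True, True, False, False]),
    make_run([True, True, True, True, False]),  # one wrong field
]

# Score of the first attempt alone versus the best of all three.
pass_at_1 = (checkpoint_score(attempts[0]), resolved(attempts[0]))
best_of_3 = (
    max(checkpoint_score(r) for r in attempts),
    max(resolved(r) for r in attempts),
)

print("first attempt (score, resolved):", pass_at_1)
print("best of three (score, resolved):", best_of_3)
```

In this toy case the best of three attempts lifts the partial score from 0.0 to 0.8 while the resolved score stays at 0, which is the pattern the three-attempt result describes.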

Brazil angle

The Brazilian read on this is the one Brasília has not yet written down.

Every agent on this leaderboard, when deployed at scale, will need to run on physical compute infrastructure that depends on copper, nickel, rare-earth magnets, and gallium for the chips, niobium-bearing steel for the transmission lines that move the electricity in, and helium for the lithography that prints the next generation of accelerators. Brazil supplies roughly ninety percent of global niobium, most of it through CBMM in Araxá [USGS Mineral Commodity Summaries 2025]. It holds the third-largest known rare-earth reserves, with Serra Verde producing commercially since 2024. Its Lithium Valley in Vale do Jequitinhonha is targeting roughly twenty percent of global spodumene capacity by 2030. And it sits on under-explored helium potential in the Solimões and Parnaíba basins [ANP records, public domain].

The SaaS-Bench numbers reframe how to read this exposure. If Chinese models are ten percentage points behind the US frontier on professional-grade agentic tasks, and the gap is narrowing every quarter, then the agentic-AI buildout is not going to bifurcate cleanly along the US-China line that the chip-export-control framework assumes. Four of the top ten models on SaaS-Bench are Chinese. Their compute will be procured, and the materials underneath that compute will be procured somewhere, and Brazil is a more accessible supplier for both blocs than almost any alternative producer in either jurisdiction’s preferred geography.

The structural opportunity is to be the materials supplier neither bloc can sanction the other out of. The structural risk is to remain a raw exporter while the value migrates downstream to whoever processes, refines, and integrates.

US angle

Washington’s strategic logic in the chip-export-control regime assumes a multi-year gap between US and Chinese frontier capability, wide enough that near-term unilateral restrictions can take effect before China catches up. SaaS-Bench, with its honest evaluation of real workflow completion, is the kind of public data that erodes that assumption.

Anthropic and OpenAI hold a real lead on the most demanding tasks: Claude Opus 4.7 outscored the best Chinese entry by roughly ten checkpoint points (43.9 versus 34.1) and every other Chinese entry by more. The resolved-score gap is wider in percentage terms because a task resolves only when every checkpoint passes, so per-checkpoint deficits compound. But Kimi K2.6 at 34.1 percent overall, with 50.1 percent on agriculture and 39.5 percent on media (categories that map cleanly to Chinese commercial use cases), is competitive enough that enterprise buyers outside the US security envelope have a credible alternative when they need one.

For the US Defense Production Act and IRA-linked critical-minerals push, the read is that the demand-side scenario the buildout is sizing for (every enterprise deploys frontier-only agents) is less probable than the scenario where multiple credible model families coexist and procurement diversifies across them. That increases, not decreases, total materials demand. It also distributes that demand across more geographies of compute, which makes diversification of the supply base more rational on the policy timeline, not less.

China angle

Moonshot’s release cadence on Kimi tells its own story. K2.5 (February 2026) scored 27.7 percent on SaaS-Bench. K2.6 (April 2026) scored 34.1 percent. A gain of more than six points in two months on a real-workflow benchmark is rapid by any frontier-model standard. Qwen 3.6 Plus from Alibaba (29.9) and Doubao Seed 2.0 Pro from ByteDance (27.1) are within three points of each other, and trail Kimi K2.6 by roughly four and seven points respectively.
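
The gaps can be recomputed directly from the checkpoint scores quoted in the leaderboard. This is a small arithmetic sketch, not part of the benchmark itself: the gap helper is hypothetical, and the per-month pace assumes the two-month spacing between K2.5 and K2.6 stated above.

```python
# Checkpoint scores (percent) as quoted in the SaaS-Bench leaderboard.
scores = {
    "Claude Opus 4.7": 43.9,
    "Kimi K2.6": 34.1,
    "Qwen 3.6 Plus": 29.9,
    "Kimi K2.5": 27.7,
    "Doubao Seed 2.0 Pro": 27.1,
}


def gap(ahead, behind):
    """Point difference between two models, rounded to one decimal."""
    return round(scores[ahead] - scores[behind], 1)


leader_gap = gap("Claude Opus 4.7", "Kimi K2.6")      # US leader vs best Chinese entry
kimi_gain = gap("Kimi K2.6", "Kimi K2.5")             # gain between Kimi releases
qwen_gap = gap("Kimi K2.6", "Qwen 3.6 Plus")          # Qwen behind Kimi K2.6
doubao_gap = gap("Kimi K2.6", "Doubao Seed 2.0 Pro")  # Doubao behind Kimi K2.6
qwen_vs_doubao = gap("Qwen 3.6 Plus", "Doubao Seed 2.0 Pro")

# Assumed two-month interval between the K2.5 and K2.6 releases.
kimi_pace_per_month = round(kimi_gain / 2, 1)

print("leader gap:", leader_gap)
print("Kimi gain:", kimi_gain, "points,", kimi_pace_per_month, "per month")
print("Qwen and Doubao behind Kimi K2.6:", qwen_gap, doubao_gap)
print("Qwen vs Doubao:", qwen_vs_doubao)
```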

These are not laboratory milestones. They are the models that Chinese enterprise customers, and increasingly customers in Southeast Asia, the Gulf, and selectively in Latin America, will be deploying on real workflows over the next twelve to twenty-four months. Each deployment is a procurement decision for accelerators, memory, networking, and the upstream materials that go into them.

The MIIT framing of rare-earth and gallium export controls as permanent national-security policy [Reuters, May 20, 2026] now reads less as leverage and more as foundation. Beijing is establishing the conditions under which both blocs’ agent buildouts have to source from a constrained set of producers.

What it means

Three implications for the desk’s coverage.

One. The investable agentic-AI buildout is not a single-bloc story. Capability convergence between the US frontier and the best Chinese models means materials demand will be procured across multiple compute supply chains rather than one. That favors materials with a clean, neutral supply geography over those with concentrated single-country exposure. SDX (the desk’s Southern Diversification Index, closing the week of May 22 at 96.1) is positioned for exactly that thesis and the SaaS-Bench data strengthens, not weakens, the long-SDX case.

Two. Reliability, not raw capability, is the binding constraint on agent adoption. That favors infrastructure that supports redundancy, retry, and ensemble strategies, which favors more compute per task, not less. The Tantalum AI Materials Index (TAI, 102.4 at the May 22 close) tracks the materials side of that buildout. The structural read continues to be that compute-per-deployed-agent scales faster than agent-deployment count, because reliability requires it.
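
A rough sketch of why retry strategies multiply compute per task. It assumes each attempt succeeds independently with the same probability and that a verifier can tell when an attempt has succeeded. Both assumptions are optimistic given the failure modes described above, so the numbers are illustrative only. The attempts_needed helper and the 95 percent target are hypothetical; the 0.038 input is the top model's end-to-end resolved rate.

```python
import math


def attempts_needed(p_success, target):
    """Smallest k such that 1 - (1 - p_success)**k >= target.

    Assumes independent attempts with identical success probability.
    """
    return math.ceil(math.log(1 - target) / math.log(1 - p_success))


# Top model's end-to-end resolved rate on SaaS-Bench (3.8 percent).
k_today = attempts_needed(0.038, 0.95)

# Hypothetical agent that resolves half of its attempts.
k_better = attempts_needed(0.5, 0.95)

print("attempts for 95% task completion at 3.8% per attempt:", k_today)
print("attempts for 95% task completion at 50% per attempt:", k_better)
```

Under these assumptions a 3.8 percent per-attempt rate needs 78 attempts to reach 95 percent completion, against 5 attempts at a 50 percent rate. Correlated failures would push the real figure higher, not lower.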

Three. The agentic-AI bubble debate is asking the wrong question. Whether models are capable enough to replace knowledge work is not the binding question. Whether they are reliable enough at scale to deploy without continuous human verification is. The answer per SaaS-Bench is no, not yet, and the path to yes runs through more compute and better tooling, not different models. The materials underneath that path are the desk’s coverage.

What to watch

Moonshot Kimi K3 release timing. The K2.5 to K2.6 jump was two months. If K3 lands by August 2026 with another six to eight point gain, its checkpoint score would reach roughly 40 to 42 percent, the gap to today's leader would narrow to roughly two to four points, and the procurement-diversification thesis hardens.
CBMM Q3 2026 capacity utilization disclosure. Niobium demand from grid-buildout steel is the most direct read on whether US and Chinese AI infrastructure spending is converting into real construction. CBMM’s annual report is published in March; quarterly capacity utilization should appear by October.
USA Rare Earth (NASDAQ: USAR) loan-package closing. The non-binding LOI for roughly US $1.6 billion in federal financing for the Round Top heavy-rare-earth deposit and integrated NdFeB magnet capacity, disclosed January 2026 [USAR SEC filing, January 2026], is the bellwether for whether the Pentagon’s procurement diversification thesis is being capitalized at scale or remains rhetorical.

Tantalum Strategy