DeepSeek V4 Pro Lags US AI by Eight Months, CAISI Says

The Center for AI Standards and Innovation (CAISI), a unit of NIST, ranked DeepSeek V4 Pro about eight months behind US frontier models in its May 1 evaluation. Two of the nine benchmarks behind that verdict are private, the cost comparison was filtered down to a single OpenAI model, and Stanford’s AI Index, citing the Arena leaderboard, puts the US-China gap at just 2.7%.

The Washington headline is clean. The methodology, less so.

CAISI’s assessment, published on the NIST website, names DeepSeek V4 Pro the most capable Chinese model the institute has evaluated to date. It also concludes the open-weight flagship lags GPT-5.5 and Claude Opus 4.6 by roughly two-thirds of a year. Both findings are accurate within CAISI’s scoring system. The trouble starts when you read that system’s rules.

How CAISI Got to Eight Months

Most benchmark trackers average percentage scores. CAISI did something different. The institute applied Item Response Theory, the same psychometric math that turns SAT raw scores into scaled scores, and treated each AI model as a test-taker. Easier questions count for less. Harder ones count more.

The output is a single Elo-style number meant to capture latent capability across nine tests in five domains: cybersecurity, software engineering, natural sciences, abstract reasoning, and math. CAISI used what its NIST evaluation page on DeepSeek V4 Pro calls a 1PL variant of IRT. The technique itself is well-grounded. The LLM-evaluation literature has been moving toward IRT-style scoring for at least a year, and a recent arXiv paper on IRT-based LLM benchmark aggregation argued that flat percentage averages hide what models actually know.
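For readers unfamiliar with the 1PL (Rasch) model, a minimal sketch follows. It is a generic textbook formulation, not CAISI’s implementation: the item difficulties are hypothetical, and the function names are illustrative only. In 1PL, the chance of a correct answer depends only on the difference between the test-taker’s ability and the item’s difficulty.

```python
import math

def p_correct(ability: float, difficulty: float) -> float:
    """1PL (Rasch) probability that a test-taker answers an item correctly."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def estimate_ability(responses, difficulties, steps=50):
    """Maximum-likelihood ability estimate via Newton's method.

    responses: list of 0/1 outcomes; difficulties: matching item difficulties.
    Assumes a mixed response pattern (not all 0s or all 1s).
    """
    theta = 0.0
    for _ in range(steps):
        probs = [p_correct(theta, b) for b in difficulties]
        gradient = sum(x - p for x, p in zip(responses, probs))
        information = sum(p * (1.0 - p) for p in probs)
        if information < 1e-12:
            break
        theta += gradient / information
    return theta

# Hypothetical item difficulties in logits, easy to hard (not CAISI's parameters).
difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]
strong = estimate_ability([1, 1, 1, 1, 0], difficulties)
weak = estimate_ability([1, 1, 0, 0, 0], difficulties)
print(f"strong={strong:.2f} weak={weak:.2f}")
```

The resulting ability values sit on a logit scale; any mapping from that scale to an Elo-style number is a further choice that CAISI has not yet documented in detail.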

The Elo numbers landed in this order:

  • 1,260 Elo: GPT-5.5 (OpenAI)
  • 999 Elo: Claude Opus 4.6 (Anthropic)
  • 800 Elo (±28): DeepSeek V4 Pro (open weights)
  • 749 Elo: GPT-5.4 mini (OpenAI)

DeepSeek sits closer to OpenAI’s lightweight tier than to either frontier flagship. Map the 460-point spread between DeepSeek and GPT-5.5 against the dates GPT-5.4 mini and GPT-5.5 launched, and it translates to roughly eight months. That is where the headline came from.
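The spreads can be checked directly from the four scores listed above. This is plain arithmetic on the published numbers, not a reconstruction of how CAISI converts Elo into months.

```python
# Elo-style scores as listed in the article.
elo = {
    "GPT-5.5": 1260,
    "Claude Opus 4.6": 999,
    "DeepSeek V4 Pro": 800,
    "GPT-5.4 mini": 749,
}

target = "DeepSeek V4 Pro"
# Positive gap: the other model scores higher than DeepSeek.
gaps = {name: score - elo[target] for name, score in elo.items() if name != target}
nearest = min(gaps, key=lambda name: abs(gaps[name]))

for name, gap in gaps.items():
    print(f"{name}: {gap:+d}")
print("nearest neighbor:", nearest)
```

DeepSeek sits 51 points above GPT-5.4 mini, 199 below Claude Opus 4.6, and 460 below GPT-5.5.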

The Two Tests Nobody Outside NIST Can Run

Two of the nine benchmarks in CAISI’s IRT engine cannot be replicated by anyone outside the institute. Both are CAISI-built: CTF-Archive-Diamond, an offensive-cybersecurity dataset adapted from public capture-the-flag challenges, and PortBench, a software engineering test. The institute’s broader cyber framework is partly visible in the usnistgov/caisi-cyber-evals public repository, which uses 650-plus rehosted CTF tasks, but the Diamond subset and PortBench are kept internal.

The gap between DeepSeek and GPT-5.5 is widest precisely on the unreproducible tests. On CTF-Archive-Diamond, GPT-5.5 scored 71% to DeepSeek’s 32%, a 39-point chasm that pulls heavily on the IRT aggregate. The benchmarks driving the eight-month verdict are the same benchmarks no independent researcher can run. Ray Perrault, co-director of Stanford’s AI Index steering committee, framed the broader benchmark problem this way in the index’s April release:

“We generally lack measures of how well a system or agent needs to function in a particular setting. Knowing that a benchmark for legal reasoning has 75 percent accuracy tells us little about how well it would fit in a law practice’s activities.”

Where Public Benchmarks Tell a Different Story

Look only at the publicly available tests in CAISI’s score and DeepSeek does not look eight months behind anything. On GPQA-Diamond, the PhD-level science reasoning test, DeepSeek scored 90% to Claude Opus 4.6’s 91%. A one-point spread.

On math olympiad benchmarks, DeepSeek lands inside the top tier. The model scored 97% on OTIS-AIME-2025, 96% on PUMaC 2024, and 96% on SMT 2025. On SWE-Bench Verified, the closest thing the field has to a real-world coding test, DeepSeek scored 74% to GPT-5.5’s 81%. A seven-point gap, not a generational one.

Side by side, the public scores and the one private cyber result read like this:

Benchmark                             DeepSeek V4 Pro   Claude Opus 4.6          GPT-5.5
-----------------------------------   ---------------   ----------------------   ----------------------
GPQA-Diamond (science)                90%               91%                      not reported by CAISI
OTIS-AIME-2025 (math)                 97%               matched                  matched
SWE-Bench Verified (coding)           74%               not reported by CAISI    81%
CTF-Archive-Diamond (private cyber)   32%               not reported by CAISI    71%
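A small sketch makes the contrast between the public and private rows explicit. It uses only the rows of the table where a second model’s score is actually reported, and it compares raw percentage points, which is not how the IRT aggregate weights them.

```python
# (benchmark, DeepSeek V4 Pro %, comparison model, comparison %), from the table.
rows = [
    ("GPQA-Diamond", 90, "Claude Opus 4.6", 91),
    ("SWE-Bench Verified", 74, "GPT-5.5", 81),
    ("CTF-Archive-Diamond", 32, "GPT-5.5", 71),
]

# Percentage-point deficit of DeepSeek on each benchmark.
gaps = {name: other - deepseek for name, deepseek, _, other in rows}
widest = max(gaps, key=gaps.get)

for name, gap in gaps.items():
    print(f"{name}: {gap} points behind")
print("widest gap:", widest)
```

The deficits are 1, 7, and 39 points respectively, so the private cyber benchmark alone accounts for the only large gap among the reported pairs.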

DeepSeek’s own V4 Pro model card on Hugging Face claims V4 Pro matches Opus 4.6 and GPT-5.4. CAISI does not directly contradict that on public tests. The institute simply weights its private cybersecurity and abstract reasoning data heavily enough that the aggregate leaves DeepSeek roughly 200 Elo points below Claude Opus 4.6.

Paul Triolo, a longtime US-China tech policy analyst, called the framing “comparing apples and oranges” in his AIStackDecrypted critique of the CAISI DeepSeek review. His point: CAISI tested deployed, safety-hardened OpenAI cloud APIs against raw DeepSeek model weights with no provider-side guardrails. The asymmetry is real. An open-weight model evaluated bare and a managed cloud product are not the same class of system.

The Cost Comparison That Filtered Out Almost Everyone

CAISI’s cost section is where the numbers get strange. The institute set out to compare DeepSeek’s per-token price against US frontier models, then applied a filter: any US model that performed significantly worse or cost significantly more than DeepSeek was removed. After the filter ran, exactly one US model remained.

That model was GPT-5.4 mini. Not GPT-5.5. Not Opus 4.6. Not Gemini 3. The entire US AI industry, distilled to a single mini-tier comparator. Even against that opponent, DeepSeek came out cheaper on five of seven benchmarks, ranging from 53% less expensive to 41% more.

The filtered comparator list as published:

  • GPT-5.5: filtered out for being significantly more expensive per token.
  • Claude Opus 4.6: filtered out on the same cost basis.
  • GPT-5.4 mini: retained, the only surviving comparator.
  • Smaller open and closed US models: filtered out for performing significantly below DeepSeek.
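The filtering rule described above can be sketched as a simple predicate. CAISI has not published the thresholds it used for "significantly", so the margins, the `Model` type, and the candidate values below are all hypothetical and chosen only to exercise the rule.

```python
from dataclasses import dataclass

@dataclass
class Model:
    name: str
    score: float  # capability score, higher is better
    price: float  # price per million tokens, lower is better

def surviving_comparators(reference, candidates, score_margin, price_margin):
    """Keep candidates that are neither much weaker nor much pricier than reference."""
    kept = []
    for m in candidates:
        much_weaker = m.score < reference.score - score_margin
        much_pricier = m.price > reference.price * price_margin
        if not (much_weaker or much_pricier):
            kept.append(m)
    return kept

# Toy values, not CAISI's data.
reference = Model("reference", score=800, price=3.0)
candidates = [
    Model("flagship", score=1200, price=30.0),  # strong but far pricier
    Model("mini", score=750, price=4.0),        # slightly weaker, similar price
    Model("tiny", score=500, price=0.5),        # cheap but far weaker
]

kept = surviving_comparators(reference, candidates, score_margin=100, price_margin=2.0)
print([m.name for m in kept])
```

With a two-sided rule like this, only models that resemble the reference on both axes survive, which is how a field of candidates can shrink to a single comparator.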

That filter is what critics in the AI ranking community keep flagging. The eight-month capability claim rests on the IRT scoring; the cost claim rests on a filtering rule that produces a sample size of one. A pricing comparison that excludes nearly every model readers would expect it to include is a different document than its title suggests.

What Independent Trackers Show

Outside CAISI, other trackers paint a tighter race. The Artificial Analysis Intelligence Index entry on V4 Pro and V4 Flash places DeepSeek V4 Pro at 52 points, ranking it third out of 83 models in the index and second among open-weight reasoning models behind Kimi K2.6. OpenAI’s frontier model sits near 60 points. The lead is real but compressed, especially compared with where the race stood a year ago.

https://x.com/ArtificialAnlys/status/2047547434809880611

Stanford HAI’s 2026 AI Index report, released April 13, found a starker convergence on the Arena leaderboard. As of March 2026, Claude Opus 4.6 led China’s Dola-Seed-2.0 Preview by 39 Arena points, a 2.7% gap. In 2023, the US-China gap ranged from 17.5 to 31.6 percentage points across MMLU, MATH, and HumanEval. The collapse has been near-vertical.
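The Arena figure mixes two units, rating points and a percentage, so a quick check helps keep them apart. The sketch assumes the 2.7% is the 39-point lead divided by one model’s Arena score; the report’s exact denominator is not stated here.

```python
lead_points = 39       # Arena rating points, from the AI Index
lead_fraction = 0.027  # the same lead expressed as 2.7%

# Score that the 39-point lead would be 2.7% of, under the assumption above.
implied_score = lead_points / lead_fraction
print(round(implied_score))
```

Under that assumption the implied score is roughly 1,444 rating points. Note that the 2023 figures are percentage points of benchmark accuracy, a third unit, so the two eras are comparable in direction rather than in exact magnitude.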

Why Two Honest Measurements Can Disagree

The CAISI and Stanford numbers do not disagree because one of them is wrong. They are measuring different things. The Arena leaderboard ranks models on user-preference voting across general queries, which mirrors how most people actually use LLMs. CAISI ranks them on hard cybersecurity and abstract-reasoning tasks where current US frontier systems genuinely lead.

Both signals are useful and neither tells the full story. DeepSeek V4 Pro is competitive on math, science, and coding tests anyone can verify. It trails on offensive-cybersecurity tasks, where private CAISI evaluations show the largest spread. Artificial Analysis flagged a separate concern: V4 Pro’s hallucination rate hit 94%, meaning the model nearly always gives an answer rather than abstaining when it does not actually know. That rate dwarfs those of Western frontier models and barely surfaces in CAISI’s report.

The cost picture also looks different at scale. Artificial Analysis spent $1,071 to fully evaluate V4 Pro because the model is verbose by design, generating 190 million tokens during its testing run versus a 42-million-token median for comparable models. Cheap per token does not always mean cheap per task. Whether DeepSeek is eight months behind, two months behind, or essentially tied depends entirely on which tests you weight and which deployment you measure.
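The per-token versus per-task distinction reduces to one multiplication. The sketch below uses the two token counts quoted above and assumes, for simplicity, a single blended per-token price for each model and that the evaluation run is representative of ordinary tasks; `relative_task_cost` is an illustrative helper, not a published metric.

```python
def relative_task_cost(price_ratio: float, token_ratio: float) -> float:
    """Cost per task of model A relative to model B.

    price_ratio: A's per-token price divided by B's.
    token_ratio: A's tokens per task divided by B's.
    """
    return price_ratio * token_ratio

# V4 Pro's 190M tokens against the 42M-token median in the same evaluation.
token_ratio = 190 / 42

# Per-token price ratio at which the verbose model costs the same per task.
break_even_price_ratio = 1 / token_ratio

print(f"token ratio: {token_ratio:.2f}x")
print(f"break-even price ratio: {break_even_price_ratio:.2f}")
```

At about 4.5 times the median token count, a model would need a per-token price below roughly 22% of a median-verbosity competitor’s to come out cheaper per task under these assumptions.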

Frequently Asked Questions

Is DeepSeek V4 Pro free to download?

Yes. DeepSeek released V4 Pro under an MIT license on Hugging Face on April 24, 2026, the same permissive terms it used for V3. The full 1.6-trillion-parameter weights are public, the download is roughly 865 GB, and any company can self-host with no royalties. Smaller V4 Flash weights (284B total parameters, 13B active) were also released for users without enterprise GPU clusters.

How much does DeepSeek V4 Pro cost compared to ChatGPT?

DeepSeek V4 Pro’s API runs $1.74 per million input tokens and $3.48 per million output tokens. CAISI’s filtered comparison only kept GPT-5.4 mini, against which DeepSeek beat the price on five of seven benchmarks. Against Opus 4.6, which charges around $25 per million output tokens, DeepSeek is roughly seven times cheaper, though Opus tends to use fewer tokens per task.

Can I run DeepSeek V4 Pro on a laptop?

No. V4 Pro has 1.6 trillion total parameters with 49 billion active per token, requiring at least eight high-memory GPUs in parallel for full-precision inference. A laptop cannot host it. For local use, V4 Flash or community-distilled versions under 70 billion parameters run on workstations with two or three consumer GPUs, with reduced reasoning quality on hard math and coding tasks.
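The sizing claim can be sanity-checked with arithmetic on the figures quoted in this FAQ. The sketch assumes decimal gigabytes and an even split of weights across eight GPUs, and it ignores activations and KV cache, so it gives a rough lower bound rather than a deployment guide.

```python
total_params = 1.6e12    # total parameters
active_params = 49e9     # parameters active per token
download_bytes = 865e9   # ~865 GB download, assuming 1 GB = 1e9 bytes
gpus = 8

# Average storage per parameter in the published files.
bits_per_param = download_bytes * 8 / total_params

# Weight memory each GPU must hold if the files are split evenly.
weights_per_gpu_gb = download_bytes / gpus / 1e9

# Share of the model that is active for any single token.
active_fraction = active_params / total_params

print(f"{bits_per_param:.2f} bits/param")
print(f"{weights_per_gpu_gb:.1f} GB of weights per GPU")
print(f"{active_fraction:.1%} of parameters active per token")
```

Under these assumptions each of the eight GPUs needs about 108 GB for weights alone, which is far beyond laptop hardware and beyond most single consumer cards.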

Is the US government planning to ban DeepSeek?

Not yet, and the May 1 CAISI report does not call for one. The evaluation frames DeepSeek as a capability benchmark, not a national security target. An earlier September 2025 CAISI assessment did flag safety and adversarial-prompt risks in DeepSeek’s open weights. Any restrictions would more likely target federal agency procurement first, not consumer access to the open model file.

CAISI says a fuller methodology paper is coming. Until that document arrives with the IRT item parameters and weighting choices, the eight-month verdict is a number readers cannot independently audit.

CAISI’s September 2025 evaluation of DeepSeek AI models flagged safety and adversarial-prompt risks rather than capability gaps, using different scoring entirely. The May 2026 update sharpens the capability story but loosens the audit trail. Both can be true at once, which is exactly why the conversation is still happening.