Less than four weeks apart this spring, China’s two most-watched AI labs each shipped a new flagship. DeepSeek dropped V4 on April 24: 1.6 trillion parameters, MIT-licensed, weights on Hugging Face the same day. Alibaba answered on May 20 with Qwen3.7 Max, a closed-weight reasoning model with a million-token context and a price tag to match its ambition.
On paper they look like rivals. In practice they’re aimed at different buyers. One is the cheapest serious frontier model you can run yourself; the other is a polished, faster API you rent by the token. This piece breaks down where each actually wins — coding, reasoning, context, speed, and the part that ends most arguments, cost per million tokens.
Key takeaways
- Closely matched on coding. Vendor SWE-bench Verified scores land at 80.6% (DeepSeek V4-Pro) versus 80.4% (Qwen3.7 Max) — a rounding error apart.
- Qwen edges ahead on raw intelligence. The independent Artificial Analysis Intelligence Index scores it 57 against DeepSeek V4-Pro’s 52.
- DeepSeek is far cheaper. V4-Pro runs $0.435/$0.87 per million input/output tokens; Qwen3.7 Max is $2.50/$7.50 — roughly 6–9x more.
- Open vs closed is the real fork. DeepSeek V4 ships open weights you can self-host; Qwen3.7 Max is API-only, with no open release as of June 2026.
- Both claim a 1M-token context, but they differ on speed: Qwen streams at ~193 tokens/sec versus DeepSeek’s ~80.
- Treat vendor benchmarks with caution. Several headline numbers are self-reported and not yet independently reproduced.
The two models at a glance
DeepSeek V4 actually ships in two sizes. V4-Pro is the heavyweight: 1.6 trillion total parameters with 49 billion active per token, built on a sparse Mixture-of-Experts design. There’s also V4-Flash, a 284B/13B model for cheaper, higher-throughput work. Both carry the headline 1M-token context and an unusually large 384K max output, and both are released under the permissive MIT license with weights available on Hugging Face.
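To make the "total versus active" distinction concrete, here is a minimal sketch that computes the share of parameters used per token from the figures quoted above. The dictionary and function names are illustrative, not part of any DeepSeek tooling.

```python
# Parameter counts as quoted in the article, in billions: (total, active).
MODELS = {
    "DeepSeek V4-Pro": (1600, 49),
    "DeepSeek V4-Flash": (284, 13),
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Fraction of all parameters that take part in processing one token."""
    return active_b / total_b


for name, (total_b, active_b) in MODELS.items():
    share = active_fraction(total_b, active_b)
    print(f"{name}: {active_b}B of {total_b}B active ({share:.1%})")
```

In a sparse MoE, per-token compute generally tracks the active count while the memory needed to hold the weights tracks the total, which is why a model this large can be cheap to serve per token yet still demanding to self-host.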
Qwen3.7 Max is a different animal. Alibaba hasn’t disclosed its parameter count — independent observers estimate roughly a trillion total in a sparse MoE — and crucially, it’s closed-weight and API-only. There is no downloadable version as of June 2026, a notable departure from Qwen’s open-source heritage (the 3.6 line still ships open models like the 27B dense variant). Qwen3.7 Max is pitched squarely as a reasoning-and-agent model, leaning on extended chain-of-thought before it answers.
That framing matters for what follows. If you want to understand why both labs are pushing this hard, our explainer on DeepSeek’s rise covers the strategic backdrop.
| Spec | DeepSeek V4-Pro | Qwen3.7 Max |
|---|---|---|
| Released | Apr 24, 2026 | May 20, 2026 |
| Weights | Open (MIT, on Hugging Face) | Closed / API-only |
| Parameters | 1.6T total / 49B active (MoE) | Undisclosed (~1T est., MoE) |
| Context window | 1,000,000 tokens | 1,000,000 tokens |
| Max output | 384,000 tokens | ~65,000 tokens |
| Input price (/M) | $0.435 | $2.50 |
| Output price (/M) | $0.87 | $7.50 |
| Output speed | ~80 tokens/sec | ~193 tokens/sec |
Coding: a dead heat on the headline benchmark
The benchmark everyone checks first is SWE-bench Verified, the human-filtered set of real GitHub issues. Here the two models are effectively tied: DeepSeek’s top configuration (sometimes labeled V4-Pro-Max) reports 80.6%, while Qwen3.7 Max reports 80.4%. That gap is noise.
Dig one layer down and the picture diverges by task type. DeepSeek posts eye-watering numbers on competitive-programming-style coding — 93.5 on LiveCodeBench and a 3,206 Codeforces rating — which lean on algorithmic puzzle-solving. Qwen’s strengths skew toward agentic, multi-step engineering: it claims 60.6 on the harder SWE-bench Pro and 69.7 on Terminal-Bench 2.0, benchmarks that reward a model for navigating a repo, running commands, and iterating rather than one-shotting a function.
The practical read: for autonomous “fix this codebase” agent loops, Qwen3.7 Max has a slight edge; for raw code generation and competitive-style problems, DeepSeek is at least its equal and costs a fraction as much. Neither is the value champion for local setups, though: Qwen3.7 Max cannot be downloaded at all, and that crown still belongs to the smaller open-weight models covered in our best local LLM for coding guide.
One caveat worth repeating: most of these figures are vendor-reported. As of June 2026, independent reproductions are thin, and the U.S. CAISI (NIST) evaluation of V4-Pro concluded its real-world capability trails the leading U.S. systems by roughly eight months overall. Read the marketing scores as a ceiling, not a guarantee.
Reasoning and general intelligence
For an apples-to-apples take, the most useful neutral reference is Artificial Analysis, which runs its own composite Intelligence Index. There, Qwen3.7 Max scores 57 (a top-ten placement among 150-plus models tracked) versus 52 for DeepSeek V4-Pro in its max-reasoning configuration. Qwen comes out ahead, but both sit comfortably in frontier territory.
On individual reasoning tests the vendors trade blows. Qwen3.7 Max reports 92.4 on GPQA Diamond, a graduate-level science benchmark; DeepSeek’s V4-Pro claims around 90 on the same test. Both labs tout near-perfect scores on hard math contests like HMMT and AIME 2026 when allowed to use tools and extended thinking — numbers that say more about test-time compute than baseline ability.
There’s a subtler difference in behavior. Qwen3.7 Max was tuned to abstain more often on questions it isn’t sure about. By Qwen’s own reporting, that gave it the lowest hallucination rate among frontier models (around 22.9%), but it also lowered raw recall accuracy on pure-knowledge benchmarks. If your application is retrieval-augmented and you’d rather a model say “I don’t know” than confabulate, that’s a feature. If you want it to always take a swing, it’s a quirk to plan around.
Context, speed, and the verbosity tax
Both models advertise a 1M-token context window, and both back it with reworked long-context attention — third-party reviewers reported solid recall from Qwen out past the 800K mark. For whole-repository reasoning or feeding in a stack of long documents, either will hold the material.
Speed is where they separate. Qwen3.7 Max streams at roughly 193 tokens per second in independent testing; DeepSeek V4-Pro manages about 80. DeepSeek’s time-to-first-token is actually quicker (around 1.87s versus Qwen’s 2.65s), so DeepSeek feels snappier to start, but Qwen finishes long generations much faster.
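The trade-off between a quicker start and a faster stream can be sketched with a simple linear model: total time is time-to-first-token plus output length divided by streaming speed. This is an assumption for illustration (real latency varies with load, prompt length, and hidden reasoning tokens), and the function names are hypothetical.

```python
def total_seconds(ttft_s: float, tokens_per_s: float, n_tokens: int) -> float:
    """Estimated wall-clock time to receive n_tokens of output."""
    return ttft_s + n_tokens / tokens_per_s


def breakeven_tokens(ttft_a: float, tps_a: float, ttft_b: float, tps_b: float) -> float:
    """Output length at which model B (slower start, faster stream) catches model A.

    Solves ttft_a + n / tps_a == ttft_b + n / tps_b for n.
    """
    return (ttft_b - ttft_a) / (1 / tps_a - 1 / tps_b)


# Figures quoted in the article: (time-to-first-token in seconds, tokens/sec).
DEEPSEEK = (1.87, 80.0)
QWEN = (2.65, 193.0)

n_star = breakeven_tokens(*DEEPSEEK, *QWEN)
print(f"Qwen overtakes DeepSeek after about {n_star:.0f} output tokens")

for n in (50, 500, 5000):
    ds = total_seconds(*DEEPSEEK, n)
    qw = total_seconds(*QWEN, n)
    print(f"{n:>5} tokens: DeepSeek {ds:6.2f}s, Qwen {qw:6.2f}s")
```

Under this model the crossover sits at roughly 107 output tokens, so DeepSeek's head start only matters for very short replies.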
Both are also notably verbose. Running the Artificial Analysis Intelligence Index, DeepSeek V4-Pro burned through 190M output tokens and Qwen3.7 Max 97M. Both figures are well above the field, and DeepSeek is among the most token-hungry models tested. Because output tokens are the expensive ones, a chatty reasoning model can quietly inflate your bill well beyond what the per-token sticker suggests.
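As a rough illustration of that verbosity tax, the sketch below multiplies the output-token counts reported for the Intelligence Index run by each model's list output price. It deliberately ignores input tokens and caching, so treat the result as an output-only estimate rather than the actual cost of the benchmark run.

```python
def output_cost_usd(output_tokens_millions: float, price_per_million: float) -> float:
    """Cost of output tokens alone, in US dollars."""
    return output_tokens_millions * price_per_million


# Output volume (millions of tokens) and list output price ($/M) from the article.
deepseek_cost = output_cost_usd(190, 0.87)
qwen_cost = output_cost_usd(97, 7.50)

price_ratio = 7.50 / 0.87
spend_ratio = qwen_cost / deepseek_cost

print(f"DeepSeek V4-Pro output spend: ${deepseek_cost:,.2f}")
print(f"Qwen3.7 Max output spend:     ${qwen_cost:,.2f}")
print(f"Per-token price gap: {price_ratio:.1f}x, actual spend gap: {spend_ratio:.1f}x")
```

On these figures DeepSeek's output bill comes to about $165 against about $728 for Qwen. DeepSeek is still much cheaper, but its heavier token use shrinks the output-side gap from roughly 8.6x on the price list to roughly 4.4x in spend.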
Price: where the gap becomes a chasm
This is the cleanest win on the board, and it goes to DeepSeek.
| Model | Input /M | Output /M | Cache read /M | AA blended /M |
|---|---|---|---|---|
| DeepSeek V4-Pro | $0.435 | $0.87 | ~$0.004 | $0.18 |
| DeepSeek V4-Flash | $0.14 | $0.28 | ~$0.003 | — |
| Qwen3.7 Max | $2.50 | $7.50 | ~$0.25 | $1.43 |
DeepSeek V4-Pro is roughly six times cheaper on input and nearly nine times cheaper on output than Qwen3.7 Max. Drop to V4-Flash and the gap widens to the point of absurdity for high-volume chat or classification work. DeepSeek’s cache-hit pricing is also brutally aggressive — near $0.004 per million on repeated prefixes, a ~99% discount that makes long, stable system prompts almost free.
Qwen does offer prompt caching too (cache reads around $0.25/M, a 90% cut), and on Artificial Analysis’s blended metric the effective gap narrows to about 8x rather than the headline 9x. But there’s no reading of these numbers where Qwen comes out cheap. You pay for the extra speed and the few extra Intelligence Index points.
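To see how list prices and cache discounts combine, here is a minimal cost estimator built on the table above. The workload (100M input tokens, 20M output tokens, half the input served from cache) is a hypothetical example, and the helper name is illustrative; actual provider billing rules may differ.

```python
# List prices in USD per million tokens, taken from the pricing table.
PRICES = {
    "DeepSeek V4-Pro": {"input": 0.435, "output": 0.87, "cache_read": 0.004},
    "DeepSeek V4-Flash": {"input": 0.14, "output": 0.28, "cache_read": 0.003},
    "Qwen3.7 Max": {"input": 2.50, "output": 7.50, "cache_read": 0.25},
}


def job_cost(model: str, input_m: float, output_m: float, cache_hit: float = 0.0) -> float:
    """Estimated cost in USD for a job measured in millions of tokens.

    cache_hit is the fraction of input tokens billed at the cache-read rate.
    """
    if not 0.0 <= cache_hit <= 1.0:
        raise ValueError("cache_hit must be between 0 and 1")
    p = PRICES[model]
    cached = input_m * cache_hit
    fresh = input_m - cached
    return fresh * p["input"] + cached * p["cache_read"] + output_m * p["output"]


pro, qwen = PRICES["DeepSeek V4-Pro"], PRICES["Qwen3.7 Max"]
print(f"Input price ratio:  {qwen['input'] / pro['input']:.1f}x")
print(f"Output price ratio: {qwen['output'] / pro['output']:.1f}x")

# Hypothetical workload: 100M input, 20M output, 50% of input read from cache.
for model in PRICES:
    print(f"{model}: ${job_cost(model, 100, 20, cache_hit=0.5):,.2f}")
```

For this made-up workload the estimate is about $39 on V4-Pro, about $13 on V4-Flash, and about $288 on Qwen3.7 Max. Changing the input/output mix or the cache-hit fraction moves the ratio, which is why a single "Nx cheaper" figure is only a starting point.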
Which one should you actually run?
Pick DeepSeek V4 if…
- You want open weights you can self-host, fine-tune, or run air-gapped under MIT.
- Cost is the deciding factor — it’s 6–9x cheaper, before its huge cache discount.
- You need the longest outputs (up to 384K tokens) for big generation jobs.
- Your workload is competitive-style coding or math.
Pick Qwen3.7 Max if…
- You want the highest measured general intelligence of the two and don’t mind paying.
- Throughput matters — it generates output more than 2x faster.
- You’re building agentic, multi-step engineering loops where it edges ahead.
- You prefer a managed, closed API and lower hallucination over self-hosting.
For most teams the choice is really a budget-and-control question, not a capability one. They are close enough on quality that the open-vs-closed and cheap-vs-premium axes decide it. If you’re also weighing Western options, see how the field stacks up in our GPT-5 vs Claude 4 vs Gemini 3 breakdown, and our DeepSeek vs ChatGPT comparison covers the cross-border value gap in more depth.
FAQ
Is DeepSeek V4 or Qwen3.7 Max better for coding?
They’re essentially tied on SWE-bench Verified (80.6% vs 80.4%). DeepSeek looks stronger on competitive-programming benchmarks like LiveCodeBench and Codeforces, while Qwen3.7 Max claims an edge on agentic engineering tasks such as SWE-bench Pro and Terminal-Bench. For most coding work either is more than capable.
Which model is cheaper to use?
DeepSeek V4 is dramatically cheaper. V4-Pro costs $0.435/$0.87 per million input/output tokens versus Qwen3.7 Max at $2.50/$7.50 — roughly 6–9x less. DeepSeek’s V4-Flash variant and aggressive cache pricing widen the gap further for high-volume use.
Can I download and self-host these models?
DeepSeek V4 (both Pro and Flash) ships with open weights under the MIT license on Hugging Face, so you can self-host and fine-tune it. Qwen3.7 Max is closed-weight and API-only as of June 2026, with no downloadable version available.
Do both really support a 1-million-token context window?
Yes, both advertise a 1M-token context. DeepSeek also supports up to 384K output tokens, while Qwen3.7 Max caps output around 65K. Independent reviewers reported strong long-context recall from Qwen past the 800K mark.
Which is faster?
Qwen3.7 Max streams output faster — roughly 193 tokens/sec versus about 80 for DeepSeek V4-Pro in independent testing. DeepSeek has a slightly lower time-to-first-token, so it begins responding sooner, but Qwen completes long generations more quickly.
Are the benchmark scores trustworthy?
Treat them carefully. Many headline figures are vendor-reported and not yet independently reproduced. Neutral aggregators like Artificial Analysis give Qwen3.7 Max a higher composite Intelligence Index (57 vs 52), and a U.S. government evaluation (CAISI/NIST) found DeepSeek V4-Pro trails the leading U.S. models by about eight months overall.
Is Qwen3.7 Max actually smarter than DeepSeek V4?
On independent composite scoring, marginally — 57 vs 52 on the Artificial Analysis Intelligence Index. The difference is real but small, and it comes at a large price and openness cost. Whether those few points justify paying ~8x more depends entirely on your use case.
Bottom line
These two models are closer than the hype suggests. On the benchmark that matters most for engineers — SWE-bench Verified — they’re tied, and on general intelligence Qwen3.7 Max leads by a slim, independently confirmed margin. If quality alone decided it, Qwen would win on points.
But quality rarely decides it alone. DeepSeek V4 is open-weight, MIT-licensed, and 6–9x cheaper, which makes it the default for anyone who cares about cost, control, or running models on their own hardware. Qwen3.7 Max is the pick when you want the slightly smarter, much faster managed API and budget isn’t the constraint. Most teams will reach for DeepSeek and only notice what they’re missing on the hardest agentic tasks, if they notice at all.
