Two of China’s most aggressive AI labs shipped new open-weight coding models within a day of each other this month. Moonshot pushed Kimi K2.7 Code on June 12; Zhipu (Z.ai) answered with GLM 5.2 on June 13. Both are giant Mixture-of-Experts models, both carry permissive licenses, and both are pitched squarely at the same job: long-horizon, agentic coding that doesn’t cost Claude or GPT money.
The twist is in how each lab handled benchmarks. Moonshot published a stack of first-party numbers for K2.7 Code on day one. Zhipu deployed GLM 5.2 to its Coding Plan tiers first with no benchmark table at all, then released a full benchmark set alongside the API and MIT open weights days later. So as of this writing, both models now have vendor-published coding scores — but neither has a deep bench of fully independent SWE-bench numbers yet, and Moonshot’s headline figures sit on proprietary in-house suites that practitioners have already started to question. Here’s how the two actually stack up, what we can verify, and what’s still a question mark.
Key takeaways
- Different shapes, same target. Kimi K2.7 Code is a 1T-param MoE with 32B active and 256K context; GLM 5.2 is ~744-753B total with ~40B active and a full 1M context.
- Both now have vendor benchmarks. Moonshot reports +21.8% on its own Kimi Code Bench v2 (62.0 vs 50.9) plus ~30% fewer reasoning tokens. Zhipu later published GLM 5.2 scores too — SWE-bench Pro 62.1, Terminal-Bench 2.1 81.0, FrontierSWE 74.4 — beating GPT-5.5 on several long-horizon suites. Treat both vendors’ numbers with caution until independent runs land.
- Pricing favors Kimi per token, GLM per month. Kimi is metered at $0.95 in / $4.00 out per million; GLM is metered around $1.40 in / $4.40 out, or a flat GLM Coding Plan from $10/mo (Lite).
- Both are genuinely open and commercial-friendly. GLM 5.2 is MIT; Kimi is Modified-MIT (commercial use allowed, with an attribution clause only if you exceed 100M MAU or $20M/month revenue).
- GLM drops into Claude Code cleanly. Z.ai exposes an Anthropic-compatible endpoint, so existing Claude Code / Anthropic-SDK agents work with a base-URL and key swap.
- Running the weights is not for laptops. 744B+ and 1T parameters mean multi-GPU servers or heavy quantization — most people will hit the cloud APIs first.
The 30-second version
If you want the longest context, the strongest published open-weight coding scores, MIT licensing, a flat monthly bill, and drop-in Claude Code compatibility, GLM 5.2 is the more complete package today. If you want the cheapest per-token rate, the best cache discount for token-heavy agent loops, and measured token-efficiency gains, Kimi K2.7 Code is the leaner buy. Both vendors’ benchmarks are first-party for now, and a single-task head-to-head gave GLM a slight edge — so anyone crowning a definitive winner this week is leaning on vendor marketing, not independent data.
Architecture and active parameters
These models are built on the same broad idea — a huge sparse MoE where only a fraction of parameters fire per token — but they tune it differently.
Kimi K2.7 Code is the bigger model on paper: 1 trillion total parameters with 32B active, drawn from 384 experts (8 routed plus 1 shared per token). That sparse activation is why a trillion-parameter model can serve at a sane price. GLM 5.2 is smaller in total (Z.ai’s docs cite ~753B, while trackers like vLLM read ~744B) but activates slightly more per token at ~40B, and it leans on a longer context plus a dual thinking-effort system — a “High” mode for routine work and a “Max” mode for harder architecture and debugging.
The practical read: Kimi’s larger expert pool may help with breadth of knowledge, while GLM’s higher active-parameter count and effort modes are aimed at depth on a single hard problem. The published benchmarks now tilt toward GLM on long-horizon engineering, but those are vendor-run, so treat the architectural story as supporting evidence rather than a verdict.
Context window: 1M vs 256K
This is the clearest, most verifiable difference. GLM 5.2 ships a genuine 1,000,000-token context (the glm-5.2[1m] variant) with output capped around 128K-131K tokens. Kimi K2.7 Code runs a 256K context (262,144 tokens) and a much smaller default output ceiling of 32,768 tokens.
For repo-scale agentic work — loading a large codebase, long plan-then-execute traces, multi-file refactors in one shot — GLM’s 1M window is a real advantage and matches what frontier open models like DeepSeek V4 and Qwen 3.6 Plus now offer. That said, 256K is still large, and in agentic loops most well-built tools retrieve and chunk context rather than stuffing the whole repo in. Bigger context helps; it isn’t automatically better code.
Coding benchmarks (and the honesty gap)
Here’s where you need to keep your skepticism switched on, because every headline number below is vendor-published.
Moonshot reports that K2.7 Code scores 62.0 on its in-house Kimi Code Bench v2, up 21.8% from K2.6’s 50.9, alongside gains on Program Bench and MCP-focused agentic suites and a ~30% cut in reasoning-token usage. These are specific claims — but they run on Moonshot’s own proprietary benchmarks, and at least one outlet (VentureBeat) has reported practitioners saying the numbers don’t fully check out in real use. Independent SWE-bench Verified or SWE-bench Pro figures for K2.7 Code were not available at the time of writing.
GLM 5.2 came out the other way around: it launched on Zhipu’s Coding Plan tiers with no benchmark table, then Z.ai published a full set alongside the API and open weights. Those scores are strong — SWE-bench Pro 62.1 (vs GPT-5.5’s 58.6 and GLM 5.1’s 58.4), Terminal-Bench 2.1 (Terminus-2) 81.0 (vs GPT-5.5’s 84.0), FrontierSWE 74.4% (vs GPT-5.5’s 72.6%), plus long-horizon wins on PostTrainBench (34.3 vs 28.4) and SWE-Marathon (13.0 vs 12.0). Several of those were run by outside evaluators (Proximal, the PostTrainBench team, Abundant AI), but they’re surfaced and curated by Z.ai, so treat them as vendor-published rather than fully independent. The takeaway: GLM 5.2 posts the stronger open-weight coding numbers on paper, while still trailing Claude Opus 4.8 on most of them.
One closer-to-neutral data point exists. An independent-style head-to-head from Kilo gave GLM 5.2 a planning edge — 9.0 vs Kimi’s 8.1 on a backend feature-flag service task, with GLM passing 15/15 verification checks to Kimi’s 14/15 and both producing near-identical working builds. That’s a useful signal, but it’s a single task by one evaluator, not a benchmark suite.
| Spec | GLM 5.2 (Zhipu / Z.ai) | Kimi K2.7 Code (Moonshot) |
|---|---|---|
| Released | June 13, 2026 | June 12, 2026 |
| Total / active params | ~744-753B MoE / ~40B | 1T MoE / 32B (384 experts) |
| Context window | 1,000,000 tokens | 256K (262,144) tokens |
| Max output | ~128-131K tokens | ~32K (32,768) tokens |
| Official coding benchmarks | SWE-bench Pro 62.1; Terminal-Bench 2.1 81.0; FrontierSWE 74.4 (vendor-published, some 3rd-party-run) | +21.8% on Kimi Code Bench v2 (62.0 vs 50.9, vendor-reported) |
| Independent SWE-bench | Not yet (public suites) | Not yet |
| API price (per 1M) | ~$1.40 in / ~$4.40 out; flat plan from $10/mo | $0.95 in / $4.00 out; $0.19 cached |
| License | MIT | Modified MIT (commercial OK; attribution if >100M MAU or >$20M/mo) |
| Endpoint compatibility | OpenAI- and Anthropic-compatible | OpenAI-compatible (Moonshot / OpenRouter) |
Pricing and value
The pricing models are structured differently, so the “cheaper” answer depends on usage.
Kimi K2.7 Code is straightforward metered API: $0.95 per million input tokens, $4.00 per million output, and a notable $0.19 per million for cached input. That cache rate matters for agentic coding, where you re-send a lot of stable context every step. At those rates Kimi is dramatically cheaper than Western frontier models — by output price alone, more than ten times cheaper than premium-tier options.
GLM 5.2 is metered around $1.40 input / $4.40 output per million (live across providers like FriendliAI, Novita, and Z.ai), but Zhipu also pushes the GLM Coding Plan, a flat subscription with Lite, Pro, Max, and Team tiers. Lite starts at $10/month (roughly 400 prompts/week), Pro at $30/month, and Max at $80/month — excellent value if you code in it daily and want predictable billing.
If you’re a solo developer living in an agent all day, GLM’s flat plan can be the cheaper real-world choice. If you’re running variable or bursty workloads, or building a product on top, Kimi’s metered rate plus cheap caching is easier to model. For a broader cost picture across self-hostable options, our roundup of the best local LLM for coding in 2026 puts both in context.
License and openness
Both are legitimately open-weight, which separates them from closed frontier labs — but the fine print differs.
GLM 5.2 uses a plain MIT license: use it, modify it, ship it commercially, no strings. Kimi K2.7 Code uses a Modified-MIT license that also permits commercial use, but adds one condition: if your product crosses 100 million monthly active users or $20 million in monthly revenue, you must prominently display “Kimi K2.7 Code” in the UI. For virtually every team that’s a non-issue; for a hyperscaler it’s a real clause. So on pure permissiveness, GLM 5.2’s MIT edges it.
GLM 5.2 strengths
- Full 1M-token context for repo-scale work
- Strongest published open-weight coding scores of the two
- Unrestricted MIT license
- Drop-in Anthropic + OpenAI endpoint compatibility
- Flat-rate coding plan from $10/mo
- High/Max thinking-effort control
GLM 5.2 caveats
- Benchmarks are vendor-published (some third-party-run); no broad independent SWE-bench suite yet
- Per-token API rate slightly higher than Kimi
- Smaller total parameter count
Agentic and tool use
Both models explicitly target long-horizon coding agents, not just snippet completion, and both expose strong tool-calling.
GLM 5.2’s standout for agent builders is compatibility: because Z.ai serves an Anthropic-compatible endpoint (alongside an OpenAI-compatible one), you can point Claude Code or an Anthropic-SDK agent at it by swapping the base URL and key — no rewrite. It also plugs natively into Cline, Cursor, and 20-plus dev tools, and its published long-horizon scores (FrontierSWE, PostTrainBench, SWE-Marathon) are aimed precisely at multi-hour agent workloads. Kimi K2.7 Code leans into measured agentic efficiency: Moonshot’s reported ~30% reduction in reasoning tokens is aimed directly at the cost and latency of multi-step agent loops, and the model posts gains on MCP-oriented suites. If you’re choosing an agent harness around either, our guide to the best AI agent frameworks in 2026 covers the orchestration layer.
How to actually run each
There are two paths, and for most people the answer is the cloud.
Cloud API is the easy route. Kimi K2.7 Code is available through Moonshot’s API and aggregators like OpenRouter; GLM 5.2 is live on the GLM Coding Plan and via OpenAI-/Anthropic-compatible endpoints (base URL api.z.ai). This is where nearly everyone should start.
Open weights are published — Kimi K2.7 Code is on Hugging Face with vLLM, SGLang, and KTransformers support, and GLM 5.2’s MIT weights are downloadable — but the hardware is serious. A 1T-parameter model (even at 32B active) or a ~750B model needs multi-GPU servers or aggressive GGUF quantization to run locally; these are not single-consumer-card models. If your goal is self-hosting smaller coders on commodity hardware, you’re better served by the best local LLMs to run on Ollama in 2026 than by either of these heavyweights.
How they fit next to DeepSeek V4 and Qwen 3.x
Neither model exists in a vacuum. DeepSeek V4-Pro (released April 2026) ships 1.6T params with a 1M context and an MIT license, and posts a verified 80.6% on SWE-bench Verified — currently the strongest open-weight number around. Qwen 3.6 Plus also offers a 1M context and a frontier-competitive 78.8% on SWE-bench Verified. In other words, GLM 5.2 and Kimi K2.7 Code are entering a crowded, fast-moving field where rivals already have published, partly independent benchmarks on the standard public suites. GLM 5.2’s vendor numbers are competitive, but the gold-standard SWE-bench Verified comparisons still belong to DeepSeek and Qwen for now. For a closer look at that pair, see our DeepSeek V4 vs Qwen3 comparison.
FAQ
Is GLM 5.2 or Kimi K2.7 Code better for coding?
There’s no fully independent answer yet, but on published numbers GLM 5.2 looks stronger for long-horizon coding: Zhipu’s benchmarks put it at SWE-bench Pro 62.1 and FrontierSWE 74.4, ahead of GPT-5.5 on several suites, with a 1M context and Claude Code compatibility. Kimi K2.7 Code is cheaper per token and reports +21.8% on its own coding benchmark. A single-task Kilo head-to-head gave GLM a slight planning edge (9.0 vs 8.1, 15/15 vs 14/15 checks). All headline scores are vendor-published, so wait for independent SWE-bench runs before treating any of it as final.
Does GLM 5.2 have published benchmarks?
Yes — but not at launch. Zhipu first deployed GLM 5.2 to its Coding Plan tiers on June 13, 2026 with no benchmark table, then published a full set alongside the API and MIT open weights days later: SWE-bench Pro 62.1, Terminal-Bench 2.1 81.0, FrontierSWE 74.4, PostTrainBench 34.3, and SWE-Marathon 13.0, beating GPT-5.5 on several long-horizon suites while trailing Claude Opus 4.8 on most. Several were run by third-party evaluators but curated by Z.ai, so they’re vendor-published, not fully independent.
Can I use GLM 5.2 with Claude Code?
Yes. Z.ai exposes an Anthropic-compatible endpoint (under api.z.ai, e.g. https://api.z.ai/api/anthropic or the coding endpoint), so you can point Claude Code or an Anthropic-SDK agent at GLM 5.2 by setting ANTHROPIC_BASE_URL and your Z.ai API key, then selecting the glm-5.2 (or glm-5.2[1m]) model — no code rewrite required. Expect to raise the request timeout, since first-token latency on the 1M context runs longer than Claude’s default.
How much does each model cost?
Kimi K2.7 Code is metered at $0.95 per million input tokens, $4.00 output, and $0.19 cached. GLM 5.2 is metered around $1.40 input / $4.40 output per million, or sold through the GLM Coding Plan from $10/month (Lite), with Pro at $30 and Max at $80.
Is Kimi K2.7 Code free for commercial use?
Effectively yes. It uses a Modified-MIT license that permits commercial use; the only added condition is that products exceeding 100 million monthly active users or $20 million in monthly revenue must display “Kimi K2.7 Code” in their UI. GLM 5.2’s plain MIT license has no such clause.
Can I run these models locally?
The weights are available — Kimi K2.7 Code on Hugging Face (vLLM/SGLang/KTransformers) and GLM 5.2 under MIT — but both are very large MoE models. Expect to need multi-GPU servers or heavy quantization; neither runs comfortably on a single consumer GPU.
Which has the larger context window?
GLM 5.2, by a wide margin: 1,000,000 tokens versus Kimi K2.7 Code’s 256K. That makes GLM the better fit for whole-repository context and very long agent traces, though strong agent tooling reduces how often you need the full window.
Bottom line
These are two excellent, genuinely open coding models that arrived a day apart, and the honest verdict is that it’s close — with GLM 5.2 now holding the on-paper edge. Both vendors have published coding benchmarks, and Zhipu’s are the stronger of the two (SWE-bench Pro 62.1, FrontierSWE 74.4, ahead of GPT-5.5 on several long-horizon suites), on top of a 1M context, an unrestricted MIT license, predictable flat-rate billing, and effortless Claude Code integration. Kimi K2.7 Code answers with the cheapest per-token price, a strong cache discount, token-efficient agent loops, and its own reported gains.
If you’re shipping a product or running heavy variable workloads, start with Kimi’s metered API and its cache discount. If you live inside a coding agent all day and value a 1M window, top published scores, and drop-in Anthropic compatibility, GLM 5.2’s coding plan is hard to beat. And whichever you pick, remember that every headline number here is vendor-published — wait for independent SWE-bench Verified results before treating any marketing claim as settled fact. In a field where DeepSeek V4-Pro already posts a verified 80.6% on SWE-bench Verified, the bar for “best open coder” is measured by neutral evaluators, not asserted by the labs that built the models.
