Moonshot AI released Kimi K2.7 Code on June 12, 2026, and the name matters more than usual. This is not a new general chatbot called “Kimi 2.7.” It is a coding-only model: a 1-trillion-parameter Mixture-of-Experts system tuned specifically to plan, edit files, run tools, and grind through multi-step software tasks. For ordinary chat, Moonshot still points you at the older K2.6.
The pitch is efficiency. K2.7 Code claims higher coding scores than K2.6 while burning roughly 30% fewer reasoning tokens, and it lists at $0.95 per million input tokens and $4.00 per million output. That is a fraction of what frontier closed models charge. The weights are public under a Modified MIT license, so you can also run it yourself — if you have the hardware for a model that lands at about 595 GB on disk even in its native 4-bit form. Here is what is real, what is vendor-reported, and where it fits.
Key takeaways
- Coding-only, not a chatbot. “K2.7 Code” is a dedicated agentic coding model; Moonshot recommends K2.6 for general use.
- 1T MoE, 32B active. 384 experts (8 routed + 1 shared), 61 layers, 256K context, 160K vocab, MLA attention, plus a 400M-param MoonViT vision encoder for image and video input.
- Thinking is mandatory. There is no non-thinking mode; disabling it returns an API error.
- Vendor-reported gains over K2.6: +21.8% Kimi Code Bench v2, +11.0% Program Bench, +31.5% MLS Bench Lite, with ~30% fewer reasoning tokens.
- Aggressive pricing: $0.95 in / $4.00 out per million tokens, with cache hits near $0.19 — roughly 6x under Claude Opus 4.8 and up to ~12x under Claude Fable 5 on output.
- Open weights, heavy hardware. Modified MIT license on Hugging Face; the weights ship natively in int4 (~595 GB), and realistic local inference still needs roughly 8 80GB-class GPUs (~640 GB VRAM).
What Kimi K2.7 Code actually is
K2.7 Code is the latest in Moonshot’s fast-moving Kimi line, and it is the first the company has split off as a coding-specialized release rather than a general model with a coding mode. The design goal is long-horizon software engineering: the kind of work where an agent reads a repo, plans a change, edits several files, runs a build, reads the error, and iterates. It is built to act, not to converse.
That focus shows up in the defaults. The model always runs with “thinking” enabled — there is no way to turn it off, and the API rejects requests that try. The bet is that for agentic coding, the reasoning traces are worth their cost, and that K2.7’s efficiency gains keep that cost in check. If you want a model that just answers a quick question cheaply, Moonshot itself says to use K2.6 instead. We cover the broader family in our Moonshot Kimi explainer.
Specs and architecture
The architecture is a sparse MoE. Of the 1 trillion total parameters, only about 32 billion activate per token, which is what keeps inference cost and latency far below what a 1T dense model would imply.
| Spec | Kimi K2.7 Code |
|---|---|
| Total parameters | 1 trillion (MoE) |
| Active per token | ~32 billion |
| Experts | 384 (8 routed + 1 shared) |
| Layers | 61 (1 dense) |
| Context window | 256K tokens (262,144) |
| Vocabulary | 160K |
| Attention | MLA (Multi-head Latent Attention) |
| Modality | Text, image, video (via 400M MoonViT encoder) |
| Native precision | INT4 (MoE weights), BF16 attention |
| Thinking mode | Mandatory (cannot disable) |
| License | Modified MIT (open weights) |
The native multimodal input is a genuine differentiator for a coding model. You can hand it a screenshot of a broken UI, a diagram, or a short screen recording alongside the code. Most coding-focused open models are text-only, so this widens the practical use cases — debugging from a screenshot, implementing from a mockup — without a separate vision pipeline.
The benchmark gains, read honestly
Moonshot’s headline numbers compare K2.7 Code to K2.6 on its own internal suites. These are vendor-reported and use Moonshot’s benchmarks, so treat them as directional rather than neutral ground truth.
| Benchmark (vendor-reported) | K2.6 | K2.7 Code | Change |
|---|---|---|---|
| Kimi Code Bench v2 | 50.9 | 62.0 | +21.8% |
| Program Bench | 48.3 | 53.6 | +11.0% |
| MLS Bench Lite | 26.7 | 35.1 | +31.5% |
| MCPMark Verified | 72.8 | 81.1 | +11.4% |
| Reasoning tokens used | baseline | ~30% fewer | more efficient |
On agent-tool benchmarks (MCP Atlas, MCPMark Verified, Kimi’s own Claw 24/7), Moonshot reports gains of roughly 10% over K2.6 — smaller, but in the right direction.
Independent data is starting to land. Artificial Analysis, which runs its own measurements rather than republishing vendor claims, places K2.7 Code at 42 on its composite Intelligence Index, ranking it around #6 among the open-weight models it tracks. It clocks output at about 55.8 tokens per second with a ~2.25-second time to first token on Moonshot’s standard API — respectable, not record-setting, and the mandatory thinking mode means real-world latency on a full agent task is higher than that first-token number suggests. (Moonshot also offers a separate high-speed endpoint that runs far faster, but the headline model is the one benchmarked here.)
The most useful third-party comparison comes from head-to-head coding tests. On MCPMark Verified, an agent-tool benchmark, K2.7 Code scores 81.1, edging out Claude Opus 4.8 at 76.4 — but GPT-5.5 sits well ahead at 92.9. On Moonshot’s own Program Bench, GPT-5.5 leads 69.1 to 53.6. The honest summary: K2.7 Code is competitive with frontier models on some agentic-tool tasks and clearly behind on others. It is not the new state of the art. Its case rests on price.
Pricing and value
This is where K2.7 Code makes noise. Here is the published API pricing against the current closed frontier, per million tokens.
| Model | Input | Output |
|---|---|---|
| Kimi K2.7 Code | $0.95 | $4.00 |
| Claude Opus 4.8 | $5.00 | $25.00 |
| GPT-5.5 | $5.00 | $30.00 |
| Claude Fable 5 | $10.00 | $50.00 |
On output, K2.7 Code is roughly 6x cheaper than Opus 4.8 and more than 12x cheaper than Fable 5. Cache hits cost around $0.19 per million input tokens, which matters a lot for agents that re-read the same files repeatedly. Combine that with ~30% fewer reasoning tokens per task, and the effective cost gap widens further.
The trade is straightforward: lower raw capability per call, but the same budget buys many more calls. For high-volume agentic workloads — CI bots, bulk refactors, test generation, automated triage — running K2.7 Code several times and keeping the best result can beat one expensive frontier call. For a single, subtle architectural decision, the frontier model’s higher hit rate may still be worth the premium. If you are weighing options across the field, our roundup of the best AI coding assistants puts this in context.
Strengths
- Open weights under a permissive Modified MIT license
- Very low per-token cost with cheap cache hits
- Native image and video input, rare for a coding model
- 256K context suits whole-repo agentic work
- ~30% reasoning-token reduction trims agent bills
Limitations
- Trails GPT-5.5 on multiple coding benchmarks
- Mandatory thinking mode adds latency and rules out fast non-reasoning calls
- Local hosting needs data-center-class GPUs
- Headline gains are vendor-reported on internal suites
- Not recommended for general chat — narrow by design
How to use it: API vs running the weights
The easy path is the API. K2.7 Code is available through Moonshot’s Kimi API and its Kimi Code CLI, and it speaks the standard tool-calling conventions, so it drops into most existing agent setups. If you build on agent scaffolding, see our guide to the best AI agent frameworks for where a model like this slots in.
Running the open weights is a different story, and this is the part to be clear-eyed about. Like Kimi K2 Thinking before it, K2.7 Code ships pre-quantized in native int4 — the MoE weights are stored at 4-bit via quantization-aware training while attention stays in BF16 — which is why the Hugging Face release lands at roughly 595 GB on disk rather than the ~2 TB a full BF16 copy of a 1T-parameter model would need. (A full-precision BF16 build is not what Moonshot distributes.) Serving is supported through vLLM, SGLang, and KTransformers.
| Setup | Reality |
|---|---|
| ~8x 80GB-class GPUs (≈640 GB VRAM), native int4 | Recommended full-context production setup (≈5x H200 is a rough equivalent) |
| 4x RTX 4090 (96 GB), with CPU/RAM offload | Possible, but context capped ~64K–128K and much lower throughput |
| Single consumer GPU | Not viable for the full model |
In short, “open weights” does not mean “runs on your laptop.” Even at native 4-bit the weights alone exceed half a terabyte, so for most teams the API is the sensible route, and self-hosting is for organizations with serious GPU budgets or strict data-residency needs. If local is a hard requirement, weigh smaller options in our best local LLM for coding guide, which covers models that fit real hardware.
How it compares to K2.6 and rivals
Against K2.6, K2.7 Code is the better tool for sustained, multi-step coding agents and the worse tool for everything else — Moonshot’s own guidance is to keep K2.6 for general tasks. The split is deliberate: one model optimized for agentic coding, the other for breadth.
Against the broader open field, the obvious 2026 rival is Zhipu’s GLM-5.2, another large open model chasing the same coding-agent niche; we break that one down in our GLM-5.2 explainer, and pit the two against each other in GLM-5.2 vs Kimi K2.7 for coding. A fair head-to-head is still hard to call: Zhipu shipped GLM-5.2 without published benchmark numbers, and neutral third parties have not yet posted directly comparable agentic-coding scores for the two, so any “winner” claim today is premature. Against the closed frontier, K2.7 Code is a value play, not a capability leader: you accept a measurable gap to GPT-5.5 in exchange for open weights and a price that can be an order of magnitude lower.
FAQ
Is Kimi K2.7 Code a chatbot or a coding model?
It is a coding-specialized model built for agentic software tasks — planning, editing files, running tools, and debugging across many steps. It is not positioned as a general chatbot. Moonshot recommends the older K2.6 for general conversation and reserves K2.7 Code for coding work.
How much does Kimi K2.7 Code cost?
The API lists $0.95 per million input tokens and $4.00 per million output tokens, with cache hits around $0.19 per million input. That is roughly 6x cheaper than Claude Opus 4.8 on output and over 12x cheaper than Claude Fable 5.
Can I run Kimi K2.7 Code locally?
Yes, the weights are public under a Modified MIT license, but it is a 1T-parameter model that takes about 595 GB on disk even in its native int4 format. A realistic production setup needs roughly 8 80GB-class GPUs (~640 GB VRAM) — about five H200s is a rough equivalent. A 4x RTX 4090 rig can run it only with CPU/RAM offload, reduced context, and lower throughput, and no single consumer GPU will hold the full model.
How much better is K2.7 Code than K2.6?
Moonshot reports +21.8% on Kimi Code Bench v2, +11.0% on Program Bench, +31.5% on MLS Bench Lite, and +11.4% on MCPMark Verified, plus about 30% fewer reasoning tokens per task. These are vendor-reported figures on Moonshot’s own benchmarks, so treat them as directional.
Does Kimi K2.7 Code support images?
Yes. It includes a 400M-parameter MoonViT vision encoder and accepts text, image, and video input. That lets it work from screenshots, diagrams, or short recordings — unusual for a coding-focused open model.
Is Kimi K2.7 Code better than GPT-5.5 for coding?
Not on most benchmarks. GPT-5.5 leads on Program Bench (69.1 vs 53.6) and MCPMark Verified (92.9 vs 81.1). K2.7 Code’s advantage is cost: the price gap means you can run it far more often for the same budget, which can win on high-volume agentic workloads.
What is “thinking mode” and can I turn it off?
Thinking mode is the model’s internal reasoning step before it answers. In K2.7 Code it is mandatory — there is no non-thinking mode, and the API returns an error if you try to disable it. The efficiency claim is that it now reaches answers using ~30% fewer reasoning tokens than K2.6.
Bottom line
Kimi K2.7 Code is a sharp, deliberately narrow release: an open-weight 1T coding agent that trades a real capability gap to GPT-5.5 for pricing that is hard to argue with and a license that lets you own the model outright. It will not top the leaderboards, and the mandatory thinking mode plus data-center hardware requirement — over half a terabyte of weights even at native 4-bit — mean it is not for everyone. But for teams running high-volume agentic coding, where cost per task compounds fast, it is one of the most credible value plays of 2026. Use the API unless you have the GPUs and a reason to self-host, benchmark it on your own repos before committing, and keep K2.6 around for the chat it was never meant to do.
