At Computex 2026, NVIDIA confirmed that Vera Rubin, the successor to the Blackwell architecture that powers today’s AI boom, is now in full production. It is arguably the most consequential AI-hardware announcement of the year, and the headline number is striking: NVIDIA says Rubin cuts the cost of AI inference by up to 10×. That matters not only to hyperscalers building data centers but also to the price of every AI tool you use. Here’s a breakdown of what Vera Rubin actually is.
Key takeaways
- Vera Rubin is NVIDIA’s next-generation AI platform, the successor to Blackwell — now in full production (announced at Computex 2026).
- The headline: NVIDIA’s figures claim up to 10× lower inference token cost and 4× fewer GPUs to train Mixture-of-Experts models vs Blackwell.
- It’s a six-chip platform, not just a GPU — the flagship Vera Rubin NVL72 packs 72 Rubin GPUs and 36 Vera CPUs.
- Rubin CPX is a separate new GPU built for million-token context inference (coding, video), with 128GB of GDDR7 per GPU.
- Availability: cloud instances in H2 2026 (AWS, Google Cloud, Azure, OCI and more); Rubin CPX at the end of 2026.
What is NVIDIA Vera Rubin?
Vera Rubin is NVIDIA’s next-generation AI compute platform — the architecture that follows Blackwell (the GB200/GB300 generation currently powering most frontier AI training and inference). Named after the astronomer who provided early evidence for dark matter, Rubin isn’t a single chip but a tightly co-designed platform of six chips engineered to work as one “AI factory.”
The strategic goal is efficiency. Training and serving today’s largest models is brutally expensive, and the single biggest cost in production AI is inference — actually running the model for users. Rubin is NVIDIA’s answer to that cost curve.
The headline numbers — and what they mean
Two figures from NVIDIA define why Rubin matters:
- Up to a 10× reduction in inference token cost versus Blackwell. Inference cost is what determines the price of an AI API call. A 10× efficiency gain is the kind of step-change that lets providers slash prices, raise rate limits, or ship far more capable models at the same cost.
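- A 4× reduction in the number of GPUs needed to train Mixture-of-Experts (MoE) models. MoE is the dominant design among frontier models whose architectures are public, including the leading open Chinese models, and is widely assumed for closed models such as GPT and Claude, whose vendors do not publish full architecture details. Cutting the GPU count 4× directly lowers the barrier to training frontier-scale models.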
As always with vendor benchmarks, treat these as NVIDIA’s best-case figures until independent labs verify them. But even a fraction of the claimed gains reshapes the economics of AI. Hardware like this is a major reason your AI tools keep getting cheaper and faster.
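To make "even a fraction of the claimed gains" concrete, the arithmetic can be sketched as follows. Only the 10× and 4× factors come from NVIDIA's claims above. The baseline price, the baseline GPU count, and the `realized_fraction` parameter (with its linear interpolation between no gain and the full claim) are hypothetical choices made for illustration.

```python
import math


def effective_factor(claimed_factor: float, realized_fraction: float = 1.0) -> float:
    """Improvement factor when only part of a claimed gain materializes.

    Assumption: interpolate linearly between no gain (1x) and the full claim.
    """
    if claimed_factor < 1.0 or not 0.0 <= realized_fraction <= 1.0:
        raise ValueError("expected claimed_factor >= 1 and 0 <= realized_fraction <= 1")
    return 1.0 + (claimed_factor - 1.0) * realized_fraction


def token_cost(baseline_cost: float, claimed_factor: float,
               realized_fraction: float = 1.0) -> float:
    """Cost per million tokens after the (possibly partial) improvement."""
    return baseline_cost / effective_factor(claimed_factor, realized_fraction)


def gpus_needed(baseline_gpus: int, claimed_factor: float,
                realized_fraction: float = 1.0) -> int:
    """GPU count after the improvement, rounded up to whole GPUs."""
    return math.ceil(baseline_gpus / effective_factor(claimed_factor, realized_fraction))


# Hypothetical baselines, for illustration only (not vendor or provider data).
BASELINE_COST_PER_M_TOKENS = 2.00   # dollars per million tokens on Blackwell
BASELINE_MOE_TRAINING_GPUS = 1024   # GPUs for some MoE training run

for fraction in (1.0, 0.5, 0.25):
    cost = token_cost(BASELINE_COST_PER_M_TOKENS, 10.0, fraction)
    gpus = gpus_needed(BASELINE_MOE_TRAINING_GPUS, 4.0, fraction)
    print(f"{fraction:.0%} of claim realized: ${cost:.2f} per 1M tokens, {gpus} GPUs")
```

Under these made-up baselines, the full claim takes the token cost from $2.00 to $0.20, and even a quarter of the claimed gain still brings it to roughly $0.62.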
The six chips that make up the platform
Rubin’s efficiency comes from co-designing the whole rack, not just the GPU. The platform spans six chips:
- Vera CPU — 88 custom “Olympus” cores (Armv9.2), tuned for agentic reasoning and tightly coupled to the GPUs via NVLink-C2C.
- Rubin GPU — the compute engine, with a third-generation Transformer Engine, hardware-accelerated adaptive compression, and 50 petaflops of NVFP4 inference performance.
- NVLink 6 Switch — the interconnect, at 3.6 TB/s per GPU and 260 TB/s aggregate across a single NVL72 rack.
- ConnectX-9 SuperNIC — high-speed networking integrated into the NVL72 design.
- BlueField-4 DPU — powers AI-native storage and efficient key-value (KV) cache reuse, which directly speeds up long-context inference.
- Spectrum-6 Ethernet Switch — built on 200G SerDes with co-packaged optics for scale-out AI factories.
The flagship system, the Vera Rubin NVL72, combines 72 Rubin GPUs and 36 Vera CPUs into one rack — and NVIDIA says it’s up to 18× faster to assemble and service than Blackwell, which matters enormously at data-center scale.
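The per-GPU and per-rack figures above can be cross-checked with simple multiplication. The sketch below uses only numbers quoted in this article. The derived totals are naive peak figures (per-GPU peak times GPU count), so they are an upper bound under that assumption rather than a measured result.

```python
# Figures as quoted in the article (NVIDIA's claims).
GPUS_PER_RACK = 72
CPUS_PER_RACK = 36
NVLINK_TB_PER_S_PER_GPU = 3.6
NVFP4_PETAFLOPS_PER_GPU = 50
QUOTED_AGGREGATE_NVLINK_TB_PER_S = 260


def rack_totals() -> dict[str, float]:
    """Derive rack-level totals by multiplying per-GPU peaks by the GPU count."""
    return {
        "aggregate_nvlink_tb_per_s": GPUS_PER_RACK * NVLINK_TB_PER_S_PER_GPU,
        "nvfp4_exaflops": GPUS_PER_RACK * NVFP4_PETAFLOPS_PER_GPU / 1000,
        "gpus_per_cpu": GPUS_PER_RACK / CPUS_PER_RACK,
    }


totals = rack_totals()
print(f"NVLink aggregate: {totals['aggregate_nvlink_tb_per_s']:.1f} TB/s "
      f"(quoted as {QUOTED_AGGREGATE_NVLINK_TB_PER_S} TB/s)")
print(f"Peak NVFP4 inference per rack: {totals['nvfp4_exaflops']:.1f} exaflops")
print(f"GPUs per Vera CPU: {totals['gpus_per_cpu']:.0f}")
```

The 3.6 TB/s per-GPU figure times 72 GPUs gives about 259 TB/s, which is consistent with the quoted 260 TB/s aggregate after rounding.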
Rubin CPX: a GPU built for million-token context
Alongside the standard platform, NVIDIA unveiled a genuinely new category: the Rubin CPX, a GPU “purpose-built for massive-context processing.” This is the chip aimed squarely at the long-context era — the million-token software-coding and generative-video workloads that today’s models increasingly demand.
Each Rubin CPX carries 128GB of GDDR7 and up to 30 petaflops of NVFP4 compute, and uniquely integrates video encode/decode hardware alongside long-context inference on one chip. At rack scale, the Vera Rubin NVL144 CPX delivers a claimed 8 exaflops of AI compute and 100TB of fast memory, which NVIDIA says is 7.5× more AI performance than a GB300 NVL72 system, with 3× faster attention. It’s expected at the end of 2026.
For anyone tracking why context windows keep ballooning (the 1M-token windows in models like DeepSeek and the latest frontier models), Rubin CPX is the hardware NVIDIA is positioning to make million-token inference economically viable.
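A rough sizing exercise shows why million-token context is a memory problem as much as a compute one. The sketch below uses the standard key-value (KV) cache estimate for a decoder-only transformer. The model shape (`N_LAYERS`, `N_KV_HEADS`, `HEAD_DIM`) is a hypothetical configuration, not any specific model, and treating 128GB as 128 × 10^9 bytes is also an assumption.

```python
def kv_cache_bytes(n_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_element: int) -> int:
    """KV cache size for one sequence.

    The factor of 2 counts one key tensor and one value tensor per layer.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_element * n_tokens


# Hypothetical model shape, for illustration only.
N_LAYERS = 64
N_KV_HEADS = 8      # grouped-query attention: fewer KV heads than query heads
HEAD_DIM = 128
CONTEXT_TOKENS = 1_000_000

# Assumption: 128GB means 128 * 10^9 bytes.
RUBIN_CPX_MEMORY_BYTES = 128 * 10**9

for label, bytes_per_element in (("16-bit", 2), ("8-bit", 1)):
    size = kv_cache_bytes(CONTEXT_TOKENS, N_LAYERS, N_KV_HEADS, HEAD_DIM, bytes_per_element)
    ratio = size / RUBIN_CPX_MEMORY_BYTES
    print(f"{label} KV cache for 1M tokens: {size / 10**9:.1f} GB "
          f"({ratio:.2f}x one Rubin CPX's 128GB)")
```

With this made-up shape, a single million-token sequence needs on the order of 130 GB of KV cache at 8 bits per element, before counting model weights. That is the scale of problem a dedicated long-context GPU and KV cache reuse are meant to address.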
When can you actually use it?
Rubin is a data-center platform, so you won’t buy one — but you’ll feel it through the services you use:
- Cloud instances arrive in the second half of 2026. Among the first providers: AWS, Google Cloud, Microsoft Azure, and OCI, plus NVIDIA Cloud Partners CoreWeave, Lambda, Nebius, and Nscale. If you rent GPUs, watch our roundup of the best cloud GPU providers for AI for when Rubin instances list.
- Rubin CPX lands at the end of 2026 for long-context and video workloads.
- The local angle: at Computex, NVIDIA also laid out a roadmap bringing the architecture toward local AI desktops and laptops — its RTX/DGX Spark line, with a Rubin-based generation (using LPDDR6 memory) followed by future “Rosa” and “Feynman” designs. So the technology that starts in the data center is on a path to the desk, much like today’s personal AI computers.
Rubin vs Blackwell
| Dimension | Vera Rubin (next-gen) | Blackwell (current) |
|---|---|---|
| Flagship system | Vera Rubin NVL72 | GB300 NVL72 |
| Inference token cost | Up to 10× lower | Baseline |
| GPUs to train an MoE | 4× fewer | Baseline |
| Assembly / servicing | Up to 18× faster | Baseline |
| Long-context chip | Rubin CPX (128GB GDDR7, million-token context) | None |
| Status | Full production; cloud H2 2026 | Shipping now |
Why it matters — even if you never touch one
It’s tempting to file data-center GPUs under “not my problem.” But Rubin affects everyone who uses AI:
- Cheaper, more capable AI tools. A 10× inference efficiency gain is what lets providers keep cutting API prices and raising limits. The relentless drop in the cost of using models like Claude and GPT is downstream of exactly this kind of hardware leap.
- Longer context, for real. Rubin CPX is designed to make million-token inference economical once it ships at the end of 2026, which should let frontier models keep extending their context windows.
- The squeeze on consumer GPUs. The flip side: insatiable demand for AI accelerators (and the memory they consume) is part of why consumer graphics cards are scarce and pricey in 2026. If you’re building a local AI rig, see our best GPUs for local LLMs guide.
- The local trickle-down. What ships in an NVL72 rack today defines what lands in a desktop AI box in a couple of years.
FAQ
What is NVIDIA Vera Rubin?
Vera Rubin is NVIDIA’s next-generation AI platform and the successor to Blackwell, announced in full production at Computex 2026. It’s a co-designed six-chip platform (Vera CPU, Rubin GPU, NVLink 6, ConnectX-9, BlueField-4, Spectrum-6) built to dramatically lower the cost of training and running AI models.
How much faster is Rubin than Blackwell?
NVIDIA states the gains mainly in cost and efficiency terms rather than raw speed. According to its own figures, Rubin delivers up to a 10× reduction in inference token cost and needs 4× fewer GPUs to train Mixture-of-Experts models compared with Blackwell. Its flagship NVL72 system is also up to 18× faster to assemble and service. These are vendor figures that have not yet been independently verified.
What is the Rubin CPX?
Rubin CPX is a new class of NVIDIA GPU purpose-built for massive-context inference — think million-token coding and generative video. Each has 128GB of GDDR7 and up to 30 petaflops of NVFP4 compute, with integrated video encode/decode. It’s expected at the end of 2026.
When will NVIDIA Rubin be available?
Rubin is in full production now, with cloud instances expected in the second half of 2026 from providers including AWS, Google Cloud, Microsoft Azure, OCI, CoreWeave, Lambda, Nebius, and Nscale. Rubin CPX arrives at the end of 2026.
Can I buy a Rubin GPU for my PC?
No — Rubin is a data-center platform you’ll access through cloud providers, not a consumer card. However, NVIDIA outlined a roadmap bringing the architecture to local AI desktops and laptops (its RTX/DGX Spark line) over the next few generations.
What does Rubin mean for AI prices?
Lower inference cost is the main lever behind falling AI API prices and rising usage limits. If NVIDIA’s efficiency claims hold up, Rubin should help make the AI tools you use cheaper, faster, and capable of handling much longer inputs.
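How much of a hardware saving reaches the price you pay depends on how much of that price is inference compute in the first place. A minimal sketch, assuming a price splits into a compute share that shrinks by the hardware factor and a remainder (staff, margin, networking, and so on) that does not change. The `compute_share` values are hypothetical, and only the 10× factor comes from NVIDIA's claim.

```python
def price_after_hardware_gain(price: float, compute_share: float,
                              hardware_factor: float) -> float:
    """New price when only the compute share of the price gets cheaper.

    compute_share: fraction of today's price attributable to inference compute.
    hardware_factor: how many times cheaper that compute becomes (e.g. 10.0).
    """
    if not 0.0 <= compute_share <= 1.0 or hardware_factor <= 0.0:
        raise ValueError("expected 0 <= compute_share <= 1 and hardware_factor > 0")
    return price * ((1.0 - compute_share) + compute_share / hardware_factor)


# Hypothetical compute shares, for illustration only.
for share in (0.3, 0.5, 0.8):
    new_price = price_after_hardware_gain(1.00, share, 10.0)
    print(f"compute share {share:.0%}: price falls to {new_price:.0%} of today's, "
          f"about {1.0 / new_price:.1f}x cheaper")
```

Under these assumptions, a 10× hardware gain only produces a 10× price cut if compute is the entire price. At a 50% compute share the price falls to 55% of today's level, a bit under 2× cheaper.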
Bottom line
Vera Rubin is the clearest signal yet of where AI is heading: not just smarter models, but radically cheaper ones to run. By co-designing an entire six-chip platform around inference efficiency — and adding a dedicated million-token chip in the Rubin CPX — NVIDIA is attacking the single biggest cost in production AI. The claimed 10× inference saving won’t all reach your bill, and the vendor numbers deserve independent scrutiny. But the direction is unmistakable: the hardware that makes AI expensive today is being replaced by hardware that makes it cheap tomorrow — and that’s why your AI tools will keep getting better and more affordable through 2026 and beyond.
