NVIDIA’s H100 defined the generative-AI boom. Its successor, the H200, looks almost identical on a compute spec sheet — because it is. The H200 uses the same Hopper GPU as the H100. What changed is the memory: more of it, and much faster.
For AI teams the question is precise: when does more memory bandwidth beat more raw FLOPS? With these two cards, it often does.
Key takeaways
- The H100 and H200 share the same Hopper compute — identical FP16/FP8 TFLOPS.
- The H200 upgrades memory to 141 GB HBM3e at 4.8 TB/s, versus the H100’s 80 GB HBM3 at 3.35 TB/s.
- For large-model inference, the H200 is up to ~1.6–1.9x faster, purely from memory.
- For compute-bound training, the two are much closer; the H200’s edge shrinks to ~10–20%.
- If you serve large LLMs, the H200 is the clear pick. If you are training-bound on smaller models, the H100 is still excellent value.
At a glance
| Specification | NVIDIA H200 | NVIDIA H100 |
|---|---|---|
| Architecture | Hopper GH100 | Hopper GH100 |
| VRAM | 141 GB HBM3e | 80 GB HBM3 |
| Memory bandwidth | 4.8 TB/s | 3.35 TB/s |
| FP16 Tensor | ~990 TFLOPS | ~990 TFLOPS |
| FP8 Tensor | ~1,979 TFLOPS | ~1,979 TFLOPS |
| TDP (SXM) | 700 W | 700 W |
| Relative price | Higher | Lower |
Same engine, bigger fuel tank
The most important thing to understand: the H200 does not compute faster than the H100. Their tensor cores are identical, so peak FP16 and FP8 throughput match exactly. NVIDIA changed only the memory subsystem — swapping HBM3 for HBM3e, raising capacity from 80 GB to 141 GB and bandwidth from 3.35 to 4.8 TB/s.
That sounds narrow. It is not. Modern LLM serving is overwhelmingly memory-bound: the GPU spends its time moving weights and KV-cache, not saturating its math units. Give that workload 43% more bandwidth and you get most of that speedup directly.
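To make "memory-bound" concrete, here is a minimal roofline-style sketch using only the spec figures quoted in this article. The 13B FP16 model size is a hypothetical example, and the decode ceiling assumes the simplest possible case (batch size 1, every token streams all weights once, nothing else limits throughput), so treat the output as an upper bound rather than a benchmark.

```python
# Back-of-the-envelope roofline numbers from the figures quoted in this article.
# The model size below is a hypothetical example, not a measured workload.

PEAK_FP16_FLOPS = 990e12                      # ~990 TFLOPS on both cards
BANDWIDTH_BYTES_PER_SEC = {"H100": 3.35e12, "H200": 4.8e12}


def ridge_point(peak_flops: float, bandwidth: float) -> float:
    """FLOPs a kernel must perform per byte moved before compute is the limit."""
    return peak_flops / bandwidth


def decode_ceiling_tokens_per_sec(weight_bytes: float, bandwidth: float) -> float:
    """Upper bound for batch-1 decoding, where each token streams all weights once."""
    return bandwidth / weight_bytes


hypothetical_weight_bytes = 13e9 * 2          # assumption: 13B parameters in FP16

for card, bw in BANDWIDTH_BYTES_PER_SEC.items():
    ridge = ridge_point(PEAK_FP16_FLOPS, bw)
    ceiling = decode_ceiling_tokens_per_sec(hypothetical_weight_bytes, bw)
    print(f"{card}: ridge {ridge:.0f} FLOPs/byte, decode ceiling {ceiling:.0f} tokens/s")
```

Batch-1 decoding performs roughly 2 FLOPs per parameter per token, which is about 1 FLOP per byte in FP16. That is far below either card's ridge point, so in this simple model the ceiling scales with bandwidth alone and the ratio between the two cards equals 4.8 / 3.35.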
Inference: where the H200 dominates
For serving large language models, the H200’s memory changes the economics:
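- Capacity. A 70B model in FP16 needs ~140 GB for its weights alone. It does not fit on one 80 GB H100; you need two, with the overhead of tensor parallelism. It fits on a single H200, eliminating cross-GPU communication entirely, although that leaves only about 1 GB for KV-cache, so practical single-GPU serving generally calls for quantized weights (for example FP8 at ~70 GB).
- Throughput. Even when a model fits on both, the H200’s bandwidth lifts token generation, by up to roughly 1.6–1.9x for large models and long contexts where the extra capacity also allows bigger batches. For models and KV-caches that sit comfortably within 80 GB, the gain is much smaller.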
- KV-cache headroom. The extra 61 GB lets you serve far more concurrent users or much longer context windows before running out of memory.
For inference-heavy deployments — chat APIs, RAG backends, agentic systems — the H200 is not a marginal upgrade. It changes how many GPUs you need.
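The capacity and KV-cache points above can be sketched as a small sizing calculation. The model shape used here (80 layers, 8 KV heads, head dimension 128, FP16 cache) is an assumption modelled on a Llama-2-70B-style architecture, and the function names are illustrative. The estimate ignores activations, framework overhead, and fragmentation, so real headroom will be lower.

```python
# Rough memory sizing for LLM serving. All model-shape values are assumptions.

GB = 10**9  # decimal gigabytes, matching the spec-sheet figures


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Bytes of KV-cache one token occupies: keys and values for every layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def kv_token_budget(capacity_gb: int, weights_gb: int, per_token_bytes: int) -> int:
    """How many cached tokens fit after the weights are loaded (0 if none)."""
    free_bytes = (capacity_gb - weights_gb) * GB
    return max(0, free_bytes // per_token_bytes)


per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)

for weights_label, weights_gb in (("FP16 weights", 140), ("FP8 weights", 70)):
    for card, capacity_gb in (("H100", 80), ("H200", 141)):
        tokens = kv_token_budget(capacity_gb, weights_gb, per_token)
        print(f"70B, {weights_label}, {card}: room for {tokens:,} cached tokens")
```

The token budget is shared across all concurrent requests, so dividing it by the context length gives a first guess at how many users one GPU can serve at once.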
Training: a narrower gap
For pre-training and fine-tuning, compute matters more, and here the two cards converge. When a training job is FP8 or FP16 compute-bound, the H200’s identical tensor cores cap its advantage. The memory still helps (larger batch sizes, fewer gradient-accumulation steps, room for bigger optimizer states), but the end-to-end speedup typically lands in the 10–20% range rather than the 60–90% seen in inference.
If your bottleneck is training throughput on models that already fit comfortably in 80 GB, the H100 delivers nearly the same result for less money.
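One way to see why the training gap is narrower is an Amdahl's-law-style estimate: only the memory-bound share of a training step benefits from the faster HBM3e. The sketch below is a simplification under that assumption. The memory-bound fractions are hypothetical inputs you would measure with a profiler, and the model ignores any gain from the larger batch sizes the extra capacity allows.

```python
# Amdahl-style estimate: speed up only the memory-bound part of a step.

BANDWIDTH_RATIO = 4.8 / 3.35   # H200 vs H100 memory bandwidth, from the spec table


def end_to_end_speedup(memory_bound_fraction: float,
                       memory_speedup: float = BANDWIDTH_RATIO) -> float:
    """Overall speedup when only a fraction of step time scales with bandwidth."""
    if not 0.0 <= memory_bound_fraction <= 1.0:
        raise ValueError("memory_bound_fraction must be between 0 and 1")
    compute_part = 1.0 - memory_bound_fraction
    memory_part = memory_bound_fraction / memory_speedup
    return 1.0 / (compute_part + memory_part)


# Hypothetical fractions of step time spent waiting on memory.
for fraction in (0.0, 0.3, 0.5, 1.0):
    print(f"{fraction:.0%} memory-bound: {end_to_end_speedup(fraction):.2f}x")
```

Under this model, a job that is fully compute-bound sees no gain, and the speedup can never exceed the bandwidth ratio itself.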
Choose the H200 if
- You serve large LLMs (70B+) and want them on a single GPU
- Your workload is inference-heavy and memory-bound
- You need long context windows or high concurrency
Choose the H100 if
- Your jobs are compute-bound training on models that fit in 80 GB
- You can buy or rent it at a meaningful discount
- You scale horizontally and already run multi-GPU clusters
The cloud-rental angle
Most teams never buy either card — they rent. On cloud GPU marketplaces the H200 commands a premium over the H100. The right question is therefore cost-per-token, not cost-per-hour. For large-model inference, the H200’s higher throughput often makes it cheaper per token despite the higher hourly rate. For smaller models or training, the H100’s lower rate usually wins. Benchmark your actual workload before committing.
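The cost-per-token comparison is simple arithmetic, sketched below. The throughput values are the MLPerf v4.0 Llama 2 70B figures quoted in this article and the H200 rate is the rough $3/hour mentioned here; the H100 hourly rate is a placeholder assumption, so substitute real quotes and your own measured throughput.

```python
# Cost per token from an hourly rental rate and a measured throughput.


def cost_per_million_tokens(hourly_rate_usd: float, tokens_per_sec: float) -> float:
    """USD to generate one million tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_rate_usd / tokens_per_hour * 1_000_000


def break_even_hourly_rate(baseline_rate_usd: float, baseline_tps: float,
                           candidate_tps: float) -> float:
    """Hourly rate at which the candidate GPU matches the baseline's cost per token."""
    return baseline_rate_usd * candidate_tps / baseline_tps


H100_RATE_USD = 2.50   # placeholder assumption, not a quoted price
H200_RATE_USD = 3.00   # rough figure mentioned in this article
H100_TPS = 22_300      # MLPerf v4.0 Llama 2 70B offline, as quoted here
H200_TPS = 31_700

print(f"H100: ${cost_per_million_tokens(H100_RATE_USD, H100_TPS):.4f} per 1M tokens")
print(f"H200: ${cost_per_million_tokens(H200_RATE_USD, H200_TPS):.4f} per 1M tokens")
print(f"H200 breaks even at ${break_even_hourly_rate(H100_RATE_USD, H100_TPS, H200_TPS):.2f}/hour")
```

The break-even rate is the useful number: the H200 is cheaper per token whenever its hourly premium is smaller than its throughput advantage on your workload.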
By the numbers: the H200’s throughput lead
The H100 and H200 use the same GH100 die, so their raw compute (FLOPS) is identical. Every bit of the H200’s advantage comes from the memory subsystem: 141 GB of HBM3e at ~4.8 TB/s versus the H100’s 80 GB of HBM3 at 3.35 TB/s — about 76% more capacity and 43% more bandwidth.
That translates into a real but workload-dependent lead. In MLPerf v4.0, the H200 posted roughly 42% higher throughput on Llama 2 70B (offline) — about 31,700 tokens/sec versus the H100’s 22,300 — and at maximum single-GPU throughput it can reach up to 1.9× the H100 on Llama 70B. The catch: for any model and KV cache that already fits comfortably inside 80 GB, the gain shrinks to just 0–11%, because at that point compute (which is identical) becomes the bottleneck, not memory.
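The percentages in this section follow directly from the quoted figures, as this short check shows. It uses only numbers stated above; the helper name is illustrative.

```python
# Reproduce the percentage gains from the figures quoted in this section.


def percent_gain(new: float, old: float) -> float:
    """Relative improvement of new over old, in percent."""
    return (new / old - 1.0) * 100.0


capacity_gain = percent_gain(141, 80)          # GB of HBM
bandwidth_gain = percent_gain(4.8, 3.35)       # TB/s
mlperf_gain = percent_gain(31_700, 22_300)     # tokens/sec, Llama 2 70B offline

print(f"capacity +{capacity_gain:.0f}%, bandwidth +{bandwidth_gain:.0f}%, "
      f"MLPerf throughput +{mlperf_gain:.0f}%")
```

The measured MLPerf gain lands almost exactly on the bandwidth gain, which is what you would expect from a workload limited by memory bandwidth rather than compute.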
Should you wait for Blackwell?
Any H100-versus-H200 decision in 2026 has a third option lurking behind it: NVIDIA’s Blackwell B200. Unlike the H200, the B200 is a genuinely new architecture, not a memory refresh of Hopper. It moves to roughly 192 GB of HBM3e at around 8 TB/s and, critically, adds native FP4 support that Hopper lacks entirely. For low-precision inference, that combination pushes per-GPU throughput to roughly 2–2.5x an H200 on large models, and cost-per-token can fall further still once FP4 serving is dialed in.
So why would anyone still buy Hopper? Three reasons:
- Power and density. The B200 draws about 1,000 W versus 700 W for both Hopper cards. That changes rack power budgets, cooling, and often forces liquid cooling — a real obstacle for existing air-cooled data centers and most colocation setups.
- Price and availability. B200 cloud rates sit at a launch premium (commonly $4–6+/GPU-hour) against roughly $3/hour for an H200, and supply is tighter. Hopper inventory is mature and easy to rent today.
- Software maturity. Hopper’s FP8 and CUDA tooling are battle-tested across every major inference and training framework. FP4 is newer, and squeezing the B200’s headline numbers out of it takes engineering effort.
A useful rule of thumb: if your workload is FP4-friendly, runs at high volume, and you can power it, Blackwell wins on cost-per-token. If you need capacity now, run a mature FP8/FP16 stack, or can’t accommodate 1,000 W per accelerator, the H200 remains the pragmatic choice — and the H100 the budget one. The H200 also slots neatly into existing HGX H100 systems, making it the lowest-friction upgrade for teams already on Hopper. Blackwell is the bigger leap, but the H200 is the one you can deploy this afternoon without re-architecting your facility.
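The rule of thumb above can be put into rough numbers using the ranges quoted in this section. This is a sketch, not a forecast: the rates and speedups are the article's approximate figures, the 40 kW rack budget is a hypothetical value, and the power calculation counts GPU draw only, ignoring CPUs, networking, and cooling overhead.

```python
# Relative cost per token and GPU count within a power budget.


def relative_cost_per_token(new_rate_usd: float, old_rate_usd: float,
                            speedup: float) -> float:
    """Cost per token of the new GPU as a multiple of the old one (below 1 is cheaper)."""
    return (new_rate_usd / old_rate_usd) / speedup


def gpus_within_power_budget(budget_watts: int, gpu_watts: int) -> int:
    """Whole GPUs that fit in a power budget, counting GPU draw only."""
    return budget_watts // gpu_watts


H200_RATE_USD = 3.0                      # rough rate quoted in this section

best_case = relative_cost_per_token(4.0, H200_RATE_USD, speedup=2.5)
worst_case = relative_cost_per_token(6.0, H200_RATE_USD, speedup=2.0)
print(f"B200 cost per token vs H200: {best_case:.2f}x to {worst_case:.2f}x")

RACK_BUDGET_WATTS = 40_000               # hypothetical rack budget
print("Hopper GPUs per rack:", gpus_within_power_budget(RACK_BUDGET_WATTS, 700))
print("B200 GPUs per rack:", gpus_within_power_budget(RACK_BUDGET_WATTS, 1000))
```

At the expensive end of the quoted range with the lower speedup, the B200 only matches the H200 on cost per token, which is why the answer depends on the rate you can actually get and on how well your stack exploits FP4.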
Frequently asked questions
Is the H200 faster than the H100?
For memory-bound work like large-LLM inference, yes — up to ~1.9x faster. For compute-bound training, barely — the two share identical tensor cores, so the H200’s lead shrinks to 10–20%.
Why is the H200 faster if it has the same compute?
Because most LLM serving is limited by memory bandwidth, not math. The H200’s HBM3e delivers 4.8 TB/s versus the H100’s 3.35 TB/s, and that 43% bandwidth gain translates almost directly into faster token generation.
Can the H200 run a 70B model on a single GPU?
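Yes, but only just. With 141 GB of HBM3e, the weights of a 70B model in FP16 (~140 GB) fit on one H200, leaving about 1 GB for KV-cache, so practical single-GPU serving generally uses quantized weights such as FP8 (~70 GB). The 80 GB H100 cannot hold the FP16 model alone and needs a two-GPU setup.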
Is the H100 still worth using in 2026?
Absolutely. The H100 remains a top-tier training GPU. It is the better value for compute-bound jobs and for workloads that fit within 80 GB. It is only outclassed when memory capacity or bandwidth is the bottleneck.
How much faster is the H200 than the H100 for Llama 70B?
About 42% more throughput in MLPerf v4.0 offline mode (~31,700 vs ~22,300 tokens/sec), and up to 1.9× at maximum single-GPU throughput. The advantage is largest for big-batch and long-context inference that pushes past the H100’s memory limits.
Does the H200 have more compute than the H100?
No. Both are built on the same GH100 die with identical FLOPS. The entire upgrade is memory — more capacity (141 GB vs 80 GB) and more bandwidth (4.8 vs 3.35 TB/s). If your workload isn’t memory-bound, the two perform almost the same.
When is the H100 still the better buy?
When your model plus KV cache fits inside 80 GB. There the H200’s lead drops to 0–11%, so the cheaper and more widely available H100 usually wins on price-per-performance.
Is the H200 more power-efficient than the H100?
Yes. Both cards share the same 700 W TDP, but the H200 does more work inside that envelope. For large-LLM inference NVIDIA cites up to roughly 50% lower energy per inference, and at a matched power budget the H200 generates more tokens per second than the H100. Same watts, more output — which is why it lowers total cost of ownership for inference-heavy fleets.
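A simple way to reason about "same watts, more output" is energy per token. The sketch below assumes both cards draw their full 700 W TDP during inference, which real deployments may not, and uses the MLPerf v4.0 Llama 2 70B throughput figures quoted in this article. It is an illustration of the arithmetic, not a reproduction of NVIDIA's energy figure, which may rest on different workloads.

```python
# Energy per generated token at a fixed power draw.


def joules_per_token(power_watts: float, tokens_per_sec: float) -> float:
    """Energy spent per token when drawing a constant power."""
    return power_watts / tokens_per_sec


def energy_reduction(old_tps: float, new_tps: float) -> float:
    """Fractional drop in energy per token at equal power draw."""
    return 1.0 - old_tps / new_tps


TDP_WATTS = 700        # assumption: both cards run at full TDP
H100_TPS = 22_300      # figures quoted in this article
H200_TPS = 31_700

print(f"H100: {joules_per_token(TDP_WATTS, H100_TPS) * 1000:.1f} mJ/token")
print(f"H200: {joules_per_token(TDP_WATTS, H200_TPS) * 1000:.1f} mJ/token")
print(f"Reduction: {energy_reduction(H100_TPS, H200_TPS):.0%}")
```

At equal power draw, the energy saving per token depends only on the throughput ratio, so workloads with a larger H200 speedup see a proportionally larger saving.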
How does the B200 compare to the H200 for inference?
The B200 is a generational step up: about 192 GB of HBM3e, roughly 8 TB/s of bandwidth, and native FP4 that Hopper lacks. On large models that pushes per-GPU throughput to around 2–2.5x an H200, with materially lower cost-per-token in FP4 serving. The trade-offs are a higher ~1,000 W draw, a launch price premium, and a less mature low-precision software stack.
Can I drop an H200 into an existing H100 server?
Generally yes. The H200 SXM uses the same Hopper architecture and the same 700 W envelope, so it is designed to slot into existing HGX H100 baseboards and systems with minimal disruption. That backward compatibility is a major reason teams already standardized on Hopper choose the H200 over jumping straight to Blackwell, which typically requires new chassis and often liquid cooling.
Verdict
The H200 is the same Hopper chip with a transformative memory upgrade, and for the inference workloads that dominate AI spending in 2026, that upgrade is decisive. Single-GPU 70B serving, longer contexts, higher concurrency: the H200 enables all of it. The H100 is far from obsolete; for compute-bound training and any job that fits in 80 GB, it remains an excellent and more affordable choice. Match the card to your bottleneck: bandwidth or FLOPS.
