NVIDIA Hopper GH100 GPU Unveiled: The World’s First & Fastest 4nm Data Center Chip, Up To 4000 TFLOPs Compute, HBM3 3 TB/s Memory

Hassan Mujtaba

NVIDIA has officially unveiled its next-generation data center powerhouse, the Hopper GH100 GPU, featuring a brand new 4nm process node. The GPU is an absolute monster with 80 Billion transistors and offers the fastest AI & Compute horsepower of any GPU on the market.

NVIDIA Hopper GH100 GPU Official: First 4nm & HBM3 Equipped Data Center Chip, 80 Billion Transistors, Fastest AI/Compute Product On The Planet With Up To 4000 TFLOPs of Horsepower

Based on the Hopper architecture, the GH100 GPU is an engineering marvel that's produced on the bleeding-edge TSMC 4nm process node. Just like the data center GPUs that came before it, the Hopper GH100 will be targeted at various workloads including Artificial Intelligence (AI), Machine Learning (ML), Deep Neural Networks (DNN), and various HPC-focused compute workloads.


The GPU is a one-stop solution for all HPC requirements, and it's a monster of a chip when you look at its size and performance figures.

The new Streaming Multiprocessor (SM) has many performance and efficiency improvements. Key new features include:

  • New fourth-generation Tensor Cores are up to 6x faster chip-to-chip compared to A100, including per-SM speedup, additional SM count, and higher clocks of H100. On a per SM basis, the Tensor Cores deliver 2x the MMA (Matrix Multiply-Accumulate) computational rates of the A100 SM on equivalent data types, and 4x the rate of A100 using the new FP8 data type, compared to the previous generation 16-bit floating-point options. The Sparsity feature exploits fine-grained structured sparsity in deep learning networks, doubling the performance of standard Tensor Core operations.
  • New DPX Instructions accelerate Dynamic Programming algorithms by up to 7x over the A100 GPU. Two examples include the Smith-Waterman algorithm for genomics processing, and the Floyd-Warshall algorithm used to find optimal routes for a fleet of robots through a dynamic warehouse environment (a short Floyd-Warshall sketch follows this list).
  • 3x faster IEEE FP64 and FP32 processing rates chip-to-chip compared to A100, due to 2x faster clock-for-clock performance per SM, plus additional SM counts and higher clocks of H100.
  • New Thread Block Cluster feature allows programmatic control of locality at a granularity larger than a single Thread Block on a single SM. This extends the CUDA programming model by adding another level to the programming hierarchy to now include Threads, Thread Blocks, Thread Block Clusters, and Grids. Clusters enable multiple Thread Blocks running concurrently across multiple SMs to synchronize and collaboratively fetch and exchange data.
    ○ New Asynchronous Execution features include a new Tensor Memory Accelerator (TMA) unit that can transfer large blocks of data very efficiently between global memory and shared memory. TMA also supports asynchronous copies between Thread Blocks in a Cluster. There is also a new Asynchronous Transaction Barrier for doing atomic data movement and synchronization.
  • New Transformer Engine uses a combination of software and custom Hopper Tensor Core technology designed specifically to accelerate Transformer model training and inference. The Transformer Engine intelligently manages and dynamically chooses between FP8 and 16-bit calculations, automatically handling re-casting and scaling between FP8 and 16-bit in each layer to deliver up to 9x faster AI training and up to 30x faster AI inference speedups on large language models compared to the prior generation A100.
  • HBM3 memory subsystem provides nearly a 2x bandwidth increase over the previous generation. The H100 SXM5 GPU is the world’s first GPU with HBM3 memory delivering a class-leading 3 TB/sec of memory bandwidth.
  • 50 MB L2 cache architecture caches large portions of models and datasets for repeated access, reducing trips to HBM3.
  • New second-generation Multi-Instance GPU (MIG) technology now provides Confidential Computing capability with MIG-level Trusted Execution Environments (TEE) for the first time. Up to seven individual GPU Instances are supported, each with dedicated NVDEC and NVJPG units. Each Instance now includes its own set of performance monitors that work with NVIDIA developer tools.
  • New Confidential Computing support protects user data, defends against hardware and software attacks, and better isolates and protects VMs from each other in virtualized and MIG environments. H100 implements the world's first native Confidential Computing GPU and extends the Trusted Execution Environment with CPUs at a full PCIe line rate.
  • Fourth-generation NVIDIA NVLink® provides a 3x bandwidth increase on all-reduce operations and a 50% general bandwidth increase over the prior generation NVLink with 900 GB/sec total bandwidth for multi-GPU IO operating at 7x the bandwidth of PCIe Gen 5.
  • Third-generation NVSwitch technology includes switches residing both inside and outside of nodes to connect multiple GPUs in servers, clusters, and data center environments. Each NVSwitch inside a node provides 64 ports of fourth-generation NVLink links to accelerate multi-GPU connectivity. Total switch throughput increases to 13.6 Tbits/sec from 7.2 Tbits/sec in the prior generation. New third-generation NVSwitch technology also provides hardware acceleration for collective operations with multicast and NVIDIA SHARP in-network reductions.
  • New NVLink Switch System interconnect technology and new second-level NVLink Switches based on third-gen NVSwitch technology introduce address space isolation and protection, enabling up to 32 nodes or 256 GPUs to be connected over NVLink in a 2:1 tapered, fat tree topology. These connected nodes are capable of delivering 57.6 TB/sec of all-to-all bandwidth and can supply an incredible one exaFLOP of FP8 sparse AI compute.
  • PCIe Gen 5 provides 128 GB/sec total bandwidth (64 GB/sec in each direction) compared to 64 GB/sec total bandwidth (32 GB/sec in each direction) in Gen 4 PCIe. PCIe Gen 5 enables H100 to interface with the highest performing x86 CPUs and SmartNICs / DPUs (Data Processing Units).
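
The DPX bullet above names the Floyd-Warshall all-pairs shortest-path algorithm as one of the dynamic-programming workloads Hopper accelerates in hardware. As a point of reference, here is a minimal pure-Python sketch of the classic CPU-side algorithm; the 4-node example graph and the INF sentinel are illustrative only, not taken from NVIDIA's material.

```python
# Minimal Floyd-Warshall all-pairs shortest-path sketch (pure Python).
# This is the classic O(V^3) dynamic-programming recurrence that DPX-style
# instructions accelerate in hardware; the example graph is illustrative.
INF = float("inf")

def floyd_warshall(dist):
    """dist[i][j] = edge weight from i to j (INF if no edge); returns shortest paths."""
    n = len(dist)
    d = [row[:] for row in dist]          # work on a copy
    for k in range(n):                    # allow node k as an intermediate hop
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d

if __name__ == "__main__":
    graph = [
        [0,   3,   INF, 7],
        [8,   0,   2,   INF],
        [5,   INF, 0,   1],
        [2,   INF, INF, 0],
    ]
    for row in floyd_warshall(graph):
        print(row)
```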

So coming to the specifications, the NVIDIA Hopper GH100 GPU is composed of a massive 144 SM (Streaming Multiprocessor) chip layout spread across a total of 8 GPCs. Each GPC rocks a total of 9 TPCs, which are further composed of 2 SM units each. This gives us 18 SMs per GPC and 144 on the complete 8 GPC configuration. Each SM is composed of up to 128 FP32 units, which should give us a total of 18,432 CUDA cores. Following are some of the configurations you can expect from the H100 chip:

The full implementation of the GH100 GPU includes the following units:

  • 8 GPCs, 72 TPCs (9 TPCs/GPC), 2 SMs/TPC, 144 SMs per full GPU
  • 128 FP32 CUDA Cores per SM, 18432 FP32 CUDA Cores per full GPU
  • 4 Fourth-Generation Tensor Cores per SM, 576 per full GPU
  • 6 HBM3 or HBM2e stacks, 12 512-bit Memory Controllers
  • 60 MB L2 Cache
  • Fourth-Generation NVLink and PCIe Gen 5

The NVIDIA H100 GPU with SXM5 board form-factor includes the following units:

  • 8 GPCs, 66 TPCs, 2 SMs/TPC, 132 SMs per GPU
  • 128 FP32 CUDA Cores per SM, 16896 FP32 CUDA Cores per GPU
  • 4 Fourth-generation Tensor Cores per SM, 528 per GPU
  • 80 GB HBM3, 5 HBM3 stacks, 10 512-bit Memory Controllers
  • 50 MB L2 Cache
  • Fourth-Generation NVLink and PCIe Gen 5

The NVIDIA H100 GPU with a PCIe Gen 5 board form-factor includes the following units:

  • 7 or 8 GPCs, 57 TPCs, 2 SMs/TPC, 114 SMs per GPU
  • 128 FP32 CUDA Cores/SM, 14592 FP32 CUDA Cores per GPU
  • 4 Fourth-generation Tensor Cores per SM, 456 per GPU
  • 80 GB HBM2e, 5 HBM2e stacks, 10 512-bit Memory Controllers
  • 50 MB L2 Cache
  • Fourth-Generation NVLink and PCIe Gen 5

This is a 2.25x increase over the full GA100 GPU configuration. NVIDIA is also leveraging more FP64, FP16 & Tensor cores within its Hopper GPU, which will drive up performance immensely. And that's going to be a necessity to rival Intel's Ponte Vecchio, which is also expected to feature 1:1 FP64.
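
The SM and CUDA-core totals quoted above, and the 2.25x claim, fall straight out of the per-GPC layout. A quick sanity-check sketch of that arithmetic follows; the full GA100 figure of 128 SMs x 64 FP32 cores = 8,192 is pulled in here for the comparison and is not quoted elsewhere in the article.

```python
# Sanity-check the GH100 shader math quoted above.
def fp32_cores(sms, cores_per_sm=128):
    return sms * cores_per_sm

full_gh100_sms = 8 * 9 * 2          # 8 GPCs x 9 TPCs/GPC x 2 SMs/TPC = 144 SMs
h100_sxm5_sms  = 132                # SXM5 ships with 132 of the 144 SMs enabled
h100_pcie_sms  = 114                # PCIe variant ships with 114 SMs enabled
full_ga100_cores = 128 * 64         # full Ampere GA100: 128 SMs x 64 FP32 cores = 8,192

print(fp32_cores(full_gh100_sms))                     # 18432
print(fp32_cores(h100_sxm5_sms))                      # 16896
print(fp32_cores(h100_pcie_sms))                      # 14592
print(fp32_cores(full_gh100_sms) / full_ga100_cores)  # 2.25x over the full GA100
```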

The cache is another area where NVIDIA has given much attention, upping it to 50 MB on the Hopper H100 GPU. This is a 25% increase over the 40 MB L2 cache featured on the Ampere GA100 GPU and over 3x the size of the cache on AMD's flagship Aldebaran MCM GPU, the MI250X.

Rounding up the performance figures, NVIDIA's GH100 Hopper GPU will offer 4000 TFLOPs of FP8, 2000 TFLOPs of FP16, 1000 TFLOPs of TF32 and 60 TFLOPs of FP64 Compute performance. These record-shattering figures decimate all other HPC accelerators that came before it. For comparison, this is 3.3x faster than NVIDIA's own A100 GPU and 28% faster than AMD's Instinct MI250X in the FP64 compute. In FP16 compute, the H100 GPU is 3x faster than A100 and 5.2x faster than MI250X which is literally bonkers.

NVIDIA GH100 GPU Block Diagram:

Some key features of the new NVIDIA Hopper GH100 GPU SM (Streaming Multiprocessor) include:

  • Up to 6x faster chip-to-chip compared to A100, including per-SM speedup, additional SM count, and higher clocks of H100.
  • On a per SM basis, the Tensor Cores deliver 2x the MMA (Matrix Multiply-Accumulate) computational rates of the A100 SM on equivalent data types, and 4x the rate of A100 using the new FP8 data type, compared to the previous generation 16-bit floating-point options.
  • Sparsity feature exploits fine-grained structured sparsity in deep learning networks, doubling the performance of standard Tensor Core operations (see the pruning sketch after this list).
  • New DPX Instructions accelerate Dynamic Programming algorithms by up to 7x over the A100 GPU. Two examples include the Smith-Waterman algorithm for genomics processing, and the Floyd-Warshall algorithm used to find optimal routes for a fleet of robots through a dynamic warehouse environment.
  • 3x faster IEEE FP64 and FP32 processing rates chip-to-chip compared to A100, due to 2x faster clock-for-clock performance per SM, plus additional SM counts and higher clocks of H100.
  • 256 KB of combined shared memory and L1 data cache, 1.33x larger than A100.
  • New Asynchronous Execution features include a new Tensor Memory Accelerator (TMA) unit that can efficiently transfer large blocks of data between global memory and shared memory. TMA also supports asynchronous copies between Thread Blocks in a Cluster. There is also a new Asynchronous Transaction Barrier for doing atomic data movement and synchronization.
  • New Thread Block Cluster feature exposes control of locality across multiple SMs.
  • Distributed Shared Memory allows direct SM-to-SM communications for loads, stores, and atomics across multiple SM shared memory blocks.
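
The sparsity feature called out above refers to the 2:4 fine-grained structured pattern (at most two non-zero values in every group of four) that the sparse Tensor Cores exploit. Below is a rough NumPy sketch of how a weight tensor is typically pruned to that pattern; the magnitude-based pruning policy and the example shapes are illustrative and are not NVIDIA's tooling.

```python
import numpy as np

def prune_2_to_4(weights):
    """Zero out the two smallest-magnitude values in every group of four,
    producing the 2:4 fine-grained structured sparsity pattern that sparse
    Tensor Cores can exploit. Assumes the last dimension is divisible by 4."""
    w = weights.reshape(-1, 4).copy()
    # indices of the two smallest-magnitude entries in each group of four
    drop = np.argsort(np.abs(w), axis=1)[:, :2]
    np.put_along_axis(w, drop, 0.0, axis=1)
    return w.reshape(weights.shape)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dense = rng.standard_normal((4, 8)).astype(np.float32)
    sparse = prune_2_to_4(dense)
    print(sparse)
    print("non-zeros per group of 4:",
          np.count_nonzero(sparse.reshape(-1, 4), axis=1))  # 2 everywhere
```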

NVIDIA GH100 SM Block Diagram:

For memory, the NVIDIA Hopper GH100 GPU is equipped with the brand new HBM3 memory that operates across a 6144-bit bus interface and delivers up to 3 TB/s of bandwidth, a 50% increase over the A100's HBM2e memory subsystem. Each H100 accelerator will be equipped with 80 GB of memory, though we can expect a double-capacity configuration in the future, just like the A100 80 GB.
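
As a rough sanity check on those figures, peak HBM bandwidth is simply the bus width times the per-pin data rate. The sketch below back-computes the per-pin speed implied by 3 TB/s over a 6144-bit interface and compares it against a 2 TB/s-class A100 HBM2e setup; the pin rates used are our own back-of-envelope assumptions, not NVIDIA's published numbers.

```python
# Back-of-envelope HBM bandwidth math: bandwidth (GB/s) = bus width (bits) / 8 * pin rate (Gbps).
def bandwidth_gb_s(bus_width_bits, pin_rate_gbps):
    """Peak memory bandwidth in GB/s for a given bus width and per-pin data rate."""
    return bus_width_bits / 8 * pin_rate_gbps

def implied_pin_rate(bandwidth_gb_s_total, bus_width_bits):
    """Per-pin data rate (Gbps) implied by a quoted peak bandwidth."""
    return bandwidth_gb_s_total * 8 / bus_width_bits

# ~3 TB/s over the full 6144-bit HBM3 interface implies roughly 3.9 Gbps per pin.
print(implied_pin_rate(3000, 6144))   # ~3.9 Gbps
# A100 80 GB for comparison: 5120-bit HBM2e at roughly 3.2 Gbps per pin is ~2 TB/s.
print(bandwidth_gb_s(5120, 3.2))      # 2048 GB/s
print(3000 / 2048)                    # ~1.5x, i.e. the ~50% uplift quoted above
```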

The GPU also features PCIe Gen 5 compliance with up to 128 GB/s transfer rates and an NVLink interface that provides 900 GB/s of GPU-to-GPU interconnect bandwidth. The whole Hopper H100 chip offers an insane 4.9 TB/s of external bandwidth. All of this monster performance comes in a 700W (SXM) package. The PCIe variants will be equipped with the latest PCIe Gen 5 connectors, allowing for up to 600W of power delivery, but the actual PCIe variant operates at a TDP of 350W.
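
The 128 GB/s PCIe figure is simply an x16 Gen 5 link counted in both directions. A quick sketch of that arithmetic, ignoring the small 128b/130b encoding overhead as the rounded marketing figures do:

```python
# PCIe per-direction bandwidth: lanes x transfer rate (GT/s) / 8 bits per byte.
# (128b/130b encoding overhead, ~1.5%, is ignored here, matching the rounded figures.)
def pcie_bandwidth_gb_s(lanes, gt_per_s):
    per_direction = lanes * gt_per_s / 8
    return per_direction, 2 * per_direction   # (one direction, both directions)

print(pcie_bandwidth_gb_s(16, 32))  # PCIe Gen 5 x16: (64.0, 128.0) GB/s
print(pcie_bandwidth_gb_s(16, 16))  # PCIe Gen 4 x16: (32.0, 64.0) GB/s
```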

NVIDIA Hopper GH100 Compute

GPU | Kepler GK110 | Maxwell GM200 | Pascal GP100 | Volta GV100 | Ampere GA100 | Hopper GH100
Compute Capability | 3.5 | 5.2 | 6.0 | 7.0 | 8.0 | 9.0
Threads / Warp | 32 | 32 | 32 | 32 | 32 | 32
Max Warps / Multiprocessor | 64 | 64 | 64 | 64 | 64 | 64
Max Threads / Multiprocessor | 2048 | 2048 | 2048 | 2048 | 2048 | 2048
Max Thread Blocks / Multiprocessor | 16 | 32 | 32 | 32 | 32 | 32
Max 32-bit Registers / SM | 65536 | 65536 | 65536 | 65536 | 65536 | 65536
Max Registers / Block | 65536 | 32768 | 65536 | 65536 | 65536 | 65536
Max Registers / Thread | 255 | 255 | 255 | 255 | 255 | 255
Max Thread Block Size | 1024 | 1024 | 1024 | 1024 | 1024 | 1024
CUDA Cores / SM | 192 | 128 | 64 | 64 | 64 | 128
Shared Memory Size / SM Configurations (bytes) | 16K/32K/48K | 96K | 64K | 96K | 164K | 228K

NVIDIA HPC / AI GPUs

NVIDIA Tesla Graphics Card | NVIDIA B200 | NVIDIA H200 (SXM5) | NVIDIA H100 (SXM5) | NVIDIA H100 (PCIe) | NVIDIA A100 (SXM4) | NVIDIA A100 (PCIe4) | Tesla V100S (PCIe) | Tesla V100 (SXM2) | Tesla P100 (SXM2) | Tesla P100 (PCI-Express) | Tesla M40 (PCI-Express) | Tesla K40 (PCI-Express)
GPU | B200 | H200 (Hopper) | H100 (Hopper) | H100 (Hopper) | A100 (Ampere) | A100 (Ampere) | GV100 (Volta) | GV100 (Volta) | GP100 (Pascal) | GP100 (Pascal) | GM200 (Maxwell) | GK110 (Kepler)
Process Node | 4nm | 4nm | 4nm | 4nm | 7nm | 7nm | 12nm | 12nm | 16nm | 16nm | 28nm | 28nm
Transistors | 208 Billion | 80 Billion | 80 Billion | 80 Billion | 54.2 Billion | 54.2 Billion | 21.1 Billion | 21.1 Billion | 15.3 Billion | 15.3 Billion | 8 Billion | 7.1 Billion
GPU Die Size | TBD | 814mm2 | 814mm2 | 814mm2 | 826mm2 | 826mm2 | 815mm2 | 815mm2 | 610mm2 | 610mm2 | 601mm2 | 551mm2
SMs | 160 | 132 | 132 | 114 | 108 | 108 | 80 | 80 | 56 | 56 | 24 | 15
TPCs | 80 | 66 | 66 | 57 | 54 | 54 | 40 | 40 | 28 | 28 | 24 | 15
L2 Cache Size | TBD | 51200 KB | 51200 KB | 51200 KB | 40960 KB | 40960 KB | 6144 KB | 6144 KB | 4096 KB | 4096 KB | 3072 KB | 1536 KB
FP32 CUDA Cores Per SM | TBD | 128 | 128 | 128 | 64 | 64 | 64 | 64 | 64 | 64 | 128 | 192
FP64 CUDA Cores / SM | TBD | 128 | 128 | 128 | 32 | 32 | 32 | 32 | 32 | 32 | 4 | 64
FP32 CUDA Cores | TBD | 16896 | 16896 | 14592 | 6912 | 6912 | 5120 | 5120 | 3584 | 3584 | 3072 | 2880
FP64 CUDA Cores | TBD | 16896 | 16896 | 14592 | 3456 | 3456 | 2560 | 2560 | 1792 | 1792 | 96 | 960
Tensor Cores | TBD | 528 | 528 | 456 | 432 | 432 | 640 | 640 | N/A | N/A | N/A | N/A
Texture Units | TBD | 528 | 528 | 456 | 432 | 432 | 320 | 320 | 224 | 224 | 192 | 240
Boost Clock | TBD | ~1850 MHz | ~1850 MHz | ~1650 MHz | 1410 MHz | 1410 MHz | 1601 MHz | 1530 MHz | 1480 MHz | 1329 MHz | 1114 MHz | 875 MHz
TOPs (DNN/AI) | 20,000 TOPs | 3958 TOPs | 3958 TOPs | 3200 TOPs | 2496 TOPs | 2496 TOPs | 130 TOPs | 125 TOPs | N/A | N/A | N/A | N/A
FP16 Compute | 10,000 TFLOPs | 1979 TFLOPs | 1979 TFLOPs | 1600 TFLOPs | 624 TFLOPs | 624 TFLOPs | 32.8 TFLOPs | 30.4 TFLOPs | 21.2 TFLOPs | 18.7 TFLOPs | N/A | N/A
FP32 Compute | 90 TFLOPs | 67 TFLOPs | 67 TFLOPs | 800 TFLOPs | 156 TFLOPs (19.5 TFLOPs standard) | 156 TFLOPs (19.5 TFLOPs standard) | 16.4 TFLOPs | 15.7 TFLOPs | 10.6 TFLOPs | 10.0 TFLOPs | 6.8 TFLOPs | 5.04 TFLOPs
FP64 Compute | 45 TFLOPs | 34 TFLOPs | 34 TFLOPs | 48 TFLOPs | 19.5 TFLOPs (9.7 TFLOPs standard) | 19.5 TFLOPs (9.7 TFLOPs standard) | 8.2 TFLOPs | 7.80 TFLOPs | 5.30 TFLOPs | 4.7 TFLOPs | 0.2 TFLOPs | 1.68 TFLOPs
Memory Interface | 8192-bit HBM3e | 5120-bit HBM3e | 5120-bit HBM3 | 5120-bit HBM2e | 6144-bit HBM2e | 6144-bit HBM2e | 4096-bit HBM2 | 4096-bit HBM2 | 4096-bit HBM2 | 4096-bit HBM2 | 384-bit GDDR5 | 384-bit GDDR5
Memory Size | Up To 192 GB HBM3e @ 8.0 Gbps | Up To 141 GB HBM3e @ 6.5 Gbps | Up To 80 GB HBM3 @ 5.2 Gbps | Up To 94 GB HBM2e @ 5.1 Gbps | Up To 40 GB HBM2 @ 1.6 TB/s / Up To 80 GB HBM2 @ 2.0 TB/s | Up To 40 GB HBM2 @ 1.6 TB/s / Up To 80 GB HBM2 @ 1.6 TB/s | 16 GB HBM2 @ 1134 GB/s | 16 GB HBM2 @ 900 GB/s | 16 GB HBM2 @ 732 GB/s | 16 GB HBM2 @ 732 GB/s / 12 GB HBM2 @ 549 GB/s | 24 GB GDDR5 @ 288 GB/s | 12 GB GDDR5 @ 288 GB/s
TDP | 700W | 700W | 700W | 350W | 400W | 250W | 250W | 300W | 300W | 250W | 250W | 235W