Highlights from Jensen Huang's 2024 GTC Keynote

NVIDIA Blackwell GPU Architecture Overview

By Kateryna M

Key Highlights from Jensen Huang's 2024 GTC Keynote

  • Introduction of Blackwell Architecture: A leap forward in AI technology, promising to revolutionize various industries.
  • Project GR00T Announcement: A foundation model for humanoid robots, bringing us closer to more interactive and capable machines.
  • Collaboration with Apple Vision Pro: NVIDIA's Omniverse content streamed to Apple's headset, aiming to enhance virtual reality experiences.
  • Digital Twin of the Planet: An ambitious project to recreate the entire planet digitally for better weather forecasting and more.

Notable Announcements and Developments

  • Humanoid Robots in Development: Demonstrated potential applications across factories, healthcare, and science, signaling the future may not be far off.
  • Isaac Perceptor and Manipulator SDKs: New software aimed at autonomous mobile robots and robotic arms, providing them with greater insight and intelligence.
  • Omniverse Cloud Streaming to Apple Vision Pro: A step towards more immersive virtual environments, powered by NVIDIA's cutting-edge technology.
  • Partnership with Nissan: Highlighted the potential for AI in customizing new car options, showcasing a blend of technology and consumer choice.
  • Siemens Collaboration: NVIDIA's technology will be used to boost productivity and efficiency in virtual warehouses, marking a significant industrial application.
  • Advancements in Healthcare: NVIDIA is building models to aid researchers worldwide, speeding up drug discovery processes.
  • Earth-2 APIs for Better Weather Forecasting: A collaboration with The Weather Company to improve predictions and save lives.

NVIDIA Blackwell GPU Architecture Overview

Introduction

The NVIDIA Blackwell architecture introduces a significant leap forward in generative AI accelerator technology, incorporating the B200 and B100 accelerators. Named in honor of Dr. David Harold Blackwell, a pioneer in statistics and mathematics, this next-generation architecture is designed to redefine performance, flexibility, and efficiency in datacenter and high-performance computing (HPC).

Blackwell Comparison


Specification | GB200 | B200 | B100 | HGX B200 | HGX B100
Configuration | 2x B200 GPU + 1x Grace CPU | 1x Blackwell GPU | 1x Blackwell GPU | 8x B200 GPU | 8x B100 GPU
FP4 Tensor, dense/sparse | 20/40 petaFLOPS | 9/18 petaFLOPS | 7/14 petaFLOPS | 72/144 petaFLOPS | 56/112 petaFLOPS
FP6/FP8 Tensor, dense/sparse | 10/20 petaFLOPS | 4.5/9 petaFLOPS | 3.5/7 petaFLOPS | 36/72 petaFLOPS | 28/56 petaFLOPS
INT8 Tensor, dense/sparse | 10/20 petaOPS | 4.5/9 petaOPS | 3.5/7 petaOPS | 36/72 petaOPS | 28/56 petaOPS
FP16/BF16 Tensor, dense/sparse | 5/10 petaFLOPS | 2.25/4.5 petaFLOPS | 1.8/3.5 petaFLOPS | 18/36 petaFLOPS | 14/28 petaFLOPS
TF32 Tensor, dense/sparse | 2.5/5 petaFLOPS | 1.12/2.25 petaFLOPS | 0.9/1.8 petaFLOPS | 9/18 petaFLOPS | 7/14 petaFLOPS
FP64 Tensor, dense | 90 teraFLOPS | 40 teraFLOPS | 30 teraFLOPS | 320 teraFLOPS | 240 teraFLOPS
Memory | 384GB (2x 8x 24GB) | 192GB (8x 24GB) | 192GB (8x 24GB) | 1536GB (8x 8x 24GB) | 1536GB (8x 8x 24GB)
Memory Bandwidth | 16 TB/s | 8 TB/s | 8 TB/s | 64 TB/s | 64 TB/s
NVLink Bandwidth | 2x 1.8 TB/s | 1.8 TB/s | 1.8 TB/s | 14.4 TB/s | 14.4 TB/s
Power | Up to 2700W | 1000W | 700W | ~8000W (unconfirmed) | ~5600W (unconfirmed)
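
Three patterns make the table easy to sanity-check: the sparse figure is always double the dense one (via 2:4 structured sparsity), each halving of precision doubles peak throughput, and the HGX baseboards are eight of the corresponding GPU. A minimal Python sketch reproducing the B200 and HGX B200 columns from the FP16/BF16 dense baseline (illustrative arithmetic only, not an NVIDIA formula):

    # Derive the B200 and HGX B200 columns from the FP16/BF16 dense baseline.
    # Pattern from the table: sparse = 2x dense, halving precision doubles
    # peak throughput, and HGX B200 = 8x B200.
    B200_FP16_DENSE = 2.25  # petaFLOPS, from the comparison table

    for fmt, mult in [("FP16/BF16", 1), ("FP6/FP8", 2), ("FP4", 4)]:
        dense = B200_FP16_DENSE * mult
        sparse = 2 * dense  # 2:4 structured sparsity doubles peak throughput
        print(f"{fmt:>9}: B200 {dense:g}/{sparse:g} PF, "
              f"HGX B200 {8 * dense:g}/{8 * sparse:g} PF")

TF32 fits the same pattern at half the FP16/BF16 rate.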

Features and Capabilities

Dual-Die Chiplet Design

  • Integrates two reticle-sized GPU dies within a single package, embracing a chiplet-based approach for flagship accelerators; to software, the package still presents as one GPU (see the sketch after this list).
  • Enables substantial increases in transistor count, computing power, and memory capacity.
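
Because the package presents as a single device, frameworks need no chiplet-specific changes. A minimal sketch, assuming a machine with PyTorch and a CUDA driver installed, that prints what the runtime reports; on a Blackwell system each dual-die package should appear as one device:

    import torch

    # Each dual-die Blackwell package should enumerate as a single CUDA
    # device: software sees one unified GPU, not two dies.
    if torch.cuda.is_available():
        for i in range(torch.cuda.device_count()):
            p = torch.cuda.get_device_properties(i)
            print(f"device {i}: {p.name}, "
                  f"{p.total_memory / 2**30:.0f} GiB, "
                  f"{p.multi_processor_count} SMs")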

Transistor Count and Memory

  • Features a total of 208 billion transistors (104 billion per die), roughly 30% more per die than the H100's 80 billion.
  • Equipped with 8 stacks of HBM3E memory, providing 192GB of VRAM and 8TB/s of memory bandwidth, nearly 2.4x that of the H100 accelerator (the arithmetic is checked in the sketch below).
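
A quick sanity check of those numbers (the 2.4x comparison assumes the H100 SXM's 3.35 TB/s of HBM3 bandwidth):

    # Sanity-check the memory figures quoted above.
    stacks, gb_per_stack = 8, 24  # 8x 24GB HBM3E stacks
    print(f"capacity: {stacks * gb_per_stack} GB")  # -> 192 GB

    blackwell_bw, h100_bw = 8.0, 3.35  # TB/s; H100 SXM as the reference
    print(f"bandwidth ratio: {blackwell_bw / h100_bw:.2f}x")  # -> 2.39x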

Compute Performance

  • Optimized tensor cores support operations down to FP4 precision, enabling up to 10 petaFLOPS of FP8 performance and 20 petaFLOPS of FP4 performance for inference tasks (a toy illustration of the FP4 format follows this list).
  • Utilizes the NV-High Bandwidth Interface (NV-HBI), offering 10TB/s of bandwidth for inter-die communication, so the two dies operate seamlessly as one unified CUDA GPU without an inter-die performance bottleneck.
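
FP4 here is a 4-bit floating-point format, commonly E2M1 (one sign bit, two exponent bits, one mantissa bit), which encodes only sixteen distinct values. A toy round-to-nearest quantizer gives a feel for how coarse that is (illustrative only; real FP4 inference pairs the format with scaling factors rather than this bare mapping):

    # Toy round-to-nearest quantizer for FP4 E2M1 (1 sign, 2 exponent,
    # 1 mantissa bit). Representable magnitudes: 0, 0.5, 1, 1.5, 2, 3, 4, 6.
    E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

    def quantize_fp4(x: float) -> float:
        sign = -1.0 if x < 0 else 1.0
        mag = min(abs(x), 6.0)  # clamp to the largest representable magnitude
        return sign * min(E2M1_GRID, key=lambda g: abs(g - mag))

    for v in [0.2, 0.7, 1.3, 2.6, -3.7]:
        print(f"{v:5.2f} -> {quantize_fp4(v):4.1f}")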

Efficiency and Performance

  • Despite staying on a 4nm-class TSMC 4NP manufacturing process rather than moving to a smaller node, Blackwell claims a 4x increase in training performance, a 30x improvement in inference performance, and 25x greater energy efficiency over the prior Hopper generation, per NVIDIA's figures.

DGX B200 Specifications

Specification | Details
GPU | 8x NVIDIA B200 Tensor Core GPUs
GPU Memory | 1,440GB total
Performance | 72 petaFLOPS training, 144 petaFLOPS inference
Power Consumption | ~14.3kW max
CPU | 2x Intel® Xeon® Platinum 8570 (112 cores total, 2.1 GHz base, 4 GHz max boost)
System Memory | Up to 4TB
Networking | 4x OSFP ports serving 8x single-port NVIDIA ConnectX-7 VPI (up to 400Gb/s InfiniBand/Ethernet); 2x dual-port QSFP112 NVIDIA BlueField-3 DPUs (up to 400Gb/s InfiniBand/Ethernet)
Management Network | 10Gb/s onboard NIC with RJ45; 100Gb/s dual-port Ethernet NIC; host BMC with RJ45
Storage | OS: 2x 1.9TB NVMe M.2; internal: 8x 3.84TB NVMe U.2
Software | NVIDIA AI Enterprise, NVIDIA Base Command™, DGX OS / Ubuntu
Rack Units (RU) | 10 RU
System Dimensions | Height: 17.5in (444mm), Width: 19.0in (482.2mm), Length: 35.3in (897.1mm)
Operating Temperature | 5–30°C (41–86°F)
Enterprise Support | Three-year Enterprise Business-Standard Support; 24/7 portal access; live agent support during business hours
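
Read against the Blackwell comparison table above, the headline figures line up as 8x the per-GPU sparse numbers: 72 petaFLOPS training matches 8x the B200's 9 petaFLOPS sparse FP8, and 144 petaFLOPS inference matches 8x its 18 petaFLOPS sparse FP4 (my reading of the two tables, not an official derivation):

    # DGX B200 headline performance as 8x the per-GPU sparse figures
    # from the comparison table (an inference, not an official formula).
    gpus = 8
    fp8_sparse, fp4_sparse = 9, 18  # petaFLOPS per B200, sparse
    print(gpus * fp8_sparse)  # 72  -> quoted training petaFLOPS
    print(gpus * fp4_sparse)  # 144 -> quoted inference petaFLOPS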