The H100 GPU contains four distinct memory tiers, and performance depends on which tier holds the data at any given moment.
Modern machine learning models carry billions to trillions of parameters and train on massive datasets. The compute cores that process these numbers can finish an individual operation in nanoseconds. The bottleneck is not calculation; it is movement. Data must travel from storage to the processor, and that distance creates latency. To manage this, NVIDIA designs memory as a pyramid. The bottom is wide and slow. The top is narrow and fast. This structure is not optional for high-performance computing; it is a physical necessity dictated by signal propagation delay and silicon density.
NVIDIA's published documentation for the Hopper architecture details four specific layers. Each layer serves a different class of data. Registers hold the operands of the active calculation. L1 cache and shared memory hold the tiles of data a thread block is currently working on. L2 cache holds data shared across the whole GPU, such as recently used weights. High Bandwidth Memory holds the entire model state. Moving data up this pyramid costs time and energy, and every access that falls through to a lower tier is slower. The efficiency of a training run depends on how often the software can keep the necessary data at the top.
The memory hierarchy
| Tier | Capacity | Bandwidth | Latency | Primary Scope |
|---|---|---|---|---|
| Registers | 256KB | ~100TB/s | 1 cycle | Per Streaming Multiprocessor |
| L1 / Shared | 228KB | ~33TB/s | <100 cycles | Per Streaming Multiprocessor |
| L2 Cache | 50MB | ~10TB/s | ~500 cycles | Per GPU |
| HBM3 | 80GB | 3.3TB/s | >1,000 cycles | Per GPU |
The numbers reveal the tradeoff. Capacity and bandwidth move in opposite directions. The 80GB of HBM3 stores the full model, but its 3.3TB/s bandwidth limits the rate at which the processor can be fed. The registers offer roughly 100TB/s of effective bandwidth, but only 256KB of space per streaming multiprocessor. If the active calculation fits in registers, it runs at full speed. If the calculation spills over to HBM3, the processor waits. This waiting is the “memory wall,” a term coined by Wulf and McKee in a 1995 article in ACM SIGARCH Computer Architecture News to describe the growing gap between processor speed and memory speed.
The H100 architecture mitigates this gap through data locality. CUDA kernels, whether handwritten or generated by frameworks like PyTorch or JAX, must stage data from HBM3 into shared memory and registers before computation begins; the L1 and L2 caches fill implicitly along the way. If the data is not there, the compute cores sit idle. This idle time is not theoretical. In large language model training, figures of 60% to 70% of total step time are sometimes cited for memory transfers rather than arithmetic, though the share varies with model, batch size, and kernel quality and is best measured with a profiler such as Nsight Systems. The hardware is capable of more; the memory system is the limit.
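Whether a given kernel is limited by memory or by arithmetic can be estimated with a roofline-style calculation. The sketch below is illustrative only: the HBM3 bandwidth comes from the table above, while `PEAK_FLOPS` is an assumed dense BF16 peak that this article does not state, and the function names are hypothetical.

```python
# Roofline-style estimate: is a kernel limited by compute or by HBM3 traffic?
HBM_BANDWIDTH = 3.3e12  # bytes/s, taken from the table above
PEAK_FLOPS = 989e12     # FLOP/s, ASSUMED dense BF16 peak (not stated in this article)


def is_memory_bound(flops: float, bytes_moved: float) -> bool:
    """True when arithmetic intensity falls below the ridge point."""
    intensity = flops / bytes_moved        # FLOPs per byte of HBM3 traffic
    ridge = PEAK_FLOPS / HBM_BANDWIDTH     # about 300 FLOPs per byte here
    return intensity < ridge


def attainable_flops(flops: float, bytes_moved: float) -> float:
    """Upper bound on delivered FLOP/s for the given work and traffic."""
    intensity = flops / bytes_moved
    return min(PEAK_FLOPS, intensity * HBM_BANDWIDTH)


# Elementwise add of two BF16 vectors: n FLOPs, 2 reads + 1 write of 2 bytes each.
n = 1 << 20
add_flops, add_bytes = n, 6 * n

# Square BF16 matmul: 2*m^3 FLOPs, at least 3 matrices of m*m 2-byte elements moved.
m = 8192
mm_flops, mm_bytes = 2 * m**3, 6 * m**2

print(is_memory_bound(add_flops, add_bytes))  # elementwise add: memory bound
print(is_memory_bound(mm_flops, mm_bytes))    # large matmul: compute bound
```

Under these assumptions the elementwise add can never deliver more than a small fraction of peak, no matter how the kernel is written, because each byte fetched supports only a sixth of a FLOP.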
The cost of a memory miss scales with the tier. Registers do not miss in the cache sense; the compiler allocates them statically, and values that do not fit are spilled to slower memory. A miss in the L1 cache forces a fetch from L2. A miss in L2 forces a fetch from HBM3. Each step down the hierarchy raises latency several times over, from under 100 cycles to roughly 500 to more than 1,000. The L1 cache offers 33TB/s, which is ten times the 3.3TB/s of HBM3. A bandwidth-bound workload that stays in L1 can therefore run up to ten times faster than one that constantly accesses HBM3. At that ratio, the difference decides whether a 70-billion parameter model trains in a month or in most of a year.
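The effect of hit rates on latency can be sketched as a weighted average over the tiers. This is a simplified model, not a description of how the hardware schedules requests: it treats the table's latency bounds (100, 500, and 1,000 cycles) as exact round-trip figures, ignores latency hiding by other warps, and uses hypothetical function and parameter names.

```python
# Expected access latency as a probability-weighted average of tier latencies.
# The cycle counts are the rough bounds from the table, treated as exact (ASSUMPTION).
L1_CYCLES = 100
L2_CYCLES = 500
HBM_CYCLES = 1000


def expected_latency_cycles(l1_hit_rate: float, l2_hit_rate: float) -> float:
    """l2_hit_rate is the hit rate among accesses that already missed L1."""
    served_by_l1 = l1_hit_rate
    served_by_l2 = (1 - l1_hit_rate) * l2_hit_rate
    served_by_hbm = (1 - l1_hit_rate) * (1 - l2_hit_rate)
    return (served_by_l1 * L1_CYCLES
            + served_by_l2 * L2_CYCLES
            + served_by_hbm * HBM_CYCLES)


good_locality = expected_latency_cycles(0.9, 0.5)  # about 165 cycles
poor_locality = expected_latency_cycles(0.5, 0.5)  # about 425 cycles
print(round(good_locality), round(poor_locality))
```

In this model, dropping the L1 hit rate from 90% to 50% more than doubles the average access latency even though the L2 hit rate is unchanged.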
This hierarchy also dictates hardware cost. HBM3 is expensive to manufacture because it stacks memory dies vertically using through-silicon vias. The 80GB capacity is critical for holding model weights without constant swapping. The L2 cache, at 50MB, is a buffer that smooths out bursts of demand. It can hold the weights of a small layer, or a slice of a large one, but never the whole model. Registers are the most expensive tier per byte, because each bit is built from logic-speed circuitry on the compute die itself and competes for area with the arithmetic units, which is why there are so few of them.
Taken together, the tiers balance capacity against throughput. The 80GB of HBM3 provides the capacity to load a massive model. The 33TB/s of L1 provides the throughput to process that model quickly. The 256KB of registers per streaming multiprocessor provide the immediate execution environment. If the software fails to use the L1 cache effectively, the processor is starved. If the model exceeds the 80GB HBM3 limit, the system must swap data to CPU RAM over the host link, which drops bandwidth to roughly 100GB/s or less. This drop is catastrophic for training efficiency.
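A back-of-the-envelope check shows how quickly a model outgrows the 80GB limit. The sketch below counts weights only and assumes decimal gigabytes; optimizer state, gradients, and activations would add to the footprint. The function names and the example model sizes are illustrative assumptions.

```python
# Does a model's weight footprint fit in HBM3? Weights only (ASSUMPTION):
# optimizer state, gradients, and activations are ignored.
HBM_CAPACITY_BYTES = 80e9  # 80GB from the table, decimal gigabytes assumed
HBM_BANDWIDTH = 3.3e12     # bytes/s, from the table
HOST_BANDWIDTH = 100e9     # bytes/s, the article's rough figure for CPU RAM


def weight_bytes(num_params: float, bytes_per_param: int = 2) -> float:
    """Footprint of the weights; 2 bytes per parameter corresponds to BF16/FP16."""
    return num_params * bytes_per_param


def fits_in_hbm(num_params: float, bytes_per_param: int = 2) -> bool:
    return weight_bytes(num_params, bytes_per_param) <= HBM_CAPACITY_BYTES


print(fits_in_hbm(7e9))    # 14GB of 16-bit weights: fits
print(fits_in_hbm(70e9))   # 140GB of 16-bit weights: does not fit on one GPU

# Bandwidth penalty for any data that has to come from CPU RAM instead of HBM3.
host_slowdown = HBM_BANDWIDTH / HOST_BANDWIDTH
print(host_slowdown)       # about 33x slower per byte
```

With these figures, a 70-billion parameter model in 16-bit precision needs 140GB for weights alone, so it must be split across GPUs or partly held off-chip at a cost of roughly 33 times less bandwidth.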
The engineering challenge is to maximize the use of the top three tiers while keeping the bottom tier fully loaded. Kernel developers write custom CUDA code for specific model layers, aiming to keep weights in L1 and active calculations in registers. Reducing the traffic between HBM3 and L2 is the lever that converts theoretical FLOPS into delivered FLOPS.
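Tiling is the standard way kernels cut HBM3 traffic. The sketch below counts how many input elements a square matrix multiply reads from HBM3 when each output tile is computed from data staged on chip. It is a simplified counting model with hypothetical names, assuming no reuse across tiles through L2 and that the tile size divides the matrix size evenly.

```python
# Count elements of A and B read from HBM3 for C = A @ B with n x n matrices,
# when each tile x tile block of C is computed from on-chip copies of its inputs.
# ASSUMPTIONS: no cross-tile reuse via L2; tile divides n evenly.


def hbm_elements_loaded(n: int, tile: int) -> int:
    if n % tile != 0:
        raise ValueError("tile must divide n")
    output_tiles = (n // tile) ** 2
    # Each output tile needs a (tile x n) strip of A and an (n x tile) strip of B.
    loads_per_tile = 2 * tile * n
    return output_tiles * loads_per_tile  # equals 2 * n**3 / tile


n = 4096
untiled = hbm_elements_loaded(n, tile=1)    # every output element refetches its inputs
tiled = hbm_elements_loaded(n, tile=128)    # inputs reused 128 times on chip
print(untiled // tiled)                     # traffic reduction factor
```

In this model, traffic falls in direct proportion to the tile edge, so the largest tile that fits in shared memory and registers gives the largest reduction.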
The closer
Every 1GB moved from HBM3 takes roughly 300 microseconds at 3.3TB/s. The same 1GB moved through L1 takes roughly 30 microseconds at 33TB/s. For a training run that moves 100PB of data in total, the difference between holding the working set in L1 versus HBM3 is roughly 27,000 seconds, or about 7.5 hours of wall-clock time. The 80GB of HBM3 is the reservoir; the 228KB of L1 is the engine. A model architecture that lifts register reuse by 10% can save more wall-clock time than buying faster HBM3. The bill for using more capacity is paid in time, and the lever that lowers the bill is the working-set size.
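The per-gigabyte figures above follow directly from dividing bytes by bandwidth. A minimal sketch, using the table's bandwidths and decimal units, with illustrative function names:

```python
# Transfer time = bytes / bandwidth, using the bandwidths from the table.
HBM_BANDWIDTH = 3.3e12  # bytes/s
L1_BANDWIDTH = 33e12    # bytes/s
GB = 1e9                # decimal gigabyte (ASSUMPTION)


def transfer_seconds(num_bytes: float, bandwidth: float) -> float:
    return num_bytes / bandwidth


def seconds_saved(total_bytes: float) -> float:
    """Time saved if total_bytes are served at L1 speed instead of HBM3 speed."""
    return (transfer_seconds(total_bytes, HBM_BANDWIDTH)
            - transfer_seconds(total_bytes, L1_BANDWIDTH))


hbm_us = transfer_seconds(GB, HBM_BANDWIDTH) * 1e6  # about 303 microseconds
l1_us = transfer_seconds(GB, L1_BANDWIDTH) * 1e6    # about 30 microseconds
print(round(hbm_us), round(l1_us))
```

The saving scales linearly with the total bytes moved, so `seconds_saved` can be applied to any traffic estimate for a full run.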