2025-08-03 06:15:03
Zen 5 is AMD’s newest core architecture. Compared to Zen 4, Zen 5 brings more reordering capacity, a reorganized execution engine, and numerous enhancements throughout its pipeline. In short, Zen 5 is wider and deeper than Zen 4. Like Lion Cove, Zen 5 delivers clear gains in the standard SPEC CPU2017 benchmark as well as many productivity applications. And like Lion Cove, several reviewers have criticized the non-X3D Zen 5 variants for not delivering similar gains in games. Here, I’ll be testing a few games on the Ryzen 9 9900X, with DDR5-5600 memory. It’s a somewhat slower memory configuration than I tested Lion Cove with, largely for convenience. I swapped the 9900X in place of my previous 7950X3D, keeping everything else the same. The memory used is a 64 GB G.SKILL DDR5-5600 36-36-36-89 kit.
The Ryzen 9 9900X was kindly sampled by AMD, as is the Radeon RX 9070 used to run the games here. I’ll be using the same games as in the Lion Cove gaming article, namely Palworld, COD Cold War, and Cyberpunk 2077. However, the data is not directly comparable; I’ve built up my Palworld base since then, COD Cold War multiplayer sessions are inherently unpredictable, and Cyberpunk 2077 received an update which annoyingly forces a 60 FPS cap, regardless of VSYNC or FPS cap settings. My goal here is to look for broad trends rather than do a like-for-like performance comparison.
As with the previous article on Lion Cove, top-down analysis will provide a starting point by accounting for lost pipeline throughput at the rename/allocate stage. It’s the narrowest stage in the pipeline, so throughput lost there can’t be recovered later and results in lower utilization of core width. Lion Cove was heavily bound by backend memory latency, with frontend latency causing additional losses. Zen 5 hits those issues in reverse. From a top-down view, it struggles to keep its frontend fed. Backend memory latency is still significant, but it is overshadowed by frontend latency.
A pipeline slot is considered frontend latency bound if the frontend left all eight rename/allocate slots idle that cycle. Backend bound refers to when the rename/allocate stage had micro-ops to dispatch, but the execution engine ran out of entries in its various buffers and queues. AMD breaks down backend bound slots into core-bound and memory-bound categories, by looking at how often the retirement stage is blocked by an incomplete load versus an incomplete instruction of another type. That’s because backend-bound stalls come up when the execution engine is unable to clear out (retire) instructions faster than the frontend supplies them. Bad speculation looks at the difference between micro-ops that pass through the rename/allocate stage, and ones that were actually retired. That gives a measure of wasted work caused by branch mispredicts and other late-stage redirects like exceptions and interrupts. It’s a negligible factor in all three games.
SMT contention doesn’t indicate lost core throughput. Rather, core performance counters work on a per-SMT thread basis, and SMT contention indicates when the thread had micro-ops ready from the frontend, but the rename/allocate stage serviced the sibling SMT thread that cycle. A very high SMT contention metric could indicate that a single thread can already use much of the core’s throughput, and thus SMT gains may be limited; however, that’s not the case here. Finally, the “retiring” metric corresponds to useful work and indicates how effectively the workload uses core width. It’s relatively low on the three games here, providing a first indication that games could be considered a “low-IPC” workload.
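To make the accounting concrete, here’s a minimal sketch of how these categories partition the core’s rename/allocate slots once the underlying counts have been collected. The variable names and numbers are placeholders for illustration, not actual Zen 5 performance event names or measured values; the point is simply that every slot falls into exactly one bucket.

```c
// Sketch of top-down slot accounting. Counter values below are placeholders,
// not real Zen 5 performance event names or measured numbers.
#include <stdio.h>

int main(void) {
    double cycles           = 1.0e9;          // unhalted cycles for the thread
    double slots            = cycles * 8.0;   // 8 rename/allocate slots per cycle on Zen 5
    double frontend_latency = 2.9e9;          // slots where the frontend delivered nothing
    double frontend_bw      = 0.6e9;          // slots where it delivered fewer than 8 ops
    double backend_memory   = 1.7e9;          // backend full, retire blocked on a load
    double backend_core     = 0.8e9;          // backend full, blocked on a non-load op
    double smt_contention   = 0.4e9;          // slots given to the sibling SMT thread
    double bad_speculation  = 0.1e9;          // dispatched micro-ops that never retired

    double retiring = slots - frontend_latency - frontend_bw - backend_memory
                    - backend_core - smt_contention - bad_speculation;

    printf("retiring:               %5.1f%% of slots\n", 100.0 * retiring / slots);
    printf("frontend latency bound: %5.1f%%\n", 100.0 * frontend_latency / slots);
    printf("frontend bandwidth:     %5.1f%%\n", 100.0 * frontend_bw / slots);
    printf("backend memory bound:   %5.1f%%\n", 100.0 * backend_memory / slots);
    printf("backend core bound:     %5.1f%%\n", 100.0 * backend_core / slots);
    return 0;
}
```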
Zen 5’s frontend combines a large 6K entry op cache with a 32 KB conventional instruction cache. To hide L1i miss latency, Zen 5 uses a decoupled branch predictor with a massive 24K entry BTB. Zen 5’s op cache covers the majority of the instruction stream in all three games, and enjoys a higher hitrate than Lion Cove’s 5.2K entry op cache. The L1i catches a substantial portion of op cache misses, though misses per instruction, as calculated from L1i refills, look higher than on Lion Cove. 20-30 L1i misses per 1000 instructions is also a bit high in absolute terms. Zen 5’s 1 MB L2 does a good job of catching nearly all of those misses, though a few occasionally slip past and have to come from the higher latency L3, just like on Lion Cove.
Branch prediction accuracy is high, though curiously slightly worse across all three titles than on Lion Cove. That’s surprising because Zen 5 managed better accuracy across SPEC CPU2017, with notable wins in difficult subtests like 541.leela. There’s a high margin of error in these comparisons, and the aforementioned changes to the tested scenes, but the consistent difference in both prediction accuracy and mispredicts per instruction is difficult to ignore.
Mispredicts interrupt the branch predictor’s ability to run ahead of instruction fetch, and expose the frontend to cache latency as it waits for instructions to arrive from the correct path. A mispredict is a comparatively expensive redirect because it affects both the frontend latency and bad speculation categories. Another form of redirect comes from “decoder overrides” in AMD terminology, or “BAClears” (branch address clear) in Intel terms. These happen when the core discovers a branch in frontend stages after the branch predictor, typically when seeing a branch for the first time or when the branch footprint goes beyond the predictor’s tracking capabilities. A redirect from later frontend stages prevents bad speculation losses, but it does expose the core to L1i miss latency. Zen 5 surprisingly takes a few more decoder overrides than Intel does BAClears. That’s not great, considering Zen 5’s branch predictor also has to cover for more L1i misses in the first place.
Delays within the branch predictor can also cause frontend latency. Zen 5’s giant BTB is split into a 16K entry first level and an 8K entry second level. Getting a target from the second level is slower. Similarly, an override from the indirect predictor would cause bubbles in the branch prediction pipeline. However, I expect overrides from within the branch predictor to be a minor factor. The branch predictor can continue following the instruction stream to hide cache latency, and short delays will likely be hidden by backend stalls in a low IPC workload.
Stalls at the renamer due to frontend latency last for 11-12 cycles on average. Zen 5 doesn’t have events that can directly attribute instruction fetch stalls to L1i misses, unlike Lion Cove. 11-12 cycles however is suspiciously close to L2 latency. Of course, the average will be pulled lower by shorter stalls when the pipeline is resteered to a target in the L1i or op cache, as well as longer stalls when the target comes from L3 or DRAM.
While frontend latency bound slots are a bigger factor, Zen 5 does lose some throughput due to being frontend bandwidth bound. Zen 5’s frontend spends much of its active time running in op cache mode. Surprisingly, average bandwidth from the op cache falls short of the 8 micro-ops per cycle that would be required to fully feed the core. The op cache can nominally deliver 12 micro-ops per cycle, but average throughput hovers around 6.
One culprit is branches, which can limit the benefits of widening instruction fetch: op cache throughput correlates negatively with how frequently branches appear in the instruction stream. The three games I tested land in the middle of the pack when placed next to SPEC CPU2017’s workloads. Certainly there’s room for improvement in frontend bandwidth too, but the potential gains are limited because frontend bandwidth bound slots are few to start with.
Average decoder throughput is just under 4 micro-ops per cycle. The decoders are only active for a small minority of cycles, so they represent a small portion of the already small frontend bandwidth bound category. Certainly wider decoders wouldn’t hurt, but the impact would be insignificant.
Backend bound cases occur when the out-of-order execution engine runs out of entries in its various queues, buffers, and register files. That means it cannot accept more micro-ops from the renamer until it can retire some instructions and free up entries in those structures. In other words, the core has reached the limit of how far it can move ahead of a stalled instruction.
Zen 5’s integer register file stands out as a “hot” resource, often limiting reordering capacity before the core’s reorder buffer (ROB) fills. There’s a good chunk of resource stalls that performance monitoring events can’t attribute to a more specific category. Beyond that, Zen 5’s backend design has some bright points. The core’s large unified schedulers are rarely a limiting factor, unlike on TaiShan v110. Zen 5’s reorganized FPU places FP register allocation after a large non-scheduling queue, which basically eliminates FP-related reasons in the resource stall breakdown. In fairness, the older Zen 4 also did well in that category, even though its FP non-scheduling queue serves a more limited purpose of handling overflow from the scheduling queue. All of that means Zen 5 can often fill up its 448 entry ROB, and thus keep a lot of instructions in flight to hide backend memory latency.
When the backend fills up, it’s often due to cache and memory latency. Zen 5’s retire stage is often blocked by an incomplete load, and when it is, it tends to remain blocked longer than average, for around 18-20 cycles. That’s a sign that incomplete loads have longer latency than most instructions.
Average L1d miss duration is much longer than that, and broadly comparable to similar measurements on Lion Cove. Curiously, Zen 5 sees higher average L1d miss latency in COD Cold War and Cyberpunk 2077, but lower latency in Palworld. That doesn’t mean Zen 5 has an easier time in Palworld; rather, the core simply sees more L1d misses there, and those extra misses are largely caught by L2.
Comparing data from AMD and Intel is difficult for reasons beyond the typical margin of error considerations. Intel’s performance monitoring events account for load data sources at the retirement stage, likely by tagging in-flight load instructions with the data source and counting at retirement. AMD counts at the load/store unit, which means Zen 5 may count loads that don’t retire (for example, ones after a mispredicted branch). The closest I can get is using Zen 5’s event for demand data loads. Demand means the access was initiated by an instruction, as opposed to a prefetch. That should minimize the gap, but again, the focus here is on broad trends.
Caveats aside, Palworld seems to make a compelling case for Intel’s 192 KB L1.5d cache. It catches a substantial portion of L1d misses and likely reduces overall load latency compared to Zen 5. On the other hand, Zen 5’s smaller 1 MB L2 has lower latency than Intel’s 3 MB L2 cache. AMD also tends to satisfy a larger percentage of L1d misses from L3 in Cyberpunk 2077 and COD. Intel’s larger L2 is doing its job to keep data closer to the core, though Intel needs it because their desktop platform has comparatively high L3 latency.
On Zen 5, L3 hitrate for demand (as opposed to prefetch) loads comes in at 64.5%, 67.6%, and 55.43% respectively. Most L3 misses head to DRAM. Cross-CCX transfers account for a negligible portion of L3 miss traffic, and sampled latency events at the L3 indicate cross-CCX latency is similar to or slightly better than DRAM latency. In other words, cross-CCX transfers are both rare and no more expensive than the DRAM accesses they sit alongside.
Performance monitoring data from running these games normally therefore doesn’t support the idea that improving cross-CCX latency would yield significant benefits: such transfers are rare, and they already perform on par with or better than DRAM accesses. DRAM latency from the L3 miss perspective is slightly better in Palworld and Cyberpunk 2077 compared to Intel’s Arrow Lake, even though the Ryzen 9 9900X was set up with older and slower DDR5. The situation reverses in Call of Duty Cold War, perhaps indicating bursty demands for DRAM bandwidth in that title.
Running the games above in the normal manner did not generate considerable cross-CCX traffic. Dual-CCD Ryzen parts have the highest clocking cores located on one CCX, and Windows will prefer to schedule threads on higher clocking cores. This naturally makes one CCX handle the bulk of the work from games.
However, I can try to force increased cross-CCX traffic by setting the game process’s core affinity to split it across the Ryzen 9900X’s two CCX-es. I used Cyberpunk’s medium preset with upscaling disabled for the tests above. For this test however, I’m maximizing CPU load by using the low preset, FSR Quality upscaling, and crowd density turned up to the maximum setting. I’ve also turned off boost to eliminate the effects of clock speed variation across cores.
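For reference, here’s a hedged sketch of how such an affinity split could be applied programmatically on Windows (Task Manager or `start /affinity` work as well). The PID, the mask, and the assumed mapping of logical processors to CCXes are illustrative, not details from my setup.

```c
// Hedged sketch: pin an already-running process to a chosen set of logical
// processors on Windows via SetProcessAffinityMask. The PID and mask are
// supplied on the command line. The mapping of logical processors to CCXes
// assumed in the comment below is illustrative, not taken from the test setup.
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    if (argc < 3) {
        printf("usage: %s <pid> <hex affinity mask>\n", argv[0]);
        return 1;
    }
    DWORD pid = (DWORD)strtoul(argv[1], NULL, 10);
    DWORD_PTR mask = (DWORD_PTR)strtoull(argv[2], NULL, 16);

    HANDLE h = OpenProcess(PROCESS_SET_INFORMATION | PROCESS_QUERY_INFORMATION, FALSE, pid);
    if (!h) {
        printf("OpenProcess failed: %lu\n", GetLastError());
        return 1;
    }
    // Example: if logical processors 0-11 sit on one CCX and 12-23 on the other
    // (12 cores, SMT on), a mask like 0x00FC0FC0 selects three cores on each CCX.
    if (!SetProcessAffinityMask(h, mask))
        printf("SetProcessAffinityMask failed: %lu\n", GetLastError());
    CloseHandle(h);
    return 0;
}
```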
Splitting the game across both CCX-es drops performance by 7% compared to pinning it to one CCX. Cyberpunk 2077’s built-in benchmark does display a few percentage points of run-to-run variation, but 7% is outside the margin of error. Monitoring all L1D fill sources indicates a clear increase in cross-CCX accesses. This metric differs from demand accesses and is even less comparable to Lion Cove’s at-retirement accounting of load sources, because it includes prefetches. However, it does account for cases where a load instruction matches an in-flight fill request previously initiated by a prefetch. There’s also more error because useless prefetches will be included as well.
Performance monitoring events attribute this to cross-CCX accesses, which now account for a significant minority of off-CCX accesses. Fewer accesses are serviced by low latency data sources within the CCX, leading to reduced performance.
Games are difficult, low-IPC workloads thanks to poor locality on both the data and instruction side. Zen 5’s lower cache and memory latency can give it an advantage on the data side, but the core remains underutilized due to frontend latency. Lion Cove’s 64 KB L1i is a notable advantage, unfortunately blunted by high L3 and DRAM latency on the Arrow Lake desktop platform. It’s interesting how Intel and AMD’s current cores face similar challenges in games, but comparatively struggle on different ends of the core pipeline.
From a higher level view, AMD and Intel’s designs this generation appear to prioritize peak throughput rather than improving performance in difficult cases. A wider core with more execution units will provide the biggest benefits in high IPC workloads, where the core may spend a significant portion of time with plenty of instructions and data to chug through; SPEC CPU2017’s 538.imagick and 548.exchange2 are excellent examples. In contrast, workloads with low average IPC have far fewer “easy” high IPC sequences, limiting potential benefits from increased core throughput.
Of course, Zen 5 and Lion Cove both take measures to tackle those lower IPC workloads. Better branch prediction and increased reordering depth are staples of each new CPU generation; they’re present on Zen 5 and Lion Cove, too. But pushing both forwards runs into diminishing returns: catching the few remaining difficult branches likely requires disproportionate investment in branch prediction resources. Similarly, increasing reordering capacity requires proportionate growth in register files, load/store queues, schedulers, and other expensive core structures.
Besides making the core more latency-tolerant, AMD and Intel could try to reduce latency through better caching. Intel’s 64 KB L1i deserves credit here, though it’s not new to Lion Cove and was present on the prior Redwood Cove core as well. Zen 5’s L2/L3 setup is largely unchanged from Zen 4. L2 caches are still 1 MB, and the CCX-shared 32 MB or 96 MB (with X3D) L3 is still in place. A hypothetical core with both Intel’s larger L1i and AMD’s low latency caching setup could be quite strong indeed, and any further tweaks in the cache hierarchy would further sweeten the deal.
System topology is potentially an emerging concern. High core counts on today’s flagship desktop chips are a welcome change from the quad core pattern of the early 2010s, but achieving high core counts with a scalable, cost-effective design is always a challenge. AMD approaches this by splitting cores into clusters (CCX-es), which has the downside of increased latency when a core on one cluster needs to read data recently written by a core on another cluster. The three games I tested do the bulk of their work on one CCX, but a game that spills out of one CCX can see its multithreaded performance scaling limited by this cross-CCX latency.
For now, this doesn’t appear to be a major problem, contrary to the criticism AMD often takes for high cross-CCX latency. Forcing a game to run on three cores per CCX is a very artificial scenario, and AMD has not used a split CCX setup on their low end 6-core parts since Zen 2. Data from other reviewers covering a larger variety of games suggests multi-CCX parts turn in similar performance to their single-CCX counterparts, with the exception of Zen 4 parts where the two CCX-es have very different clock speeds and cache capacities. AMD’s practice of placing all fast cores on one CCD and more recently, parking cores on a CCD, seem to be working well.
Similarly, Intel’s “Thread Director” has done a good job of ensuring games don’t lose performance by being scheduled onto the density optimized E-Cores. TechPowerUp notes performance differences with E-Cores enabled or disabled rarely exceeded 10% in either direction. Thus the core count scaling techniques employed by both AMD and Intel don’t impact gaming performance to a significant degree. Today’s high end core counts may become mainstream in tomorrow’s midrange or low end chips, and future games may spill out of a 9900X’s CCX or a 285K’s P-Cores. But even in that case, plain cache and memory latency will likely remain the biggest factors holding back gaming performance.
With that in mind, I look forward to seeing what next generation cores will look like. AMD, Intel, and any other CPU maker has to optimize their cores for a massive variety of workloads. Games are just one workload in the mix, alongside productivity applications, high performance computing, and server-side code. However, I would be pleased if the focus shifted away from pushing high IPC cases to even higher IPC, towards maintaining better performance in difficult low IPC cases.
If you like the content then consider heading over to the Patreon or PayPal if you want to toss a few bucks to Chips and Cheese. Also consider joining the Discord.
2025-07-23 07:48:18
Huawei is one of China’s largest technology companies, with enterprise products spanning everything from servers to networking equipment. All of those products require advanced chips to remain relevant. Huawei has invested in developing its own chips with its HiSilicon subsidiary, which lets Huawei both tailor chip designs to its requirements and safeguard its business against supply chain disruptions. Kunpeng 920 is a chiplet-based CPU design that targets a variety of enterprise applications including cloud servers, AI accelerators, and wireless base stations.
Here, we’re looking at a 24 core Kunpeng 920 CPU subsystem found in a Huawei network card.
Special thanks goes out to Brutus for setting this one up!
Kunpeng 920 uses multiple dies with TSMC’s CoWoS packaging to implement what HiSilicon calls “LEGO-based production”. HiSilicon’s chiplet strategy uses dies of equal height placed side-by-side, with compute dies in the center and IO dies on the side. Compute dies are called Super CPU Clusters (SCCLs) and include DDR4 controllers on the top and bottom edges of the die, which uses all of the chip’s edge area for off-chip interfaces. The SCCLs are fabricated on TSMC’s 7nm process and contain up to 32 TaiShan v110 CPU cores with L3 cache. A separate IO die uses TSMC’s 16nm node, and connects to PCIe, SATA, and other lower speed IO. All dies sit on top of a 65nm interposer.
Inter die bandwidth is able to achieve up to 400 GB/s with coherency
Kunpeng 920: The First 7-nm Chiplet-Based 64-Core ARM SoC for Cloud Services
HiSilicon’s LEGO-based production has parallels to Intel’s chiplet strategy, which similarly emphasizes high cross-die bandwidth at the cost of more expensive packaging technologies and tighter distance limits between dies. As with Intel’s Sapphire Rapids, placing memory controllers on the CPU dies lets smaller SKUs access DRAM without needing to route memory requests through another chiplet. Sapphire Rapids uses its high cross-die bandwidth to make its multi-die setup appear monolithic to software. L3 and DRAM resources can be seamlessly shared across dies, in contrast to a NUMA setup where software has to work with different memory pools. Strangely, I wasn’t able to find any evidence that Kunpeng 920 can combine L3 and DRAM resources across multiple SCCLs.
Kunpeng 920 supports dual and quad socket configurations using Huawei’s “Hydra” links, which helps scale core counts further. Contemporary server processors with similar per-socket core counts, like Ampere Altra and AMD’s Zen 2, only scale up to dual socket configurations.
TaiShan v110 cores within a SCCL compute die are grouped into quad core CPU clusters (CCLs). A bidirectional ring bus links blocks on the compute die, including CPU clusters, L3 data banks, memory controllers, and links to other dies. L3 data banks are paired with CPU clusters, but appear to sit on separate ring stops instead of sharing one with a CPU cluster as they do on Intel and AMD designs. A fully enabled SCCL with eight CPU clusters has 21 ring stops. Our 24 core SKU likely has two CPU clusters and their L3 banks disabled, though it’s not clear whether the associated ring stops remain active.
Unusually, Huawei places L3 tags at the CPU clusters rather than at the L3 data banks. The L3 can also operate in different modes. “Shared” mode behaves like the L3 on AMD, Arm, and Intel chips, using all L3 banks together to form a large shared cache. Presumably the physical address space is hashed across L3 banks, evenly distributing accesses across the data banks to scale bandwidth while preventing data duplication. “Private” mode makes a L3 bank private to the closest CPU cluster, improving L3 performance by taking much of the interconnect out of the picture. A third “partition” mode can adjust each core cluster’s private L3 capacity on the fly. Huawei’s paper implies partition mode can also dynamically adjust L3 policy between shared and private mode, handling situations where different tasks or even phases of the same task prefer private or shared L3 behavior.
Partition mode is the default, and the only mode on the test system. Some Kunpeng 920 systems allow setting L3 cache policies in the BIOS, but the test system does not have a BIOS interface and cache control settings are not exposed through UEFI variables. With partition mode, a core sees reasonable 36 cycle L3 latency out to just under 4 MB. Latency gradually increases at larger test sizes as the private L3 portion expands to include nearby L3 slices. Finally, latency exceeds 90 cycles as test sizes approach L3 capacity.
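Latency curves like these are typically produced with a pointer chasing pattern, where each load’s address depends on the previous load’s result. A minimal sketch of that kind of test is below; a real latency test also pins the thread, converts time to core cycles, and handles TLB effects, all of which is omitted here.

```c
// Minimal pointer chasing latency sketch: build a random circular chain of
// pointers through a buffer of the chosen size, then follow it so every load
// depends on the previous one. Time per hop approximates load-to-use latency
// at whatever level of the cache hierarchy the buffer fits into.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double chase_ns(size_t bytes, size_t iters) {
    size_t count = bytes / sizeof(void *);
    void **buf = malloc(count * sizeof(void *));
    size_t *idx = malloc(count * sizeof(size_t));

    // Random permutation (Fisher-Yates), then link elements into one big cycle
    for (size_t i = 0; i < count; i++) idx[i] = i;
    for (size_t i = count - 1; i > 0; i--) {
        size_t j = rand() % (i + 1);
        size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }
    for (size_t i = 0; i < count; i++)
        buf[idx[i]] = &buf[idx[(i + 1) % count]];

    void **p = &buf[idx[0]];
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < iters; i++)
        p = (void **)*p;                       // dependent load chain
    clock_gettime(CLOCK_MONOTONIC, &t1);

    volatile void *sink = p; (void)sink;       // keep the chain from being optimized out
    free(idx); free(buf);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    return ns / iters;
}

int main(void) {
    for (size_t kb = 16; kb <= 65536; kb *= 2)
        printf("%6zu KB: %.2f ns per load\n", kb, chase_ns(kb * 1024, 50000000));
    return 0;
}
```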
If another core traverses the same test array, L3 latency becomes uniformly high throughout its capacity range. Latency reaches the >90 cycle range even when the test array only slightly spills out of L2, suggesting the L3 is operating in shared mode. Surprisingly, data sharing between two cores in the same cluster triggers similar behavior. Perhaps the L3 enters shared mode when cachelines are in shared state, or can’t cache lines in shared state within a cluster’s private L3 partition.
This isn’t always an optimal strategy. For example, two cores sharing a 2 MB array would be better served with that array kept within private L3 partitions. Data duplication isn’t a problem if the L3 isn’t under capacity pressure in the first place. Lack of a special case for data sharing within a cluster is also baffling.
From one angle, Kunpeng 920’s partition mode is an advantage because it exploits how L3 banks are placed closer to certain cores. AMD, Intel, and most Arm chips have the same non-uniform L3 latency characteristics under the hood, but don’t try to place L3 data closer to the core using it. From another angle though, partition mode tries to cover for poor interconnect performance. Kunpeng 920 has worse L3 latency than Intel’s Sapphire Rapids when the L3 is operating in shared mode, or when a single core uses the entire L3. That’s brutal with just 512 KB of L2 per core. I lean towards the latter view because even core-private accesses to the closest L3 slice have the same cycle count latency as Zen 2’s L3, which distributes accesses across L3 banks. Zen 2 maintains uniformly low latency throughout the L3 capacity range with either a shared or private read pattern. Thus Kunpeng 920’s partition mode is best seen as a mechanism that can sometimes cover for a high latency interconnect.
A quad core TaiShan v110 cluster can achieve 21.7 GB/s of L3 read bandwidth, so Kunpeng 920 has cluster-level bandwidth pinch points much like Intel’s E-Core clusters. However, those pinch points are more severe on Kunpeng 920. Bandwidth contention from sibling cores within a cluster can bring L3 latency to over 80 ns. Intel’s design also sees a latency increase, as tested on Skymont, but has overall lower latency and higher bandwidth.
DRAM access is provided by a pair of dual channel DDR4 controllers positioned at the top and bottom edges of the compute die, which are connected to 32 GB of DDR4-2400 in the test setup. Read bandwidth was measured at 63 GB/s using a read-only pattern. Unloaded latency is good for a server chip at 96 ns, though latency quickly steps up to over 100 ns under moderate bandwidth load. Pushing bandwidth limits can send latency beyond 300 ns. While not great, it’s better controlled than the near 600 ns latency that Qualcomm Centriq can reach.
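Bandwidth figures like these typically come from having every core stream reads through its own large array. A simplified sketch using OpenMP is below; a tuned test would pin threads and use a vectorized read kernel with multiple accumulators, so treat this as an illustration of the method rather than the measurement behind the numbers above.

```c
// Rough multi-threaded read bandwidth sketch: each thread repeatedly sums its
// own large array, and total bytes read divided by wall time estimates DRAM
// read bandwidth. Compile with OpenMP enabled (e.g. -fopenmp).
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void) {
    const size_t per_thread = 32 * 1024 * 1024;    // 256 MB of doubles per thread
    const int passes = 8;
    double start = 0.0, total = 0.0;
    int threads = 0;

    #pragma omp parallel reduction(+:total)
    {
        double *a = malloc(per_thread * sizeof(double));
        for (size_t i = 0; i < per_thread; i++) a[i] = 1.0;

        #pragma omp barrier
        #pragma omp single
        {
            threads = omp_get_num_threads();
            start = omp_get_wtime();
        }   // implicit barrier here, so timing starts after all arrays are ready

        for (int p = 0; p < passes; p++)
            for (size_t i = 0; i < per_thread; i++)
                total += a[i];                     // read-only streaming pass
        free(a);
    }
    double secs = omp_get_wtime() - start;
    double bytes = (double)threads * per_thread * sizeof(double) * passes;
    printf("%d threads: %.1f GB/s (checksum %.0f)\n", threads, bytes / secs / 1e9, total);
    return 0;
}
```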
Kunpeng 920 delivers reasonable latency when bouncing cachelines within a quad core cluster. Cross-cluster accesses incur significantly higher latency, and likely vary depending on where the shared cacheline is homed.
Latency is higher for both the intra-cluster and cross-cluster cases compared to AMD’s Zen 2, at least on a desktop platform.
Cache-to-cache transfers are rare in practice, and I only run this test to show system topology and to provide clues on how cache coherency might be handled. Core to core latency is unlikely to have significant impact on application performance.
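A bounce test along those lines times handoffs of a cacheline between two pinned threads. The hedged Linux sketch below captures the idea with C11 atomics; the core numbers are placeholders, a real test sweeps every core pair, and the build needs -pthread.

```c
// Core-to-core "bounce" sketch: two threads pinned to different cores take
// turns incrementing a shared atomic, forcing the cacheline holding it to
// move between their private caches on every handoff. Total time divided by
// the number of handoffs approximates cache-to-cache transfer latency.
#define _GNU_SOURCE
#include <stdio.h>
#include <stdatomic.h>
#include <pthread.h>
#include <sched.h>
#include <time.h>

#define ITERS 1000000L

static _Atomic long flag = 0;

struct arg { int core; long parity; };

static void *worker(void *p) {
    struct arg *a = p;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(a->core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    for (long i = 0; i < ITERS; i++) {
        while ((atomic_load_explicit(&flag, memory_order_acquire) & 1) != a->parity)
            ;                                   // spin until it's this thread's turn
        atomic_fetch_add_explicit(&flag, 1, memory_order_release);
    }
    return NULL;
}

int main(void) {
    struct arg a0 = { .core = 0, .parity = 0 };
    struct arg a1 = { .core = 4, .parity = 1 };  // e.g. a core in another cluster
    pthread_t t0, t1;
    struct timespec s, e;

    clock_gettime(CLOCK_MONOTONIC, &s);
    pthread_create(&t0, NULL, worker, &a0);
    pthread_create(&t1, NULL, worker, &a1);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    clock_gettime(CLOCK_MONOTONIC, &e);

    double ns = (e.tv_sec - s.tv_sec) * 1e9 + (e.tv_nsec - s.tv_nsec);
    printf("~%.1f ns per handoff\n", ns / (2.0 * ITERS));
    return 0;
}
```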
HiSilicon’s TaiShan v110 is a 64-bit ARM (aarch64) core with 4-wide out-of-order execution. It’s Huawei’s first custom core design. While Huawei previously used Arm’s Cortex A57 and A72 in server SoCs, TaiShan v110 does not appear to have much in common with those older Arm designs.
The core has modest reordering capacity, three integer ALUs, a dual-pipe FPU, and can service two memory operations per cycle. It’s broadly comparable to Intel’s Goldmont Plus from a few years before, but is slightly larger than Goldmont Plus and enjoys a much stronger server-class memory subsystem. Arm’s Neoverse N1 is another point of comparison, because it’s another density-optimized aarch64 core implemented on TSMC’s 7nm node. Neoverse N1 is also 4-wide, but has a somewhat larger out-of-order engine.
Huawei’s publications say TaiShan v110 uses a “two-level dynamic branch predictor”. A two-level prediction uses the branch address and prior branch outcomes to index into a table of predicted branch outcomes. It’s a relatively simple prediction algorithm that fell out of favor in high performance designs as the 2010s rolled around. Huawei could also be referring to a “two-level” BTB setup, or a sub-predictor that creates two overriding levels. From a simple test with conditional branches that are either taken or not-taken in random patterns of varying length, TaiShan v110 behaves a bit like Arm’s Cortex A73.
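That kind of test gives a single conditional branch a repeating but random pattern of taken/not-taken outcomes, and watches when time per branch starts climbing as the pattern length exceeds what the predictor can track. A simplified C sketch is below; the real test uses generated assembly to control branch count and spacing exactly, which plain C cannot.

```c
// Branch predictor pattern-length sketch: a branch follows a random pattern
// that repeats every `len` iterations. Time per branch stays flat while the
// predictor can learn the pattern, then rises once mispredicts appear.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    const long reps = 100000;
    for (int len = 2; len <= 4096; len *= 2) {
        unsigned char *pattern = malloc(len);
        for (int i = 0; i < len; i++)
            pattern[i] = rand() & 1;            // random, but repeats every `len` branches

        volatile long taken = 0;
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long r = 0; r < reps; r++)
            for (int i = 0; i < len; i++)
                if (pattern[i])                 // the conditional branch under test
                    taken++;
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("pattern length %4d: %.2f ns per branch (taken %ld times)\n",
               len, ns / ((double)reps * len), (long)taken);
        free(pattern);
    }
    return 0;
}
```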
A 64 entry BTB provides taken branch targets with single cycle latency, allowing for zero-bubble taken branches. Past that, the branch predictor handles branches with 3 cycle latency as long as code fits within 32 KB and branches are not spaced too close together. Branches spaced by 16B or less incur an extra penalty cycle, and anything denser performs poorly. Spilling out of L1i dramatically increases taken branch latency. With one branch per 64B cacheline, latency reaches 11-12 cycles from L2, or beyond 38 cycles when code has to be fetched from L3. That roughly lines up with data-side L2 and private L3 latencies, suggesting the branch predictor is unable to run ahead of the rest of instruction fetch to drive prefetching.
A 31 entry return stack handles returns. For the more generalized case of indirect branches, an indirect predictor can track up to 16 targets per branch, or approximately 256 total indirect targets, before taking significant penalties.
Branch prediction accuracy in SPEC CPU2017 roughly matches that of Intel’s Goldmont Plus, though Goldmont Plus takes a win by the slimmest of margins. AMD’s Zen 2 from around the same time turns in a stronger performance, and shows what can be expected from a high performance core on TSMC’s 7nm node.
TaiShan v110 has a 64 KB instruction cache, which can supply the core with four instructions per cycle. Instruction-side address translations use a 32 entry iTLB, which is backed by a 1024 entry L2 TLB. The L2 TLB may be shared with data access, but I currently don’t have a test written to check that. Instruction fetch bandwidth drops sharply as code spills out of L1i, to 6-7 bytes per cycle on average. That makes L2 code read bandwidth somewhat worse than on Intel’s Goldmont Plus or Arm’s Neoverse N1. Code bandwidth from L3 is very poor and is about as bad as fetching instructions from DRAM on Goldmont Plus or Neoverse N1.
Instructions are decoded by a 4-wide decoder that translates them to micro-ops. Then, the core carries out register renaming and allocates other backend resources to track them and enable out-of-order execution. The renamer can carry out move elimination.
TaiShan v110 uses a PRF-based execution scheme, where register values are stored in physical register files and other structures store pointers to those register file entries. Reorder buffer capacity is similar to Intel’s Goldmont Plus, but TaiShan v110’s larger integer register file and memory ordering queues should put it ahead. The scheduler layout splits micro-ops into ALU, memory access, and FP/vector categories, and uses a separate unified scheduler for each. Each scheduler has approximately 33 entries. Goldmont Plus uses a distributed scheduler layout on the integer side, while Arm’s Neoverse N1 uses a purely distributed scheduler layout.
SPEC CPU2017’s workloads heavily pressure TaiShan v110’s schedulers. Scheduler entries are often a “hot” resource on any core, so that’s not unusual to see. Integer register file capacity is rarely an issue because the core has nearly enough integer registers to cover ROB capacity. TaiShan v110 renames flags (condition codes) in a separate register file with approximately 31 renames available. Flag renames are rarely an issue either, except in very branch heavy workloads like file compression and 505.mcf.
Floating point workloads put pressure on the FP/vector register file in addition to the FPU’s scheduler. TaiShan v110 likely has similar FP/vector register file capacity to Goldmont Plus, but has fewer available for renaming because aarch64 defines 32 FP/vector registers instead of 16 on x86-64. A larger register file would help better balance the core for FP and vector workloads.
TaiShan v110’s integer execution side has four ports. Three are general purpose ALUs that handle simple and common operations like integer adds and bitwise operations. Branches can go down two of those ports, though the core can only sustain one taken branch per cycle much like other cores of the era. The fourth port is specialized for multi-cycle integer operations like multiplies and divides. Integer multiplies execute with four cycle latency. Goldmont Plus and Neoverse N1 similarly have a 3+1 integer port setup, but both place branches on the fourth port instead of using it for multi-cycle operations. Putting branches on the fourth port may be slightly better for throughput, because branches tend to be more common than multi-cycle operations. Placing branches on a dedicated port also naturally prioritizes them because no other instruction category contends for the same port. That can help discover mispredicts faster, reducing wasted work. On the other hand, TaiShan v110’s layout likely simplifies scheduling by grouping ports by latency characteristics.
The FPU on TaiShan v110 has two ports, which is common for low power and density optimized designs of the period. Both ports can handle 128-bit vector fused multiply-add operations on FP32 elements. FP64 operations execute at quarter rate. FP32 FMA operations have 5 cycle latency. Strangely, FP32 adds and multiplies can only be serviced by a single port each, even though both have the same 5 cycle latency as FMA operations. Vector integer adds can use both ports and have 2 cycle latency. Only one port has a vector integer multiplier.
Two AGU ports generate memory addresses, and provide 4 cycle load-to-use latency for L1D hits. Latency increases by 1-2 cycles with indexed addressing. Virtual addresses from the AGUs are translated to physical addresses by a 32 entry fully associative data TLB. A 1024 entry L2 TLB handles larger memory footprints. Hitting the L2 TLB adds 11 cycles of latency, which is slow for a low-clocked core. AMD’s Zen 2 and Intel’s Goldmont Plus have 7 and 8 cycle L2 TLB latencies, respectively. Zen 2 notably has twice as much TLB capacity at 2048 entries, and can reach much higher clock speeds.
Load addresses have to be checked against prior store addresses to detect memory dependencies. Store forwarding has 6-7 cycle latency, which is remarkably maintained even when a store only partially overlaps a subsequent load. The core’s L1D appears to operate on 16B aligned blocks. Forwarding latency increases by 1-2 cycles when crossing a 16B boundary. Independent loads and stores can proceed in parallel as long as neither cross a 16B boundary.
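Store forwarding latency is usually measured with a dependency chain where each load consumes a value that was just stored, and the loaded value feeds the next store. The hedged sketch below only illustrates the idea; real tests use hand-written assembly for exact control over access widths and alignment, and sweeping the base address (not shown here) is what probes the 16B boundary behavior.

```c
// Store-to-load forwarding sketch: an 8 byte store followed by an overlapping
// 4 byte load, chained so each iteration depends on the previous one. Varying
// the load offset probes full overlap, partial overlap, and no-overlap cases.
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

int main(void) {
    _Alignas(64) uint8_t buf[128];
    memset(buf, 0, sizeof(buf));
    const long iters = 100000000;

    for (int offset = 0; offset <= 8; offset++) {
        volatile uint64_t *store_p = (volatile uint64_t *)&buf[16];
        volatile uint32_t *load_p  = (volatile uint32_t *)&buf[16 + offset];

        uint64_t v = 1;
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < iters; i++) {
            *store_p = v;                      // 8 byte store
            v = *load_p + 1;                   // dependent 4 byte load over the stored bytes
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("load offset +%d: %.2f ns per store->load round trip\n",
               offset, ns / iters);
    }
    return 0;
}
```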
TaiShan v110’s 64 KB data cache is 4-way set associative, and can service two 128-bit accesses per cycle. Both can be loads, and one can be a store. Data cache bandwidth is superior to Intel’s Goldmont Plus, which can also do two 128-bit accesses per cycle but is limited to one load and one store. Loads typically outnumber stores by a large margin, so TaiShan v110 should have a bandwidth advantage in practice. Neoverse N1, a newer density optimized design, has similar L1D bandwidth to TaiShan v110.
The L2 Cache on TaiShan v110 has 512 KB of capacity, and is private to each core. Even though cores are arranged in clusters of four, there is no cluster-level shared cache as on Intel’s E-Cores. The L2 has 10 cycle latency, making it faster in cycle count terms than Neoverse N1 or Zen 2.
The L2 can average about 20 bytes per cycle, indicating the core likely has a 32 byte per cycle interface between the L2 and L1D. Using a read-modify-write pattern did not increase bandwidth, so the L2 to L1D interface is likely not bidirectional. Still, there’s plenty of L2 bandwidth considering the core’s large L1D and modest vector capabilities.
TaiShan v110’s L2 seems designed for high performance at the expense of capacity. Goldmont Plus takes the opposite design, using a large 4 MB shared L2 because the L2 also serves as the last level cache. Huawei may have hoped to rely on dynamic L3 partitioning to reduce average L2 miss latency, which lets the L2 design focus on speed.
Huawei selected SPEC CPU2017’s integer suite as a metric to evaluate TaiShan v110, because its target market includes workloads that “involve extensive integer operations.” In single core testing, TaiShan v110 pulls ahead of Arm’s Cortex A72 and Intel’s Goldmont Plus by 22.5% and 7% respectively. It’s no doubt better than cores from prior generations. But its lead over Goldmont Plus is narrow considering TaiShan v110’s process node advantage, larger last level cache, and better DRAM controllers.
Comparing with TSMC 7nm peers puts TaiShan v110 in a tougher spot. Arm’s Neoverse N1 is 52.2% faster than TaiShan v110. AMD’s Zen 2 takes a massive lead, which is expected for a high performance design.
While TaiShan v110 is overall better than Goldmont Plus, it falls behind on 505.mcf, 525.x264, and 503.bwaves. In all three cases, TaiShan v110 suffered more mispredicts per instruction and worse branch prediction accuracy. Somehow, those tests challenged TaiShan v110’s predictor, even though its branch predictor achieved similar accuracy to Goldmont Plus in other subtests that stress the branch predictor, like 541.leela.
Neoverse N1 wins against TaiShan v110 in every subtest. Neoverse N1’s largest wins come from 505.mcf and 525.x264. The former sees Neoverse N1 get 15.03 branch MPKI compared to 16.64 on TaiShan v110. 505.mcf is very bound by backend memory accesses in addition to branch mispredicts, making it an overall nightmare for any CPU core. Ampere Altra’s cache setup and better branch predictor likely combine to let it outperform Kunpeng 920 by over 100%. The situation with 525.x264 is harder to understand. I suspect I got a bad run on Kunpeng 920 when getting a score report, because a subsequent run with performance counters suggests the score gap shouldn’t be so large based on achieved IPC, actual instruction counts, and clock speeds. However with time limits and a remote testing setup, there is no opportunity to follow up on that.
Regardless of what’s going on with 525.x264, Neoverse N1’s advantage is clear. N1 has an excellent branch predictor that’s nearly on par with the one in Zen 2 when taking the geomean of branch prediction accuracy across all subtests. Its out-of-order execution engine is only slightly larger than the one in TaiShan v110. But N1’s backend resources are better balanced. It has more integer-side scheduler entries and a larger FP/vector register file. TaiShan v110 often felt pressure in both areas. At the memory subsystem, any advantage from Kunpeng 920’s partition mode seems to be offset by Neoverse N1’s larger L2.
Kunpeng 920 hosts a collection of fascinating features. It’s an early adopter of TSMC’s 7nm node in the server world, beating Ampere Altra and Zen 2 server variants to market. It uses TSMC’s CoWoS packaging, at a time when AMD opted for simpler on-package traces and Ampere stuck with a monolithic design. Dynamic L3 behavior is a standout feature when others (besides IBM) only operated their L3 caches in the equivalent of “shared” mode. I’m sure many tech enthusiasts have looked at multi-bank L3 designs on Intel and AMD CPUs, and wondered whether they’d try to keep L3 data closer to the cores using it most. Well, Huawei tries to do exactly that.
Yet Kunpeng 920 struggles to convert these features into advantages. CoWoS’s high cross-die bandwidth seems wasted if the chip isn’t set up to behave like a monolithic design to software. The L3’s partition mode provides inconsistent performance depending on data sharing behavior. L3 performance is poor when a single core needs to use most of L3 capacity or if cores share data. Zen 2’s uniformly fast L3 is more consistent and higher performance even if it doesn’t take advantage of bank/core locality. Neoverse N1’s use of a larger L2 to insulate the core from L3 latency also looks like a better option in practice. Perhaps the only advantage that came through was the flexibility of its chiplet design, which lets Huawei cover a wider variety of product categories while reusing dies.
At the core level, it’s hard to escape the conclusion that TSMC’s 7nm advantages were wasted too. Neoverse N1 targeted similar goals on the same node, and did a better job. Arm’s talent in density optimized designs really shows through. They were able to cram larger structures into the same area, including a branch predictor with a 6K entry BTB and a bigger vector register file. They were able to better tune core structures to reduce backend resource stalls in latency bound workloads. And finally, they were able to give Neoverse N1 twice as much L2 capacity while keeping it all within the same area footprint as TaiShan v110. A comparison with AMD is much harder because of different design goals. But it’s interesting that Zen 2 achieves similar area efficiency while running at desktop clocks, even though increasing core width and reordering capacity runs into diminishing returns.
A comparison between those cores gives the impression that AMD and Arm took TSMC’s 7nm node and really made the most of it, while HiSilicon merely did an adequate job. But an adequate job may be enough. Huawei doesn’t need TaiShan v110 to go head-to-head with Neoverse N1 and Zen 2. It needs a decent core that can keep its business going. TaiShan v110 is perfectly capable of fulfilling that role. Perhaps more importantly, HiSilicon’s early uptake of advanced TSMC tech and willingness to experiment with dynamic L3 behavior shows that its engineers are not afraid to play aggressively. That means TaiShan v110 can serve as a springboard for future designs, providing a path to secure Huawei’s future.
TaiShan v110 port assignments for micro-ops from various instructions: https://github.com/qcjiang/OSACA/blob/feature/tsv110/osaca/data/tsv110.yml
BIOS settings for a Kunpeng 920 server indicating the L3 can be statically set to use shared or private mode: https://support.huawei.com/enterprise/zh/doc/EDOC1100088653/98b06651
Huawei Research’s publication on Kunpeng 920 (starts on page 126): https://www-file.huawei.com/-/media/corp2020/pdf/publications/huawei-research/2022/huawei-research-issue1-en.pdf
Shanghao Liu et al, Efficient Locality-aware Instruction Stream Scheduling for Stencil Computation on ARM Processors
Description of Kunpeng 920’s NUMA behavior on larger SKUs, indicating each compute die acts as a NUMA node: https://www.hikunpeng.com/document/detail/en/perftuning/progtuneg/kunpengprogramming_05_0004.html
2025-07-11 23:54:55
Today, we’re used to desktop processors with 16 or more cores, and several times that in server CPUs. But even prior to 2010, Intel and AMD were working hard to reach higher core counts. AMD’s “Magny Cours”, or Opteron 6000 series, was one such effort. Looking into how AMD approached core count scaling with pre-2010s technology should make for a fun retrospective. Special thanks goes to cha0shacker for providing access to a dual socket Opteron 6180 SE system.
A Magny Cours chip is basically two Phenom II X6 CPU dies side by side. Reusing the prior generation’s 6-core die reduces validation requirements and improves yields compared to taping out a new higher core count die. The two dies are connected via HyperTransport (HT) links, which previously bridged multiple sockets starting from the K8 generation. Magny Cours takes the same concept but runs HyperTransport signals through on-package PCB traces. Much like a dual socket setup, the two dies on a Magny Cours chip each have their own memory controller, and cores on a die enjoy faster access to locally attached memory. That creates a NUMA (Non-Uniform Memory Access) setup, though Magny Cours can also be configured to interleave memory accesses across nodes to scale performance with non-NUMA aware code.
Magny Cours’s in-package links have an awkward setup. Each die has four HT ports, each 16 bits wide and capable of operating in an “unganged” mode to provide two 8-bit sublinks. The two dies are connected via a 16-bit “ganged” link along with an 8-bit sublink from another port. AMD never supported using the 8-bit cross-die link though, perhaps because its additional bandwidth would be difficult to utilize and interleaving traffic across uneven links sounds complicated. Magny Cours uses Gen 3 HT links that run at up to 6.4 GT/s, so the two dies have 12.8 GB/s of bandwidth between them. Including the disabled 8-bit sublink would increase intra-package bandwidth to 19.2 GB/s.
With one and a half ports connected within a package, each die has 2.5 HT ports available for external connectivity. AMD decided to use those to give a G34 package four external HT ports. One is typically used for IO in single or dual socket systems, while the other three connect to the other socket.
Quad socket systems get far more complicated, and implementations can allocate links to prioritize either IO bandwidth or cross-socket bandwidth. AMD’s slides show a basic example where four 16-bit HT links are allocated to IO, but it’s also possible to have only two sockets connect to IO.
In the dual socket setup we’re testing with here, two ports operate in “ganged” mode and connect corresponding dies on the two sockets. The third port is “unganged” to provide a pair of 8-bit links, which connect die 0 on the first socket to die 1 on the second socket, and vice versa. That creates a fully connected mesh. The resulting topology resembles a square, with more link bandwidth along its sides and less across its diagonals.
Cross-node memory latency is 120-130 ns, or approximately 50-60 ns more than a local memory access. Magny Cours lands in the same latency ballpark as a newer Intel Westmere dual socket setup. Both dual socket systems from around 2010 offer significantly lower latencies for both local and remote accesses compared to modern systems. The penalty for a remote memory access over a local one is also lower, suggesting both the memory controllers and cross-socket links have lower latency.
Like prior AMD generations, Magny Cours’s memory controllers (MCTs) are responsible for ensuring coherency. They can operate in a broadcast mode, where the MCTs probe everyone with each memory request. While simple, this scheme creates a lot of probe traffic and increases DRAM latency because the MCTs have to wait for probe responses before returning data from DRAM. Most memory requests don’t need data from another cache, so AMD implemented an “HT assist” option that reserves 1 MB of L3 cache per die for use as a probe filter. The MCTs use the probe filter to remember which lines in its local address space are cached across the system and if so, what state they’re cached in.
Regardless of whether HT assist is enabled, Magny Cours’s MCTs are solely responsible for ensuring cache coherency. Therefore, core to core transfers must be orchestrated by the MCT that owns the cache line in question. Cores on the same die may have to exchange data through another die, if the cache line is homed to that other die. Transfers within the same die have about 180 ns of latency, with a latency increase of an extra ~50 ns to the other die within the same socket. In the worst case, latency can pass 300 ns when bouncing a cache line across three dies (two cores on separate dies, orchestrated by a memory controller on a third die).
For comparison, Intel’s slightly newer Westmere uses core valid bits at the L3 cache to act as a probe filter, and can complete core to core transfers within the same die even if the address is homed to another die. Core to core latencies are also lower across the board.
The bandwidth situation with AMD’s setup is quite complicated because it’s a quad-node system, as opposed to the dual-node Westmere setup with half as many cores. Magny Cours nodes connected via 16-bit HT links get about 5 GB/s of bandwidth between them, with slightly better performance over an intra-package link than over a cross-socket one. Cross-node bandwidth is lowest over the 8-bit “diagonal” cross-socket links, at about 4.4 GB/s.
From a simple perspective of how fast one socket can read data from another, the Opteron 6180 SE lands in the same ballpark as a Xeon X5650 (Westmere) system. Modern setups of course enjoy massively higher bandwidth, thanks to newer DDR versions, wider memory buses, and improved cross-socket links.
Having cores on both sockets read from the other’s memory pool brings cross-socket bandwidth to just over 17 GB/s, though getting that figure requires making sure the 16-bit links are used rather than the 8-bit ones. Repeating the same experiment but going over the 8-bit diagonal links only achieves 12.33 GB/s. I was able to push total cross-node bandwidth to 19.3 GB/s with a convoluted test where cores on each die read from memory attached to another die over a 16-bit link.
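Directing traffic over a particular link means controlling both where the reading threads run and which node backs their memory. A hedged single-threaded sketch with libnuma is below; the node numbers are illustrative, a real test runs one such reader per core with a faster read kernel, and the build needs -lnuma.

```c
// Read from another node's memory pool: run on one NUMA node while the buffer
// is allocated on a different node, so reads stream across a HyperTransport link.
#include <stdio.h>
#include <time.h>
#include <numa.h>

int main(void) {
    if (numa_available() < 0) {
        printf("libnuma reports no NUMA support\n");
        return 1;
    }
    const size_t bytes = 1ull << 30;              // 1 GB test buffer
    const int cpu_node = 0, mem_node = 2;         // e.g. a "diagonal" pair in a 4-node system

    numa_run_on_node(cpu_node);                   // keep this thread on one node's cores
    double *buf = numa_alloc_onnode(bytes, mem_node);
    size_t elems = bytes / sizeof(double);
    for (size_t i = 0; i < elems; i++) buf[i] = 1.0;  // touch pages so they get placed

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    double sum = 0.0;
    for (int pass = 0; pass < 4; pass++)
        for (size_t i = 0; i < elems; i++)
            sum += buf[i];                        // streaming reads from the remote node
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%.2f GB/s (checksum %.0f)\n", 4.0 * bytes / secs / 1e9, sum);
    numa_free(buf, bytes);
    return 0;
}
```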
NUMA-aware applications will of course try to keep memory accesses local, and minimize the more expensive accesses over HyperTransport links. I was able to achieve just over 48 GB/s of DRAM bandwidth across the system with all cores reading from their directly attached memory pools. That gives the old Opteron system similar DRAM bandwidth to a relatively modern Ryzen 3950X setup. Of course, the newer 16-core chip has massively higher cache bandwidth and doesn’t have NUMA characteristics.
Magny Cours’s on-die network, or Northbridge, bridges its six cores to the local memory controller and HyperTransport links. AMD’s Northbridge design internally consists of two crossbars, dubbed the System Request Interface (SRI) and XBAR. Cores connect to the SRI, while the XBAR connects the SRI with the memory controller and HyperTransport links. The two-level split likely reduces port count on each crossbar. 10h CPUs have a 32 entry System Request Queue between the SRI and XBAR, up from 24 entries in earlier K8-based Opterons. At the XBAR, AMD has a 56 entry XBAR Scheduler (XCS) that tracks commands from the SRI, memory controller, and HyperTransport links.
Crossbar: This topology is simple to build, and naturally provides an ordered network with low latency. It is suitable where the wire counts are still relatively small. This topology is suitable for an interconnect with a small number of nodes
Arm’s AMBA 5 CHI Architecture Specification (https://kolegite.com/EE_library/datasheets_and_manuals/FPGA/AMBA/IHI0050E_a_amba_5_chi_architecture_spec.pdf), on the tradeoffs between crossbar, ring, and mesh interconnects
The crossbar setup in AMD’s early on-die network does an excellent job of delivering low memory latency. Baseline memory latency with just a pointer chasing pattern is 72.2 ns. Modern server chips with more complicated interconnects often see memory latency exceed 100 ns.
As bandwidth demands increase, the interconnect does a mediocre job of ensuring a latency sensitive thread doesn’t get starved by bandwidth hungry ones. Latency increases to 177 ns with the five other cores on the same die generating bandwidth load, which is more than a 2x increase over unloaded latency. Other nodes connected via HyperTransport can generate even more contention on the local memory controller. With cores on another die reading from the same memory controller, bandwidth drops to 8.3 GB/s, while latency from a local core skyrockets to nearly 400 ns. Magny Cours likely suffers contention at multiple points in the interconnect. The most notable issue though is poor memory bandwidth compared to what the setup should be able to achieve.
Three cores are enough to reach a single die’s bandwidth limit on Magny Cours, which sits at approximately 10.4 GB/s. With dual channel DDR3-1333 on each node, the test system should have 21.3 GB/s of DRAM bandwidth per node, or 85.3 GB/s across all four nodes. However, bandwidth testing falls well short: even when using all 6 cores to read from a large array, a single die on the Opteron 6180 SE is barely better than a Phenom II X4 945 with DDR2-800, and slightly worse than a Phenom X4 9950 with fast DDR2. This could be down to the low 1.8 GHz northbridge clock and a narrow link to the memory controller (perhaps 64-bit), or insufficient queue entries at the memory controller to absorb DDR latency. Whatever the case, Magny Cours leaves much of DDR3’s potential bandwidth advantage on the table. This issue seems to be solved in Bulldozer, which can achieve well over 20 GB/s with a 2.2 GHz northbridge clock.
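As a sanity check, the theoretical per-node and system-wide figures above fall straight out of the DDR3 interface math:

```c
// Worked check of the theoretical DRAM bandwidth quoted above: dual channel
// DDR3-1333 means two 64-bit (8 byte) channels at 1333 MT/s per node.
#include <stdio.h>

int main(void) {
    const double transfers_per_sec = 1333e6;
    const double bytes_per_transfer = 8.0;       // 64-bit channel
    const int channels = 2, nodes = 4;

    double per_node = transfers_per_sec * bytes_per_transfer * channels;
    printf("per node:   %.1f GB/s\n", per_node / 1e9);          // ~21.3 GB/s
    printf("four nodes: %.1f GB/s\n", per_node * nodes / 1e9);  // ~85.3 GB/s
    return 0;
}
```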
Magny Cours is designed to deliver high core count, not maximize single threaded performance. The Opteron 6180 SE runs its cores at 2.5 GHz and northbridge at 1.8 GHz, which is slower than even first-generation Phenoms. With two desktop dies in the same package, AMD needs to use lower clock speeds to keep power under control. Single-threaded SPEC CPU2017 scores are therefore underwhelming. The Opteron 6180 SE comes in just behind Intel’s later Goldmont Plus core, which can clock up to 2.7 GHz.
AMD’s client designs from a few years later achieved much better SPEC CPU2017 scores. The older Phenom X4 9950 takes a tiny lead across both test suites. Its smaller 2 MB L3 cache is balanced out by higher clocks, with an overclock to 2.8 GHz really helping it out. Despite losing 1 MB of L3 for use as a probe filter and thus only having 5 MB of L3, the Opteron 6180 maintains a healthy L3 hit rate advantage over its predecessor.
The exact advantage of the larger L3 can vary of course. 510.parest is a great showcase of what a bigger cache can achieve, with the huge hit rate difference giving the Opteron 6180 SE a 9.4% lead over the Phenom X4 9950. Conversely, 548.exchange2 is a high IPC test with a very tiny data footprint. In that subtest, the Phenom X4 9950 uses its 12% clock speed advantage to gain an 11% lead.
Technologies available in the late 2000s made core count scaling a difficult task. Magny Cours employed a long list of techniques to push core counts higher while keeping cost under control. Carving a snoop filter out of L3 capacity helps reduce die area requirements. At a higher level, AMD reuses a smaller die across different market segments, and uses multiple instances of the same die to scale core counts. As a result, AMD can keep fewer, smaller dies in production. In some ways, AMD’s strategy back then has parallels to their current one, which also seeks to reuse a smaller die across different purposes.
While low cost, AMD’s approach has downsides. Because each hexacore die is a self-contained unit with its own cores and memory controller, Magny Cours relies on NUMA-aware software for optimal performance. Intel also has to deal with NUMA characteristics when scaling up core counts, but their octa-core Nehalem-EX and 10-core Westmere EX provide more cores and more memory bandwidth in each NUMA node. AMD’s HyperTransport and low latency northbridge do get credit for keeping NUMA cross-node costs low, as cross-node memory accesses have far lower latency than in modern designs. But AMD still relies more heavily on software written with NUMA in mind than Intel.
Digging deeper reveals quirks with Magny Cours. Memory bandwidth is underwhelming for a DDR3 system, and the Northbridge struggles to maintain fairness under high bandwidth load. A four-node system might be fully connected, but “diagonal” links have lower bandwidth. All of that makes the system potentially more difficult to tune for, especially compared to a modern 24-core chip with uniform memory access. Creating a high core count system was quite a challenge in the years leading up to 2010, and quirks are expected.
But Magny Cours evidently worked well enough to chart AMD’s scalability course for the next few generations. AMD would continue to reuse a small die across server and client products, and scaled core count by scaling die count. AMD kept this general setup of up to four sockets with two dies each for its Bulldozer Opterons, only moving on with Zen 1. Zen 1 is perhaps the ultimate evolution of the Magny Cours strategy, hitting four dies per socket and replacing the Northbridge/HyperTransport combination with Infinity Fabric components. Today, AMD continues to use a multi-die strategy, though with more die types (IO dies and CCDs) to provide more uniform memory access.
If you like the content then consider heading over to the Patreon or PayPal if you want to toss a few bucks to Chips and Cheese. Also consider joining the Discord.
2025-07-07 04:56:17
Lion Cove is Intel’s latest high performance CPU architecture. Compared to its predecessor, Raptor Cove, Intel’s newest core can sustain more instructions per cycle, reorganizes the execution engine, and adds an extra level to the data cache hierarchy. The list of changes goes on, with tweaks to just about every part of the core pipeline. Lion Cove does well in the standard SPEC CPU2017 benchmark suite, where it posts sizeable gains especially in higher IPC subtests. In the Arrow Lake desktop platform, Lion Cove can often go head-to-head against AMD’s Zen 5, and posts an overall lead over Intel’s prior Raptor Cove while pulling less power. But a lot of enthusiasts are interested in gaming performance, and games have different demands from productivity workloads.
Here, I’ll be running a few games while collecting performance monitoring data. I’m using the Core Ultra 9 285K with DDR5-6000 28-36-36-96, which is the fastest memory I have available. E-Cores are turned off in the BIOS, because setting affinity to P-Cores caused massive stuttering in Call of Duty. In Cyberpunk 2077, I’m using the built-in benchmark at 1080P and medium settings, with upscaling turned off. In Palworld, I’m hanging out near a base, because CPU load tends to be higher with more entities around.
Gaming workloads generally fall at the low end of the IPC range. Lion Cove can sustain eight micro-ops per cycle, which roughly corresponds to eight instructions per cycle because most instructions map to a single micro-op. It posts very high IPC figures in several SPEC CPU2017 tests, with some pushing well past 4 IPC. Games however get nowhere near that, and find company with lower IPC tests that see their performance limited by frontend and backend latency.
Top-down analysis characterizes how well an application is utilizing a CPU core’s width, and accounts for why pipeline slots go under-utilized. This is usually done at the rename/allocate stage, because it’s often the narrowest stage in the core’s pipeline, which means throughput lost at that stage can’t be recovered later. To briefly break down the reasons:
Bad Speculation: Slot was utilized, but the core was going down the wrong path. That’s usually due to a branch mispredict.
Frontend Latency: Frontend didn’t deliver any micro-ops to the renamer that cycle
Frontend Bandwidth: The frontend delivered some micro-ops, but not enough to fill all renamer slots (eight on Lion Cove)
Core Bound: The backend couldn’t accept more micro-ops from the frontend, and the instruction blocking retirement isn’t a memory load
Backend Memory Bound: As above, but the instruction blocking retirement is a memory load. Intel only describes the event as “TOPDOWN.MEMORY_BOUND_SLOTS” (event 0xA4, unit mask 0x10), but AMD and others explicitly use the criteria of a memory load blocking retirement for their corresponding metrics. Intel likely does the same.
Retiring: The renamer slot was utilized and the corresponding micro-op was eventually retired (useful work)
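For readers who want to see how those buckets fall out of raw counts, here’s a minimal sketch of the bookkeeping, using made-up counter values rather than anything measured on Lion Cove:

```python
# Minimal sketch of top-down accounting at the rename/allocate stage.
# Counter names and values are hypothetical; real implementations use
# vendor-specific events rather than these placeholders.

RENAME_WIDTH = 8  # renamer slots per cycle on Lion Cove

counters = {
    "cycles":                1_000_000,
    "uops_issued":           2_600_000,   # micro-ops that passed rename/allocate
    "uops_retired":          2_500_000,   # micro-ops that eventually retired
    "fe_starved_cycles":       180_000,   # cycles where the frontend delivered nothing
    "fe_partial_lost_slots":   300_000,   # empty slots on cycles with partial delivery
    "backend_stall_slots":   3_660_000,   # slots where the backend refused micro-ops
}

total_slots = RENAME_WIDTH * counters["cycles"]

breakdown = {
    "retiring":           counters["uops_retired"] / total_slots,
    "bad_speculation":   (counters["uops_issued"] - counters["uops_retired"]) / total_slots,
    "frontend_latency":  (RENAME_WIDTH * counters["fe_starved_cycles"]) / total_slots,
    "frontend_bandwidth": counters["fe_partial_lost_slots"] / total_slots,
    "backend_bound":      counters["backend_stall_slots"] / total_slots,
}

for bucket, fraction in breakdown.items():
    print(f"{bucket:>18}: {fraction:6.1%}")
```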
Core width is poorly utilized, as implied by the IPC figures above. Backend memory latency accounts for a plurality of lost pipeline slots, though there’s room for improvement in instruction execution latency (core bound) and frontend latency as well. Bad speculation and frontend bandwidth are not major issues.
Lion Cove has a 4-level data caching setup, with the L1 data cache split into two levels. I’ll be calling those L1 and L1.5 for simplicity, because the second level of the L1 lands between the first level and the 3 MB L2 cache in capacity and performance.
Lion Cove’s L1.5 catches a substantial portion of L1 misses, though its hitrate isn’t great in absolute terms. It gives off some RDNA 128 KB L1 vibes, in that it takes some load off the L2 but often has mediocre hitrates. L2 hitrate is 49.88%, 71.87%, and 50.98% in COD, Palworld, and Cyberpunk 2077 respectively. Cumulative hitrate for the L1.5 and L2 comes in at 75.54%, 85.05%, and 85.83% across the three games. Intel’s strategy of using a larger L2 to keep traffic off L3 works to a certain extent, because most L1 misses are serviced without leaving the core.
However, memory accesses that do go to L3 and DRAM are very expensive. Lion Cove can provide an idea of how often each level in the memory hierarchy limits performance. Specifically, performance monitoring events count cycles where no micro-ops were ready to execute, a load was pending from a specified cache level, and no loads missed that level of cache. For example, a cycle would be L3 bound if the core was waiting for data from L3, wasn’t also waiting for data from DRAM, and all pending instructions queued up in the core were blocked waiting for data. An execute stage stall doesn’t imply performance impact, because the core has more execution ports than renamer slots. The execute stage can race ahead after stalling for a few cycles without losing average throughput. So, this is a measurement of how hard the core has to cope, rather than whether it was able to cope.
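As a sketch of the classification just described, with made-up per-cycle flags standing in for the real performance events:

```python
# Sketch of the per-cycle "bound" classification. A stalled cycle is blamed
# on the slowest cache level with an outstanding demand load, provided no
# load is pending from a level further out.

def classify_cycle(uops_executed: int, pending_levels: set[str]) -> str:
    """pending_levels holds the levels with outstanding demand loads this
    cycle, e.g. {"L2", "L3"} means loads are waiting on both L2 and L3."""
    if uops_executed > 0:
        return "not stalled"
    for level in ("DRAM", "L3", "L2", "L1"):
        if level in pending_levels:
            return f"{level} bound"
    return "stalled, no loads pending"

print(classify_cycle(0, {"L3"}))          # L3 bound
print(classify_cycle(0, {"L3", "DRAM"}))  # DRAM bound, not L3 bound
print(classify_cycle(3, {"DRAM"}))        # not stalled: execution made progress
```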
Intel’s performance events don’t distinguish between L1 and L1.5, so both are counted as “L1 Bound” in the graph above. The L1.5 seems to move enough accesses off L2 to minimize the effect of L2 latency. Past L2 though, L3 and DRAM performance have a significant impact. L2 misses may be rare in an absolute sense, but they’re not quite rare enough considering the high cost of a L3 or DRAM access.
Lion Cove and the Arrow Lake platform can monitor queue occupancy at various points in the memory hierarchy. Dividing occupancy by request count provides average latency in cycles, giving an idea of how much latency the core has to cope with in practice.
Count occurrences (rising-edge) of DCACHE_PENDING sub-event0. Impl. sends per-port binary inc-bit the occupancy increases* (at FB alloc or promotion).
Intel’s description for the L1D_MISS.LOAD event, which unhelpfully doesn’t indicate which level of the L1 it counts for.
These performance monitoring events can be confusing. The L1D_MISS.LOAD event (event 0x49, unit mask 1) increments when loads miss the 48 KB L1D. However, the corresponding L1D_PENDING.LOAD event (event 0x48, unit mask 1) only accounts for loads that miss the 192 KB L1.5. Using both events in combination treats L1.5 hits as zero latency. It does accurately account for latency to L2 and beyond, though only from the perspective of a queue between the L1.5 and L2.
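Putting those two events together, the occupancy-divided-by-requests trick from earlier looks something like this, with illustrative values rather than measurements:

```python
# Applying occupancy / request-count to the events described above.
# Values are illustrative, not measured.

l1d_miss_load    = 40_000_000     # loads that missed the 48 KB L1D (event 0x49, umask 1)
l1d_pending_load = 1_900_000_000  # cycle-occupancy of loads past the 192 KB L1.5
                                  # (event 0x48, umask 1)

# Average latency per L1D miss, in core cycles. Because the occupancy event
# only tracks loads that also missed the L1.5, L1.5 hits effectively count
# as zero latency and drag the average down.
avg_miss_latency_cycles = l1d_pending_load / l1d_miss_load
print(f"~{avg_miss_latency_cycles:.1f} core cycles per L1D miss (L1.5 hits counted as 0)")
```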
Measuring latency at the arbitration queue (ARB) can be confusing in a different way. The ARB runs at the CPU tile’s uncore clock, or 3.8 GHz. That’s well below the 5.7 GHz maximum CPU core clock, so the ARB will see fewer cycles of latency than the CPU core does. Therefore, I’m adding another set of bars with post-ARB latency multiplied by 5.7/3.8, to approximate latency in CPU core cycles.
Another way to get a handle on latency is to multiply by cycle time to approximate actual latency. Clocks aren’t static on Arrow Lake, so there’s an additional margin of error. But doing so does show latency past the ARB remains well controlled, so DRAM bandwidth isn’t a concern. If games were approaching DRAM bandwidth limits, latency would go much higher as requests start piling up at the ARB queue and subsequent points in the chip’s interconnect.
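A small sketch of both adjustments, using a made-up post-ARB cycle count:

```python
# Converting post-ARB queue latency between clock domains, as described above.
# The measured cycle count here is illustrative.

UNCORE_CLK_GHZ = 3.8   # ARB / CPU-tile uncore clock
CORE_CLK_GHZ   = 5.7   # maximum core clock on the 285K

arb_latency_uncore_cycles = 400   # occupancy / requests at the ARB (made-up value)

# The same wall-clock latency expressed in core cycles, which is what the core "feels"
arb_latency_core_cycles = arb_latency_uncore_cycles * (CORE_CLK_GHZ / UNCORE_CLK_GHZ)

# Or approximate absolute latency by multiplying by the uncore cycle time
arb_latency_ns = arb_latency_uncore_cycles / UNCORE_CLK_GHZ

print(f"{arb_latency_core_cycles:.0f} core cycles, ~{arb_latency_ns:.0f} ns")
```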
Much of the action happens at the backend, but Lion Cove loses some throughput at the frontend too. Instruction-side accesses tend to be more predictable than data-side ones, because instructions are executed sequentially until the core reaches a branch. That means accurate branch prediction can let the core hide frontend latency.
Lion Cove’s branch predictor enjoys excellent accuracy across all three games. Mispredicts however can still be an issue. Just as the occasional L3 or DRAM access can be problematic because they’re so expensive, recovering from a branch mispredict can hurt too. Because a mispredict breaks the branch predictor’s ability to run ahead, it can expose the core to instruction-side cache latency. Fetching the correct branch target from L2 or beyond could add dozens of cycles to mispredict recovery time. Ideally, the core would contain much of an application’s code footprint within the fastest instruction cache levels to minimize that penalty.
Lion Cove’s frontend can draw micro-ops from four sources. The loop buffer, or Loop Stream Detector (LSD), and the microcode sequencer play minor roles. Most micro-ops come from the micro-op cache, or Decoded Stream Buffer (DSB). Even though the op cache delivers a majority of micro-ops, it’s not large enough to serve as the core’s primary instruction cache. Lion Cove gets a 64 KB instruction cache, carried over from Redwood Cove. Intel no longer documents events that would allow a direct L1i hitrate calculation. However, older events from before Alder Lake still appear to work. Micro-op cache hits are counted as instruction cache hits from testing with microbenchmarks. Therefore, figures below indicate how often instruction fetches were satisfied without going to L2.
The 64 KB instruction cache does its job, keeping the vast majority of instruction fetches from reaching L2. Code hitrate from L2 is lower, likely because accesses that miss L1i have worse locality in the first place. Instructions also have to contend with data for L2 capacity. L2 code misses don’t happen too often, but can be problematic just as on the data side because of the dramatic latency jump.
Among the three games here, Cyberpunk 2077’s built-in benchmark has better code locality, while Palworld suffers the most. That’s reflected in average instruction-side latency seen by the core. When running Palworld, Lion Cove takes longer to recover from pipeline resteers, which largely come from branch mispredicts. Recovery time here refers to cycles elapsed until the renamer issues the first micro-op from the correct path.
Offcore code read latency can be tracked in the same way as demand data reads. Latency is lower than on the data side, suggesting higher code hitrate in L3. However, hiding hundreds of cycles of latency is still a tall order for the frontend, just as it is for the backend. Again, Lion Cove’s large L2 does a lot of heavy lifting.
Performance counters provide insight into other delays as well. Recovery from a pipeline resteer starts with the renamer (allocator) restoring a checkpoint with known-good state[1], which takes 3-4 cycles and, as expected, doesn’t change across the three games. Lion Cove can also indicate how often the instruction fetch stage stalls. Setting the edge/cmask bits can indicate how long each stall lasts. However, it’s hard to determine the performance impact from L1i misses because the frontend has deep queues that can hide L1i miss latency. Furthermore, an instruction fetch stall can overlap with a backend resource stall.
While pipeline resteers seem to account for the bulk of frontend-related throughput losses, other reasons can contribute too. Structures within the branch predictor can override each other, for example when a slower BTB level overrides a faster one (BPClear). Large branch footprints can exceed the branch predictor’s capacity to track them, and cause a BAClear in Intel terminology. That’s when the frontend discovers a branch not tracked by the predictor, and must redirect instruction fetch from a later stage. Pipeline bubbles from both sources have a minor impact, so Lion Cove’s giant 12K entry BTB does a good job.
In a latency bound workload like gaming, the retirement stage operates in a feast-or-famine fashion. Most of the time it can’t do anything. That’s probably because a long latency instruction is blocking retirement, or the ROB is empty from a very costly mispredict. When the retirement stage is unblocked, throughput resembles a bathtub curve. Often it crawls forward with most retire slots idle. The retirement stage spends very few cycles retiring at medium-high throughput.
Likely, retirement is either crawling forward in core-bound scenarios when a short latency operation completes and unblocks a few other micro-ops that complete soon after, or is bursting ahead after a long latency instruction completes and unblocks retirement for a lot of already completed instructions.
Lion Cove can retire up to 12 micro-ops per cycle. Once it starts using its full retire width, the core on average blasts through 28 micro-ops before getting blocked again.
Compared to Zen 4, Lion Cove suffers harder with backend memory latency, but far less from frontend latency. Part of this can be explained by Zen 4’s stronger data-side memory subsystem. The AMD Ryzen 9 7950X3D I previously tested on has 96 MB of L3 cache on the first die, and has lower L3 latency than Lion Cove in Intel’s Arrow Lake platform. Beyond L3, AMD achieves better load-to-use latency even with slower DDR5-5600 36-36-36-89 memory. Intel’s interconnect became more complex when they shifted to a chiplet setup, and there’s clearly some work to be done.
Lion Cove gets a lot of stuff right as well, because the core’s frontend is quite strong. The larger BTB and larger instruction cache compared to Zen 4 seem to do a good job of keeping code fetches off slower caches. Lion Cove’s large L2 gets credit too. It’s not perfect, because the occasional instruction-side L2 miss has an average latency in the hundreds of cycles range. But Intel’s frontend improvements do pay off.
Even though Intel and AMD have different relative strengths, a constant factor is that games are difficult, low IPC workloads. They have large data-side footprints with poor access locality. Instruction-side accesses are difficult too, though not to the same extent because modern branch predictors can mostly keep up. Both factors together mean many pipeline slots go unused. Building a wider core brings little benefit because getting through instructions isn’t the problem. Rather, the challenge is in dealing with long stalls as the core waits for data or instructions to arrive from lower level caches or DRAM. Intel’s new L1.5 likely has limited impact as well. It does convert some already fast L2 hits into even faster accesses, but it doesn’t help with long stalls as the core waits for data from L3 or DRAM.
Comparing games to SPEC CPU2017 also emphasizes that games aren’t the only workloads out there. Wider cores with faster upper level caches can pay off in a great many SPEC CPU2017 tests, especially those with very high IPC. Conversely, a focus on improving DRAM performance or increasing last level cache capacity would provide minimal gains for workloads that already fit in cache. Optimization strategies for different workloads are often in conflict, because engineers must decide where to allocate a limited power and area budget. They have limited time to determine the best tradeoff too. Intel, AMD, and others will continue to tune their CPU designs to meet expected workloads, and it’ll be fun to see where they go.
If you like the content then consider heading over to the Patreon or PayPal if you want to toss a few bucks to Chips and Cheese. Also consider joining the Discord.
Henry Wong suggests the INT_MISC.RECOVERY_CYCLES event, which is present on Lion Cove as well as Haswell, accounts for time taken for a mapping table recovery. The renamer maintains a register alias table (mapping) that maps architectural registers to renamed physical registers. Going back to a known good state would mean restoring a previous version of the table prior to the mispredicted branch. https://www.stuffedcow.net/files/henry-thesis-phd.pdf
2025-06-29 08:26:53
Nvidia has a long tradition of building giant GPUs. Blackwell, their latest graphics architecture, continues that tradition. GB202 is the largest Blackwell die. It occupies a massive 750mm2 of area, and has 92.2 billion transistors. GB202 has 192 Streaming Multiprocessors (SMs), the closest equivalent to a CPU core on a GPU, and feeds them with a massive memory subsystem. Nvidia’s RTX PRO 6000 Blackwell features the largest GB202 configuration to date. It sits alongside the RTX 5090 in Nvidia’s lineup, which also uses GB202 but disables a few more SMs.
A high level comparison shows the scale of Nvidia’s largest Blackwell products. AMD’s RDNA4 line tops out with the RX 9070 and RX 9070XT. The RX 9070 is slightly cut down, with four WGPs disabled out of 32. I’ll be using the RX 9070 to provide comparison data.
A massive thanks to Will Killian for giving us access to his RTX PRO 6000 Blackwell system to test for this article!
GPUs use specialized hardware to launch threads across their cores, unlike CPUs that rely on software scheduling in the operating system. Hardware thread launch is well suited to the short and simple tasks that often characterize GPU workloads. Streaming Multiprocessors (SMs) are the basic building block of Nvidia GPUs, and are roughly analogous to a CPU core. SMs are grouped into Graphics Processing Clusters (GPCs), which contain a rasterizer and associated work distribution hardware.
GB202 has a 1:16 GPC to SM ratio, compared to the 1:12 ratio found in Ada Lovelace’s largest AD102 die. That lets Nvidia cheaply increase SM count and thus compute throughput without needing more copies of GPC-level hardware. However, dispatches with short-duration waves may struggle to take advantage of Blackwell’s scale, as throughput becomes limited by how fast the GPC can allocate work to the SMs rather than how fast the SMs can finish them.
AMD’s RDNA4 uses a 1:8 SE:WGP ratio, so one rasterizer feeds a set of eight WGPs in a Shader Engine. WGPs on AMD are the closest equivalent to SMs on Nvidia, and have the same nominal vector lane count. RDNA4 will be easier to utilize with small dispatches and short duration waves, but it’s worth noting that Blackwell’s design is not out of the ordinary. Scaling up GPU “cores” independently of work distribution hardware is a common technique for building larger GPUs. AMD’s RX 6900XT (RDNA2) had a 1:10 SE:WGP ratio. Before that, AMD’s largest GCN implementations like Fury X and Vega 64 had a 1:16 SE:CU ratio (CUs, or Compute Units, formed the basic building block of GCN GPUs). While Blackwell does have the same ratio as those large GCN parts, it enjoys higher clock speeds and likely has a higher wave launch rate to match per-GPU-core throughput. It won’t suffer as much as the Fury X from 10 years ago with short duration waves, but GB202 will still be harder to feed than smaller GPUs.
Although Nvidia didn’t scale up work distribution hardware, they did make improvements on Blackwell. Prior Nvidia generations could not overlap workloads of different types on the same queue. Going between graphics and compute tasks would require a “subchannel switch” and a “wait-for-idle”. That requires one task on the queue to completely finish before the next can start, even if a game doesn’t ask for synchronization. Likely, higher level scheduling hardware that manages queues exposed to host-side applications can only track state for one workload type at a time. Blackwell does away with subchannel switches, letting it more efficiently fill its shader array if applications frequently mix different work types on the same queue.
Once assigned work, the SM’s frontend fetches shader program instructions and delivers them to the execution units. Blackwell uses fixed length 128-bit (16 byte) instructions, and the SM uses a two-level instruction caching setup. Both characteristics are carried forward from Nvidia’s post-Turing/Volta designs. Each of the SM’s four partitions has a private L0 instruction cache, while a L1 instruction cache is shared across the SM.
Nvidia’s long 16 byte instructions translate to high instruction-side bandwidth demands. The L0+L1 instruction caching setup is likely intended to handle those bandwidth demands while also maintaining performance with larger code footprints. Each L0 only needs to provide one instruction per cycle, and its smaller size should make it easier to optimize for high bandwidth and low power. The SM-level L1 can then be optimized for capacity.
Blackwell’s L1i is likely 128 KB, from testing with unrolled loops of varying sizes and checking generated assembly (SASS) to verify the loop’s footprint. The L1i is good for approximately 8K instructions, providing a good boost over prior Nvidia generations.
Blackwell and Ada Lovelace both appear to have 32 KB L0i caches, which is an increase over 16 KB in Turing. The L1i can fully feed a single partition, and can filter out redundant instruction fetches to feed all four partitions. However, L1i bandwidth can be a visible limitation if two waves on different partitions spill out of L1i and run different code sections. In that case, per-wave throughput drops to one instruction per two cycles.
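With fixed 16-byte instructions, capacity in instructions follows directly from capacity in bytes. A quick sketch using the estimated capacities above:

```python
# Relating instruction cache capacity to code footprint with Blackwell's
# fixed 16-byte instructions. Capacities are the estimates discussed above.

INSTR_BYTES = 16

for name, capacity_kb in (("L0i (per partition)", 32), ("L1i (per SM)", 128)):
    instructions = capacity_kb * 1024 // INSTR_BYTES
    print(f"{name}: ~{instructions} instructions ({capacity_kb} KB)")

# An unrolled test loop that grows past ~8K instructions spills out of the
# L1i and its per-iteration cost jumps, which is how the 128 KB estimate
# was made.
```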
AMD uses variable length instructions ranging from 4 to 12 bytes, which lowers capacity and bandwidth pressure on the instruction cache compared to Nvidia. RDNA4 has a 32 KB instruction cache shared across a Workgroup Processor (WGP), much like prior RDNA generations. Like Nvidia’s SM, a WGP is divided into four partitions (SIMDs). RDNA1’s whitepaper suggests the L1i can supply 32 bytes per cycle to each SIMD. It’s an enormous amount of bandwidth considering AMD’s more compact instructions. Perhaps AMD wanted to be sure each SIMD could co-issue to its vector and scalar units in a sustained fashion. In this basic test, RDNA4’s L1i has no trouble maintaining full throughput when two waves traverse different code paths. RDNA4 also enjoys better code read bandwidth from L2, though all GPUs I’ve tested show poor L2 code read bandwidth.
Each Blackwell SM partition can track up to 12 waves to hide latency, which is a bit lower than the 16 waves per SIMD on RDNA4. Actual occupancy, or the number of active wave slots, can be limited by a number of factors including register file capacity. Nvidia has not changed theoretical occupancy or register file capacity since Ampere, and the latter remains at 64 KB per partition. A kernel therefore can’t use more than 40 registers while using all 12 wave slots, assuming allocation granularity hasn’t changed since Ada and is still 8 registers. For comparison, AMD’s high end RDNA3/4 SIMDs have 192 KB vector register files, letting a kernel use up to 96 registers while maintaining maximum occupancy.
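The register budget math works out as follows; the 8-register allocation granularity is an assumption carried over from Ada:

```python
# Register-file budget per wave slot at full occupancy. Each 32-bit register
# takes 4 bytes per lane, and 32 lanes per wave.

def max_regs_per_thread(regfile_kb: int, wave_slots: int, lanes: int = 32,
                        granularity: int = 8) -> int:
    regs = (regfile_kb * 1024) // (wave_slots * lanes * 4)
    return regs - (regs % granularity)   # round down to the allocation granularity

print(max_regs_per_thread(64, 12))    # Blackwell partition: 40 registers at full occupancy
print(max_regs_per_thread(192, 16))   # RDNA3/4 SIMD: 96 registers at full occupancy
```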
Blackwell’s primary FP32 and INT32 execution pipelines have been reorganized compared to prior generations, and are internally arranged as one 32-wide execution pipe. That creates similarities to AMD’s RDNA GPUs, as well as Nvidia’s older Pascal. Having one 32-wide pipe handle both INT32 and FP32 means Blackwell won’t have to stall if it encounters a long stream of operations of the same type. Blackwell inherits Turing’s strength of being able to do 16 INT32 multiplies per cycle in each partition. Pascal and RDNA GPUs can only do INT32 multiplies at approximately quarter rate (8 per partition, per cycle).
Compared to Blackwell, AMD’s RDNA4 packs a lot of vector compute into each SIMD. Like RDNA3, RDNA4 can use VOPD dual issue instructions or wave64 mode to complete 64 FP32 operations per cycle in each partition. An AMD SIMD can also co-issue instructions of different types from different waves, while Nvidia’s dispatch unit is limited to one instruction per cycle. RDNA4’s SIMD also packs eight special function units (SFUs) compared to four on Nvidia. These units handle more complex operations like inverse square roots and trigonometric functions.
Differences in per-partition execution unit layout or count quickly get pushed aside by Blackwell’s massive SM count. Even when the RX 9070 can take advantage of dual issue, 28 WGPs cannot take on 188 SMs. Nvidia holds a large lead in every category.
Nvidia added floating point instructions to Blackwell’s uniform datapath, which dates back to Turing and serves a similar role to AMD’s scalar unit. Both offload instructions that are constant across a wave. Blackwell’s uniform FP instructions include adds, multiplies, FMAs, min/max, and conversions between integer and floating point. Nvidia’s move mirrors AMD’s addition of FP scalar instructions with RDNA 3.5 and RDNA4.
Still, Nvidia’s uniform datapath feels limited compared to AMD’s scalar unit. Uniform registers can only be loaded from constant memory, though curiously a uniform register can be written out to global memory. I wasn’t able to get Nvidia’s compiler to emit uniform instructions for the critical part of any instruction or cache latency tests, even when loading values from constant memory.
Raytracing has long been a focus of Nvidia’s GPUs. Blackwell doubles the per-SM ray triangle intersection test rate, though Nvidia does not specify what the box or triangle test rate is. Like Ada Lovelace, Blackwell’s raytracing hardware supports “Opacity Micromaps”, providing functionality similar to the sub-triangle opacity culling referenced by Intel’s upcoming Xe3 architecture.
Like Ada Lovelace and Ampere, Blackwell has a SM-wide 128 KB block of storage that’s partitioned for use as L1 cache and Shared Memory. Shared Memory is Nvidia’s term for a software managed scratchpad, which backs the local memory space in OpenCL. AMD’s equivalent is the Local Data Share (LDS), and Intel’s is Shared Local Memory (SLM). Unlike with their datacenter GPUs, Nvidia has chosen not to increase L1/Shared Memory capacity. As in prior generations, different L1/Shared Memory splits do not affect L1 latency.
AMD’s WGPs use a more complex memory subsystem, with a high level design that debuted in the first RDNA generation. The WGP has a 128 KB LDS that’s internally built from a pair of 64 KB, 32-bank structures connected by a crossbar. First level vector data caches, called L0 caches, are private to pairs of SIMDs. A WGP-wide 16 KB scalar cache services scalar and constant reads. In total, a RDNA4 WGP has 208 KB of data-side storage divided across different purposes.
A RDNA4 WGP enjoys substantially higher bandwidth from its private memories for global and local memory accesses. Each L0 vector cache can deliver 128 bytes per cycle, and the LDS can deliver 256 bytes per cycle total to the WGP. Mixing local and global memory traffic can further increase achievable bandwidth, suggesting the LDS and L0 vector caches have separate data buses.
Doing the same on Nvidia does not bring per-SM throughput past 128B/cycle, suggesting the 128 KB L1/Shared Memory block has a single 128B path to the execution units.
Yet any advantage AMD may enjoy from this characteristic is blunted as the RX 9070 drops clocks to 2.6 GHz, in order to stay within its 220W power target. Nvidia in contrast has a higher 600W power limit, and can maintain close to maximum clock speeds while delivering 128B/cycle from SM-private L1/Shared Memory storage.
Just as with compute, Nvidia’s massive scale pushes aside any core-for-core differences. The 188 SMs across the RTX PRO 6000 Blackwell together have more than 60 TB/s of bandwidth. High SM count gives Nvidia more total L1/local memory too. Nvidia has 24 MB of L1/Shared Memory across the RTX PRO 6000. AMD’s RX 9070 has just under 6 MB of first level data storage in its WGPs.
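Those aggregate figures follow from the per-core numbers; the sustained clocks below are assumptions for illustration rather than measured values:

```python
# Rough aggregate first-level storage and bandwidth from per-core figures.
# Sustained clocks are assumed at ~2.6 GHz for both GPUs (an assumption).

nvidia_sms, nvidia_l1_kb, nvidia_bytes_per_clk, nvidia_clk_ghz = 188, 128, 128, 2.6
amd_wgps, amd_storage_kb, amd_bytes_per_clk, amd_clk_ghz = 28, 208, 512, 2.6
# 512 B/cycle per WGP = 2 x 128 B from the L0 vector caches + 256 B from the LDS

print(f"RTX PRO 6000: {nvidia_sms * nvidia_l1_kb / 1024:.1f} MB L1/Shared Memory, "
      f"~{nvidia_sms * nvidia_bytes_per_clk * nvidia_clk_ghz / 1000:.1f} TB/s")
print(f"RX 9070:      {amd_wgps * amd_storage_kb / 1024:.1f} MB first-level storage, "
      f"~{amd_wgps * amd_bytes_per_clk * amd_clk_ghz / 1000:.1f} TB/s")
```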
SM-private storage typically offers low latency, at least in GPU terms, and that continues to be the case in Blackwell. Blackwell compares favorably to AMD in several areas, and part of that advantage comes down to address generation. I’m testing with dependent array accesses, and Nvidia can convert an array index to an address with a single IMAD.WIDE instruction.
AMD has fast 64-bit address generation through its scalar unit, but of course can only use that if the compiler determines the address calculation will be constant across a wave. If each lane needs to independently generate its own address, AMD’s vector integer units only natively operate with 32-bit data types and must do an add-with-carry to generate a 64-bit address.
Because address generation can’t be separated from cache latency in these tests, Nvidia enjoys better measured L1 vector access latency. AMD can be slightly faster if the compiler can carry out scalar optimizations.
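For context, the latency tests here use a dependent-access (pointer chasing) pattern, where each loaded value is the index of the next load. A sketch of how such a test array can be built:

```python
# Sketch of the dependent-load ("pointer chasing") pattern behind these
# latency tests: each element stores the index of the next element, so the
# next address can't be computed until the current load returns.

import random

def make_chase_array(num_elements: int) -> list[int]:
    order = list(range(num_elements))
    random.shuffle(order)               # randomized order; defeats simple access prediction
    arr = [0] * num_elements
    for i in range(num_elements):
        arr[order[i]] = order[(i + 1) % num_elements]  # one cycle covering every element
    return arr

arr = make_chase_array(1 << 20)
current = 0
for _ in range(1 << 20):
    current = arr[current]   # the GPU kernels time a loop equivalent to this line
```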
GPUs can also offload address generation to the texture unit, which can handle array address calculations. The texture unit of course can also do texture filtering, though I’m only asking it to return raw data when testing with OpenCL’s image1d_buffer_t type. AMD enjoys lower latency if the texture unit does address calculations, but Nvidia does not.
GPUs often handle atomic operations with dedicated ALUs placed close to points of coherency, like the LDS or Shared Memory for local memory, or the L2 cache for global memory. That contrasts with CPUs, which rely on locking cachelines to handle atomics and ensure ordering. Nvidia appears to have 16 INT32 atomic ALUs at each SM, compared to 32 for each AMD WGP.
In a familiar trend, Nvidia can crush AMD by virtue of having a much bigger GPU, at least with local memory atomics. Both the RTX PRO 6000 and RX 9070 have surprisingly similar atomic add throughput in global memory, suggesting Nvidia has either fewer L2 banks handling atomics or fewer atomic ALUs per bank than its scale would suggest.
RDNA4 and Blackwell have similar latency when threads exchange data through atomic compare and exchange operations, though AMD is slightly faster. The RX 9070 is a much smaller, higher clocked GPU, and both characteristics help lower latency when moving data across the chip.
Blackwell uses a conventional two-level data caching setup, but continues Ada Lovelace’s strategy of increasing L2 capacity to achieve the same goals as AMD’s Infinity Cache. L2 latency on Blackwell regresses to just over 130 ns, compared to 107 ns on Ada Lovelace. Nvidia’s L2 latency continues to sit between AMD’s L2 and Infinity Cache latencies, though now it’s noticeably closer to the latter.
Test results using Vulkan suggest the smaller RTX 5070 also has higher L2 latency (122 ns) than the RTX 4090, even though the 5070 has fewer SMs and a smaller L2. Cache latency results from Nemes’s Vulkan test suite should be broadly comparable to my OpenCL ones, because we both use a current = arr[current] access pattern. A deeper look showed minor code generation differences that seem to add ~3 ns of latency to the Vulkan results. That doesn’t change the big picture with L2 latencies. Furthermore, the difference between L1 and L2 latency should approximate the time taken to traverse the on-chip network and access the L2. Differences between OpenCL and Vulkan results are insignificant in that regard. Part of GB202’s L2 latency regression may come from its massive scale, but results from the 5070 suggest there’s more to the picture.
The RTX PRO 6000 Blackwell’s VRAM latency is manageable at 329 ns, or ~200 ns over L2 hit latency. AMD’s RDNA4 manages better VRAM latency at 254 ns for a vector access, or 229 ns through the scalar path. Curiously, Nvidia’s prior Ada Lovelace and Ampere architectures enjoyed better VRAM latency than Blackwell, and are in the same ballpark as RDNA4 and RDNA2.
Blackwell’s L2 bandwidth is approximately 8.7 TB/s, slightly more than the RX 9070’s 8.4 TB/s. Nvidia retains a huge advantage at larger test sizes, where AMD’s Infinity Cache provides less than half the bandwidth. In VRAM, Blackwell’s GDDR7 and 512-bit memory bus continue to keep it well ahead of AMD.
Nvidia’s L2 performance deserves closer attention, because it’s one area where the RX 9070 gets surprisingly close to the giant RTX PRO 6000 Blackwell. A look at GB202’s die photo shows 64 cache blocks, suggesting the L2 is split into 64 banks. If so, each bank likely delivers 64 bytes per cycle (of which the test was able to achieve 48B/cycle). It’s an increase over the 48 L2 blocks in Ada Lovelace’s largest AD102 chip. However, Nvidia's L2 continues to have a tough job serving as both the first stop for L1 misses and as a large last level cache. In other words, it’s doing the job of AMD’s L2 and Infinity Cache levels. There’s definitely merit to cutting down cache levels, because checking a level of cache can add latency and power costs. However, caches also have to make a tradeoff between capacity, performance, and power/area cost.
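Backing out per-bank throughput from the measured figure is straightforward once a bank count and clock are assumed; the sustained clock below is an assumption:

```python
# Backing out per-bank L2 throughput from the measured bandwidth, assuming
# 64 banks (from the die shot) and a sustained clock of roughly 2.8 GHz.

measured_bw_tbs = 8.7     # measured L2 bandwidth
banks           = 64      # cache blocks visible on the GB202 die photo
clock_ghz       = 2.8     # assumed sustained clock during the test

bytes_per_bank_per_cycle = measured_bw_tbs * 1e12 / (banks * clock_ghz * 1e9)
print(f"~{bytes_per_bank_per_cycle:.1f} B/cycle per bank")  # achieved, of a likely 64 B/cycle
```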
Nvidia likely relies on their flexible L1/Shared Memory arrangement to keep L2 bandwidth demands under control, and insulate SMs from L2 latency. A Blackwell SM can use its entire 128 KB L1/Shared Memory block as L1 cache if a kernel doesn’t need local memory, while an AMD WGP is stuck with two 32 KB vector caches and a 16 KB scalar cache. However a kernel bound by local memory capacity with a data footprint in the range of several megabytes would put Nvidia at a disadvantage. Watching AMD and Nvidia juggle these tradeoffs is quite fun, though it’s impossible to draw any conclusions with the two products competing in such different market segments.
FluidX3D simulates fluid behavior and can demand plenty of memory bandwidth. It carries out computations with FP32 values, but can convert them to FP16 formats for storage. Doing so reduces VRAM bandwidth and capacity requirements. Nvidia’s RTX PRO 6000 takes a hefty lead over AMD’s RX 9070, as the headline compute and memory bandwidth specifications would suggest.
Nvidia’s lead remains relatively constant regardless of what mode FluidX3D is compiled with.
We technically have more GPU competition than ever in 2025, as Intel’s GPU effort makes steady progress and introduces them as a third contender. On the datacenter side, AMD’s MI300 has proven to be very competitive with supercomputing wins. But competition is conspicuously absent at the top of the consumer segment. Intel’s Battlemage and AMD’s RDNA4 stop at the midrange segment. The RX 9070 does target higher performance levels than Intel’s Arc B580, but neither come anywhere close to Nvidia’s largest GB202 GPUs.
As for GB202, it’s yet another example of Nvidia building as big as they can to conquer the top end. The 750mm2 die pushes the limits of what can be done with a monolithic design. Its 575W or 600W power target tests the limits of what a consumer PC can support. By pushing these limits, Nvidia has created the largest consumer GPU available today. The RTX PRO 6000 incredibly comes close to AMD’s MI300X in terms of vector FP32 throughput, and is far ahead of Nvidia’s own B200 datacenter GPU. The memory subsystem is a monster as well. Perhaps Nvidia’s engineers asked whether they should emphasize caching like AMD’s RDNA2, or lean on VRAM bandwidth like they did with Ampere. Apparently, the answer is both. The same approach applies to compute, where the answer was apparently “all the SMs”.
Building such a big GPU isn’t easy, and Nvidia evidently faced their share of challenges. L2 performance is mediocre considering the massive compute throughput it may have to feed. Beyond GPU size, comparing with RDNA4 shows continued trends like AMD using a smaller number of individually stronger cores. RDNA4’s basic Workgroup Processor building block has more compute throughput and cache bandwidth than a Blackwell SM.
But none of that matters at the top end, because Nvidia shows up with over 6 times as many “cores”, twice as much last level cache capacity, and a huge VRAM bandwidth lead. Some aspects of Blackwell may not have scaled as nicely. But Nvidia’s engineers deserve praise because everyone else apparently looked at those challenges and decided they weren’t going to tackle them at all. Blackwell therefore wins the top end by default. Products like the RTX PRO 6000 are fascinating, and I expect Nvidia to keep pushing the limits of how big they can build a consumer GPU. But I also hope competition at the top end will eventually reappear in the decades and centuries to come.
If you like the content then consider heading over to the Patreon or PayPal if you want to toss a few bucks to Chips and Cheese. Also consider joining the Discord.
SASS instruction listings: https://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#blackwell-instruction-set
GB202 whitepaper: https://images.nvidia.com/aem-dam/Solutions/geforce/blackwell/nvidia-rtx-blackwell-gpu-architecture.pdf
Blackwell PRO GPU whitepaper: https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/quadro-product-literature/NVIDIA-RTX-Blackwell-PRO-GPU-Architecture-v1.0.pdf
Techpowerup observing that the RTX 5090 could reach 2.91 GHz: https://www.techpowerup.com/review/nvidia-geforce-rtx-5090-founders-edition/44.html
2025-06-21 05:13:03
Hello you fine Internet folks,
At AMD's Advancing AI 2025, I had the pleasure of interviewing Alan Smith, AMD Senior Fellow and Chief Instinct Architect, about CDNA4 found in the MI350 series of accelerators.
Hope y'all enjoy!
Transcript below has been edited for conciseness and readability.
George: Hello you fine internet folks! We're here today at AMD's Advancing AI 2025 event, where the MI350 series has just been announced. And I have the pleasure to introduce, Alan Smith from AMD.
Alan: Hello, everyone.
George: What do you do at AMD?
Alan: So I'm the chief architect of Instinct GPUs.
George: Awesome, what does that job entail?
Alan: So I'm responsible for the GPU product line definition, working with our data center GPU business partners to define the-, work with them on the definition of the requirements of the GPUs, and then work with the design teams to implement those requirements.
George: Awesome. So, moving into MI350: MI350's still GFX9 based, and- for you guys in the audience, GFX9 is also known as Vega, or at least CDNA is derived from it. Why is it that MI350 is still on GFX9, whereas client architectures such as RDNA 3 and 4 are GFX11 and 12 respectively?
Alan: Sure, yeah, it's a great question. So as you know, the CDNA architecture off of, you know, previous generations of Instinct GPUs, starting with MI100 and before, like you said, in the Vega generations, were all GCN architecture, which was Graphics Core Next. And over several generations, CDNA's been highly optimized for the types of distributed computing algorithms for high-performance computing and AI. And so we felt like starting with that base for MI350 would give us the right pieces that we needed to deliver the performance goals that we had for the MI350 series.
George: And with GCN, as you know, there's a separate L1 cache and LDS, or Local Data Store. Why is that still in MI350? And why haven't they been merged?
Alan: Yeah, so like you said, you know, it is a legacy piece of the GCN architecture. It's sort of fundamental to the way the compute unit is built. So we felt like in this generation, it wasn't the right opportunity to make a microarchitectural change of that scale. So what we did instead, was we increased the capacity of the LDS. So previously in MI300 series, we had a 64 kilobyte LDS, and we've increased that capacity to 160 kilobytes in MI350 series. And in addition to that, we increased the bandwidth as well. So we doubled the bandwidth of the LDS into the register file, in order to be able to feed the Tensor Core rates that we have in the MI350 series.
George: And speaking of Tensor Cores, you've now introduced microscaling formats to MI350x for FP8, FP6, and FP4 data types. Interestingly enough, a major differentiator for MI350 is that FP6 is the same rate as FP4. Can you talk a little bit about how that was accomplished and why that is?
Alan: Sure, yep, so one of the things that we felt like on MI350 in this timeframe, that it's going into the market and the current state of AI... we felt like that FP6 is a format that has potential to not only be used for inferencing, but potentially for training. And so we wanted to make sure that the capabilities for FP6 were class-leading relative to... what others maybe would have been implementing, or have implemented. And so, as you know, it's a long lead time to design hardware, so we were thinking about this years ago and wanted to make sure that MI350 had leadership in FP6 performance. So we made a decision to implement the FP6 data path at the same throughput as the FP4 data path. Of course, we had to take on a little bit more hardware in order to do that. FP6 has a few more bits, obviously, that's why it's called FP6. But we were able to do that within the area of constraints that we had in the matrix engine, and do that in a very power- and area-efficient way.
George: And speaking of data types, I've noticed that TF32 is not on your ops list for hardware level acceleration. Why remove that feature from... or why was that not a major consideration for MI350?
Alan: Yeah, well, it was a consideration, right? Because we did remove it. We felt like that in this timeframe, that brain float 16, or BF16, would be a format that would be leverageable for most models to replace TF32. And we can deliver a much higher throughput on BF16 than TF32, so we felt like it was the right trade off for this implementation.
George: And if I was to use TF32, what would the speed be? Would it still be FP32, the speed of FP32?
Alan: You have a choice. We offer some emulation, and I don't have all the details on the exact throughputs off the top of my head; but we do offer emulation, software-based emulation using BF16 to emulate TF32, or you can just cast it into FP32 and use it at FP32 rate.
George: And moving from the CU up to the XCD, which is the compute die; the new compute die's now on N3P, and yet there's been a reduction from 40 CUs to 36 CUs physically on the die, with four (one per shader engine) fused off. Why 32 CUs now, and why that reduction?
Alan: Yeah, so on MI300, we had co-designed for both MI300X and MI300A, one for HPC and one for AI. In the MI300A, we have just six XCDs. And so, we wanted to make sure when we only had six of the accelerator chiplets that we had enough compute units to power the level of HPC or high-performance computing - which is traditional simulation in FP64 - to reach the performance levels that we wanted to hit for the leadership-class supercomputers that we were targeting with that market.
And so we did that and delivered the fastest supercomputer in the world along with Lawrence Livermore, with El Capitan. But so as a consideration there, we wanted to have more compute units per XCD so that we could get 224 total within MI300A. On 350, where it's designed specifically as an accelerator only, a discrete accelerator, we had more flexibility there. And so we decided that having a power of two number of active compute units per die - so 36 physical, like you said, but we enable 32. Four of them, one per shader engine, are used for harvesting and we yield those out in order to give us good high-volume manufacturing through TSMC-N3, which is a leading edge technology. So we have some of the spare ones that allow us to end up with 32 actually enabled.
And that's a nice power of two, and it's easy to tile tensors if you have a power of two. So most of the tensors that you're working with, or many of them, would be matrices that are based on a power of two. And so it allows you to tile them into the number of compute units easily, and reduces the total tail effect that you may have. Because if you have a non-power of two number of compute units, then some amount of the tensor may not map directly nicely, and so you may have some amount of work that you have to do at the end on just a subset of the compute unit. So we find that there's some optimization there by having a power of two.
George: While the new compute unit is on N3P, the I/O die is on N6; why stick with N6?
Alan: Yeah, great question. What we see in our chiplet technology, first of all, we have the choice, right? So being on chiplets gives you the flexibility to choose different technologies if appropriate. And the things that we have in the I/O die tend not to scale as well with advanced technologies. So things like the HBM PHYs, the high-speed SERDES, the caches that we have with the Infinity Cache, the SRAMs, those things don't scale as well. And so sticking with an older technology with a mature yield on a big die allows us to deliver a product cost and a TCO (Total Cost of Ownership) value proposition for our clients. And then we're able to leverage the most advanced technologies like N3P for the compute where we get a significant benefit in the power- and area-scaling to implement the compute units.
George: And speaking of, other than the LDS, what's interesting to me is that there have not been any cache hierarchy changes. Why is that?
Alan: Yeah, great question. So if you remember what I just said about MI300 being built to deliver the highest performance in HPC. And in order to do that, we needed to deliver significant global bandwidth into the compute units for double precision floating point. So we had already designed the Infinity Fabric and the fabric within the XCC or the Accelerated Compute Core to deliver sufficient bandwidth to feed the really high double precision matrix operations in MI300 and all the cache hierarchy associated with that. So we were able to leverage that amount of interconnecting capabilities that we had already built into MI300 and therefore didn't need to make any modifications to those.
George: And with MI350, you've now moved from four base dies to two base dies. What has that enabled in terms of the layout of your top dies?
Alan: Yeah, so what we did, as you mentioned, so in MI350, the I/O dies, there's only two of them. And then each of them host four of the accelerator chiplets versus in MI300, we had four of the I/O dies, with each of them hosting two of the accelerator chiplets. So that's what you're talking about.
So what we did was, we wanted to increase the bandwidth from global, from HBM, which, MI300 was designed for HBM3 and MI350 was specially designed for HBM3E. So we wanted to go from 5.2 or 5.6 gigabit per second up to a full 8 gigabit per second. But we also wanted to do that at the lowest possible power, because delivering the bytes from HBM into the compute cores at the lowest energy per bit gives us more power at a fixed GPU power level, gives us more power into the compute at that same time. So on bandwidth-bound kernels that have a compute element, by reducing the amount of power that we spend in data transport, we can put more power into the compute and deliver a higher performance for those kernels.
So what we did by combining those two chips together into one was we were able to widen up the buses within those chips; so we deliver more bytes per clock, and therefore we can run them at a lower frequency and also a lower voltage, which gives us the V-squared scaling of voltage for the amount of power that it takes to deliver those bits. So that's why we did that.
George: And speaking of power, MI350x is 1000 watts, and MI355x is 1400 watts. What are the different thermal considerations when considering that 40% uptick in power, not just in terms of cooling the system, but also keeping the individual chiplets within tolerances?
Alan: Great question, and obviously we have some things to consider for our 3D architectures as well.
So when we do our total power and thermal architecture of these chips, we consider from the motherboard all the way up to the daughterboards, which are the UBB (Universal Baseboard), the OAM (OCP Accelerator Module) modules in this case, and then up through the stack of CoWoS (Chip on Wafer on Substrate), the I/O dies, which are in this intermediate layer, and then the compute that's above those. So we look at the total thermal density of that whole stack, and the amount of thermal transport or thermal resistance that we have within that stack, and the thermal interface materials that we need in order to build on top of that for heat removal, right?
And so we offer two different classes of thermal solutions for MI350 series. One of them air-cooled, like you mentioned. The other one is a direct-attach liquid cool. So in the liquid-cooled case, the cold plate would directly attach to the thermal interface material on top of the chips. So we do thermal modeling of that entire stack, and work directly with all of our technology partners to make sure that the power densities that we build into the chips can be handled by that entire thermal stack up.
George: Awesome, and since we're running short on time, the most important question of this interview is, what's your favorite type of cheese?
Alan: Oh, cheddar.
George: Massively agree with you. What's your favorite brand of cheddar?
Alan: I like the Vermont one. What is that, oh... Calbert's? I can't think of it. [Editor's note: It's Cabot Cheddar that is Alan's favorite]
George: I know my personal favorite's probably Tillamook, which is, yeah, from Oregon. But anyway, thank you so much, Alan, for this interview.
If you would like to support the channel, hit like, hit subscribe. And if you like interviews like this, tell us in the comments below. Also, there will be a transcript on the Chips and Cheese website. If you want to directly monetarily support Chips and Cheese, there's Patreon, as well as Stripe through Substack, and PayPal. So, thank you so much for that interview, Alan.
Alan: Thank you, my pleasure.
George: Have a good one, folks!