
AMD’s EPYC 9355P: Inside a 32 Core Zen 5 Server Chip

2025-10-01 02:27:47

High core count chips are headline grabbers. But maxing out the metaphorical core count slider isn’t the only way to go. Server players like Intel, AMD, and Arm aim for scalable designs that cover a range of core counts. Not all applications can take advantage of the highest core count models in their lineups, and per-core performance still matters.

From AMD’s Turin whitepaper

AMD’s EPYC 9355P is a 32 core part. But rather than just being a lower core count part, the 9355P pulls levers to let each core count for more. First, it clocks up to 4.4 GHz. AMD has faster clocking chips in its server lineup, but 4.4 GHz is still a good bit higher than the 3.7 or 4.1 GHz that 128 or 192 core Zen 5 SKUs reach. Then, AMD uses eight CPU dies (CCDs) to house those 32 cores. Each CCD only has four cores enabled out of the eight physically present, but still has its full 32 MB of L3 cache usable. That provides a high cache capacity to core count ratio. Finally, each CCD connects to the IO die using a “GMI-Wide” setup, giving each CCD 64B/cycle of bandwidth to the rest of the system in both the read and write directions. GMI here stands for Global Memory Interconnect. Zen 5’s server IO die has 16 GMI links to support up to 16 CCDs for high core count parts, plus some xGMI (external) links to allow a dual socket setup. GMI-Wide uses two links per CCD, fully utilizing the IO die’s GMI links even though the EPYC 9355P only has eight CCDs.

Acknowledgments

Dell has kindly provided a PowerEdge R6715 for testing, and it came equipped with the aforementioned AMD EPYC 9355P along with 768 GB of DDR5-5200. The 12 memory controllers on the IO die provide a 768-bit memory bus, so the setup provides just under 500 GB/s of theoretical bandwidth. Besides providing a look into how a lower core count SKU behaves, we have BMC access which provides an opportunity to investigate different NUMA setups.
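
As a sanity check on that figure (assuming all 12 channels are populated and run at DDR5-5200):

    768 bits / 8 = 96 bytes per transfer
    96 B × 5.2 GT/s = 499.2 GB/s of theoretical bandwidth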

The provided Dell PowerEdge R6715 system racked and ready for testing. Credit for the image goes to Zach from ZeroOne.

We’d also like to thank Zach and the rest of the fine folks at ZeroOne Technology for hosting the Dell PowerEdge R6715 at no cost to us.

Memory Subsystem and NUMA Characteristics

NPS1 mode stripes memory accesses across all 12 of the chip’s memory controllers, presenting software with a monolithic view of memory at the cost of latency. DRAM latency in that mode is slightly better than what Intel’s Xeon 6 achieves in SNC3 mode. SNC3 on Intel divides the chip into three NUMA nodes that correspond to its compute dies. The EPYC 9355P has good memory latency in that respect, but it falls behind compared to the Ryzen 9 9900X with DDR5-5600. Interconnects that tie more agents together tend to have higher latency. On AMD’s server platform, the Infinity Fabric network within the IO die has to connect up to 16 CCDs with 12 memory controllers and other IO, so the higher latency isn’t surprising.

Cache performance is similar across AMD’s desktop and server Zen 5 implementations, with the server variant only losing because of lower clock speeds. That’s not a surprise because AMD reuses the same CCDs on desktop and server products. But it does create a contrast to Intel’s approach, where client and server memory subsystems differ starting at L3. Intel trades L3 latency for capacity and the ability to share a logical L3 instance across more cores.

Different NUMA configurations can subdivide EPYC 9355P, associating cores with the closest memory controllers to improve latency. NPS2 divides the chip into two hemispheres, and has 16 cores form a NUMA node with the six memory controllers on one half of the die. NPS4 divides the chip into quadrants, each with two CCDs and three memory controllers. Finally, the chip can present each CCD as a NUMA node. Doing so makes it easier to pin threads to cores that share a L3 cache, but doesn’t affect how memory is interleaved across channels. Memory addresses are still assigned to memory controllers according to the selected NPS1/2/4 scheme, which is a separate setting.
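
For anyone who wants to poke at these modes themselves, here’s a minimal sketch of NUMA-aware allocation with libnuma. The node number and buffer size are placeholders, and this isn’t the harness used for the measurements in this article.

    #include <numa.h>      /* link with -lnuma */
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        if (numa_available() < 0) {
            fprintf(stderr, "libnuma not available\n");
            return 1;
        }
        size_t len = 1ULL << 30;     /* 1 GB test buffer (placeholder size) */
        int node = 0;                /* placeholder NUMA node */

        /* Run this thread on the chosen node's cores, and back the buffer
           with memory owned by that node's memory controllers. */
        numa_run_on_node(node);
        char *buf = numa_alloc_onnode(len, node);
        if (!buf) return 1;
        memset(buf, 1, len);         /* touch pages so they actually get allocated */

        /* ... run latency/bandwidth kernels over buf here ... */

        numa_free(buf, len);
        return 0;
    }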

NPS2 and NPS4 only provide marginal latency improvements, and latency remains much higher than in a desktop platform. At the same time, crossing NUMA boundaries comes with little penalty. Apparently requests can traverse the huge IO die quite quickly, adding 20-30 ns at worst. I’m not sure what the underlying Infinity Fabric topology looks like, but the worst case unloaded cross-node latencies were under 140 ns. On Xeon 6, latency starts higher and can climb over 180 ns when cores on one compute die access memory controllers on the compute die at the other end of the chip.

EPYC 9355P can get close to theoretical memory bandwidth in any of the three NUMA modes, as long as code keeps accesses within each node. NPS2 and NPS4 offer slightly better bandwidth, at the cost of requiring code to be NUMA aware. I tried to cause congestion on Infinity Fabric by having cores on each NUMA node access memory on another. That does lower achieved bandwidth, but not by a huge amount.

By C0M1, I mean cores on node 0 accessing memory on node 1

An individual NPS4 node achieves 117.33 GB/s to its local memory pool, and just over 107 GB/s to the memory on the other three nodes. The bandwidth penalty is minor, but a bigger potential pitfall is lower bandwidth to each NUMA node’s memory pool. Two CCDs can draw more bandwidth than the three memory controllers they’re associated with. Manually distributing memory accesses across NUMA nodes can improve bandwidth for a workload contained within one NUMA node’s cores. But doing so in practice may be an intricate exercise.

In general, EPYC 9355P has very mild NUMA characteristics and little penalty associated with running the chip in NPS1 or NPS2 mode. I imagine just using NPS1 mode would work well enough in the vast majority of cases, with little performance to be gained from carrying out NUMA optimizations.

Looking into GMI-Wide

GMI-Wide is AMD’s attempt to address bandwidth pinch points between CCDs and the rest of the system. With GMI-Wide, a single CCD can achieve 99.8 GB/s of read bandwidth, significantly more than the 62.5 GB/s from a Ryzen 9 9900X CCD with GMI-Narrow. GMI-Wide also allows better latency control under high bandwidth load. The Ryzen 9 9900X suffers from a corner case where a single core pulling maximum bandwidth can saturate the GMI-Narrow link and starve out another latency sensitive thread. That sends latency to nearly 500 ns, as observed by a latency test thread sharing a CCD with a thread linearly traversing an array. Having more threads generate bandwidth load seems to make QoS mechanisms kick in, which slightly reduces achieved bandwidth but brings latency back under control.
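
Here’s a rough sketch of how that corner case can be reproduced: one thread does a dependent pointer chase while a second thread on the same CCD streams reads. Core numbers, buffer sizes, and iteration counts are placeholders, and this is a simplified stand-in for the actual test harness rather than a copy of it.

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    #define CHASE_ELEMS (1UL << 24)  /* 128 MB of 8-byte entries: far bigger than a 32 MB L3 */
    #define BW_ELEMS    (1UL << 27)  /* 1 GB streaming buffer for the bandwidth thread */

    static void pin_to_cpu(int cpu) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    /* Bandwidth thread: linear reads to keep the CCD's link to the IO die busy. */
    static void *bandwidth_thread(void *arg) {
        pin_to_cpu(1);                        /* placeholder: second core on the same CCD */
        volatile uint64_t *buf = arg;
        uint64_t sum = 0;
        for (;;)
            for (size_t i = 0; i < BW_ELEMS; i++)
                sum += buf[i];
        return NULL;                          /* never reached */
    }

    int main(void) {
        uint64_t *chase = malloc(CHASE_ELEMS * sizeof(uint64_t));
        uint64_t *bwbuf = malloc(BW_ELEMS * sizeof(uint64_t));
        if (!chase || !bwbuf) return 1;
        memset(bwbuf, 1, BW_ELEMS * sizeof(uint64_t));   /* fault in the streaming buffer */

        /* Build a random pointer-chasing cycle (Sattolo shuffle) so every load
           depends on the previous one. A production test would also pad each
           entry out to a full cacheline. */
        for (size_t i = 0; i < CHASE_ELEMS; i++) chase[i] = i;
        srand(1);
        for (size_t i = CHASE_ELEMS - 1; i > 0; i--) {
            size_t j = rand() % i;
            uint64_t tmp = chase[i]; chase[i] = chase[j]; chase[j] = tmp;
        }

        pthread_t bw;
        pthread_create(&bw, NULL, bandwidth_thread, bwbuf);

        pin_to_cpu(0);                        /* placeholder: first core on the same CCD */
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        uint64_t idx = 0;
        const long iters = 20 * 1000 * 1000L;
        for (long i = 0; i < iters; i++) idx = chase[idx];
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("loaded latency: %.1f ns per access (checksum %lu)\n",
               ns / iters, (unsigned long)idx);
        return 0;
    }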

I previously wrote about loaded memory latency on the Ryzen 9 9950X when testing the system remotely, and thought it controlled latency well under high bandwidth load. But back then, George (Cheese) set that system up with very fast DDR5-8000 along with a higher 2.2 GHz FCLK. A single core was likely unable to monopolize off-CCD bandwidth in that setup, avoiding the corner case seen on my system. GMI-Wide increases off-CCD bandwidth by a much larger extent and has a similar effect. Under increasing bandwidth load, GMI-Wide can both achieve more total bandwidth and control latency better than its desktop single-link counterpart.

A read-modify-write pattern gets maximum bandwidth from GMI-Wide by exercising both the read and write paths. It doesn’t scale perfectly, but it’s a substantial improvement over using only reads or writes. A Ryzen 9 9900X CCD can theoretically get 48B/cycle to the IO die with a 2:1 read-to-write ratio. I tried modifying every other cacheline to achieve this ratio, but didn’t get better bandwidth, probably because the memory controller is limited by a 32B/cycle link to Infinity Fabric. However, mixing in writes does get rid of the single bandwidth thread corner case, possibly because a single thread doesn’t saturate the 32B/cycle read link when mixing reads and writes.
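
The access patterns above boil down to very simple loops. Here’s a sketch of the three variants, assuming 64-byte cachelines; it’s illustrative rather than the exact kernels used for these measurements.

    #include <stddef.h>
    #include <stdint.h>

    #define LINE 64   /* cacheline size in bytes */

    /* Read-only: exercises just the read direction of the CCD-to-IOD link. */
    uint64_t read_only(const uint64_t *buf, size_t elems) {
        uint64_t sum = 0;
        for (size_t i = 0; i < elems; i++)
            sum += buf[i];
        return sum;
    }

    /* Read-modify-write: every line is read, dirtied, and eventually written
       back, exercising both the read and write directions. */
    void read_modify_write(uint64_t *buf, size_t elems) {
        for (size_t i = 0; i < elems; i++)
            buf[i] += 1;
    }

    /* Modify every other cacheline: all lines are read, but only half are
       written back dirty, giving roughly a 2:1 read-to-write ratio. */
    uint64_t modify_alternate_lines(uint64_t *buf, size_t elems) {
        uint64_t sum = 0;
        size_t per_line = LINE / sizeof(uint64_t);   /* 8 elements per 64B line */
        for (size_t i = 0; i < elems; i++) {
            if ((i / per_line) % 2 == 0)
                buf[i] += 1;      /* even lines: read + dirty */
            else
                sum += buf[i];    /* odd lines: read only */
        }
        return sum;
    }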

AMD’s slide shown for Zen 4 Desktop. Zen 5 Desktop uses the same IO die, and thus has the same Infinity Fabric setup

On the desktop platform, latency under high load gets worse, possibly because writes contend with reads at the DRAM controller. The DDR data bus is shared between reads and writes, and must waste cycles on “turnarounds” to switch between read and write mode. Bandwidth isn’t affected, probably because the Infinity Fabric bottleneck leaves spare cycles at the memory controller, which can absorb those turnarounds. However, reads from the latency test thread may be delayed while the memory controller drains writes before switching the bus back to read mode.

On the EPYC 9355P in NPS1 mode, bandwidth demands from a single GMI-Wide CCD leave plenty of spare cycles across the 12 memory controllers, so there’s little latency or bandwidth penalty when mixing reads and writes. The same isn’t true in NPS4 mode, where a GMI-Wide link can outmatch a NPS4 node’s three memory controllers. Everything’s fine with just reads, which actually benefit, possibly because of lower latency from not having to traverse as much of the IO die. But with a read-modify-write pattern, bandwidth drops from 134 GB/s in NPS1 mode to 96.6 GB/s with NPS4. Latency gets worse too, rising to 248 ns. Again, NPS4 is something to be careful with, particularly if applications might require high bandwidth from a small subset of cores.

SPEC CPU2017

From a single thread perspective, the EPYC 9355P falls some distance behind the Ryzen 9 9900X. Desktop CPUs are designed around single threaded performance, so that’s to be expected. But with boost turned off on the desktop CPU to match clock speeds, performance is surprisingly close. Higher memory latency still hurts the EPYC 9355P, but it’s within striking distance.

NUMA modes barely make any difference. NPS4 technically wins, but by an insignificant margin. The latency advantage was barely measurable anyway. Compared to the more density optimized Graviton 4 and Xeon 6 6985P-C, the EPYC 9355P delivers noticeably better single threaded performance.

CCD-level bandwidth pinch points are worth a look too, since that’s traditionally been a distinguishing factor between AMD’s EPYC and more logically monolithic designs. Here, I’m filling a quad core CCD by running SPEC CPU2017’s rate tests with eight copies. I did the same on the Ryzen 9 9900X, pinning the eight copies to four cores and leaving the CCD’s other two cores unused. I bound the test to a single NUMA node on all tested setups.

SPEC’s floating point suite starts to tell a different story now. Several tests within the floating point suite are bandwidth hungry even from a single core. 549.fotonik3d for example pulled 28.23 GB/s from Meteor Lake’s memory controller when I first went through SPEC CPU2017’s tests. Running eight copies in parallel would multiply memory bandwidth demands, and that’s where server memory subsystems shine.

In 549.fotonik3d, high bandwidth demands make the Ryzen 9 9900X’s unloaded latency advantage irrelevant. The 9900X even loses to Redwood Cove cores on Xeon 6. The EPYC 9355P does very well in this test against both the 9900X and Xeon 6. Intel’s interconnect strategy tries to keep the chip logically monolithic and doesn’t have pinch points at cluster boundaries. But each core on Xeon 6 can only get to ~33 GB/s of DRAM bandwidth at best, using an even mix of reads and writes. AMD’s GMI-Wide can more than match that, and Intel’s advantage doesn’t show through in this scenario. However, Intel does have a potential advantage against more density focused AMD SKUs where eight cores sit in front of a narrower link.

NPS4 is also detrimental to the EPYC 9355P’s performance in this test. It only provides a minimal latency benefit at the cost of lower per-node bandwidth. The bandwidth part seems to hurt here, and taking the extra latency of striping accesses across 6 or 12 memory controllers gives a notable performance improvement.

Final Words

Core count isn’t the last word in server design. A lot of scenarios are better served by lower core count parts. Applications might not scale to fill a high core count chip. Bandwidth bound workloads might not benefit from adding cores. Traditionally lower core count server chips just traded core counts for higher clock speeds. Today, chips like the EPYC 9355P do a bit more, using both wider CCD-to-IOD links and more cache to maximize per-core performance.

Looking at EPYC 9355P’s NUMA characteristics reveals very consistent memory performance across NUMA modes. Intel’s Xeon 6 may be more monolithic from a caching point of view, but AMD’s DRAM access performance feels more monolithic than Intel’s. AMD made a tradeoff back in the Zen 2 days where they took lower local memory latency in exchange for more even memory performance across the socket. Measured latencies on EPYC 9355P are a bit higher than figures on the Zen 2 slide above. DDR5 is higher latency, and the Infinity Fabric topology is probably more complex these days to handle more CCDs and memory channels.

From the AMD EPYC 9005 Processor Architecture Overview

But the big picture remains. AMD’s Turin platform handles well in NPS1 mode, and cross-node penalties are low in NPS2/NPS4 modes. Those characteristics likely carry over across the Zen 5 EPYC SKU stack. It’s quite different from Intel’s Xeon 6 platform, which places memory controllers on compute dies like Zen 1 did. For now, AMD’s approach seems to be better at the DRAM level. Intel’s theoretical latency advantage in SNC3 mode doesn’t show through, and AMD gets to reap the benefits of a hub-and-spoke model while not getting hit where it should have downsides.

AMD seems to have found a good formula back in the Zen 2 days, and they’re content with reinforcing success. Intel is furiously iterating to find a setup that preserves a single level, logically monolithic interconnect while scaling well across a range of core counts. And of course, there are the Arm chips, which generally lean towards a single level monolithic interconnect too. It’ll be interesting to watch what all of these players do going forward as they continue to iterate and refine their designs.

A big thanks to Zach from ZeroOne Technology for taking pictures of the system in the rack!

And again, we’d like to thank both Dell and ZeroOne for, respectively, providing and hosting this PowerEdge R6715; without them this article wouldn’t have been possible. If you like the content then consider heading over to the Patreon or PayPal if you want to toss a few bucks to Chips and Cheese. Also consider joining the Discord.

A Look into Intel Xeon 6’s Memory Subsystem

2025-09-27 00:43:45

Intel’s server dominance has been shaken by high core count competition from the likes of AMD and Arm. Xeon 6, Intel’s latest server platform, aims to address this with a more scalable chiplet strategy. Chiplets are now arranged side by side, with IO chiplets on either side of the chip. It’s reminiscent of the arrangement Huawei used for Kunpeng 920, but apparently without the uniform chiplet height restrictions of Huawei’s design. Intel also scales up to three compute dies, as opposed to a maximum of two on Kunpeng 920.

Compared to Sapphire Rapids and Emerald Rapids, Xeon 6 uses a more aggressive chiplet strategy. Lower speed IO and accelerators get moved to separate IO dies. Compute dies contain just cores and DRAM controllers, and use the advanced Intel 3 process. The largest Xeon 6 SKUs incorporate three compute dies and two IO dies, scaling up to 128 cores per socket. It’s a huge core count increase from the two prior generations.

AWS has Xeon 6 instances generally available with their r8i virtual machine type, providing an opportunity to check out Intel’s latest chiplet design from a software performance perspective. This will be a short look. Renting a large cloud instance for more detailed testing is prohibitively expensive.

System Overview

AWS’s r8i instance uses the Xeon 6 6985P-C. This SKU is not listed on Intel’s site or other documentation, so a brief overview of the chip is in order. The Xeon 6 6985P-C has 96 Redwood Cove cores that clock up to 3.9 GHz, and each have 2 MB of L2 cache. Redwood Cove is a tweaked version of Intel’s Golden Cove/Raptor Cove, and has previously featured on Intel’s Meteor Lake client platform. Redwood Cove brings a larger 64 KB L1 instruction cache among other improvements discussed in another article. Unlike their Meteor Lake counterparts, Xeon 6’s Redwood Cove cores enjoy AVX-512 support with 2x 512-bit FMA units, and 2x512-bit load + 1x512-bit store throughput to the L1 data cache. AMX support is present as well, providing specialized matrix multiplication instructions to accelerate machine learning applications.
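
As an illustration of what feeding those FMA units looks like, here’s a minimal AVX-512 kernel using compiler intrinsics. It assumes a compiler with AVX-512F support (e.g. built with -mavx512f) and isn’t tied to this specific chip.

    #include <immintrin.h>
    #include <stddef.h>

    /* y[i] += a * x[i], 16 FP32 lanes per 512-bit vector. Two independent
       accumulators per iteration help keep both FMA pipes busy. */
    void saxpy_avx512(float a, const float *x, float *y, size_t n) {
        __m512 va = _mm512_set1_ps(a);
        size_t i = 0;
        for (; i + 32 <= n; i += 32) {
            __m512 y0 = _mm512_loadu_ps(y + i);
            __m512 y1 = _mm512_loadu_ps(y + i + 16);
            y0 = _mm512_fmadd_ps(va, _mm512_loadu_ps(x + i), y0);
            y1 = _mm512_fmadd_ps(va, _mm512_loadu_ps(x + i + 16), y1);
            _mm512_storeu_ps(y + i, y0);
            _mm512_storeu_ps(y + i + 16, y1);
        }
        for (; i < n; i++)      /* scalar tail */
            y[i] += a * x[i];
    }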

Xeon 6 uses a mesh interconnect like prior Intel server chips. Cores share a mesh stop with a CHA (Caching/Home Agent), which incorporates a L3 cache slice and a snoop filter. The Xeon 6 6985P-C has 120 CHA instances running at 2.2 GHz, providing 480 MB of total L3 across the chip. CHA count interestingly doesn’t match enabled core count. Intel may be able to harvest cores without disabling the associated cache slice, as they had to do on some older generations.

The mesh runs across die boundaries to keep things logically monolithic. Modular Data Fabric (MDF) mesh stops sit at die edges where dies meet with their neighbors. They handle the logical layer of the mesh protocol, much like how IFOP blocks on AMD chips encode the Infinity Fabric protocol for transport between dies. The physical signals on Xeon 6 run over Intel’s EMIB technology, which uses an embedded silicon bridge between dies. The Xeon 6 6985P-C has 80 MDF stops running at 2.5 GHz. Intel hasn’t published documents detailing Xeon 6’s mesh layout. One possibility is that 10 MDF stops sit at each side of a die boundary.

Intel places memory controllers at the short edge of the compute dies. The largest Xeon 6 SKUs have 12 memory controllers, or four per compute die. The Xeon 6 6985P-C falls into that category. AWS equipped the Xeon 6 6985P-C with 1.5 TB of Micron DDR5-7200 per socket. I couldn’t find the part number (MTC40F2047S1RC72BF1001 25FF) anywhere, but it’s certainly very fast DDR5 for a server platform. AWS has further configured the chip in SNC3 mode. SNC stands for sub-NUMA clustering, and divides the chip into three NUMA nodes. Doing so partitions the physical address space into three portions, each backed by the DRAM controllers and L3 cache on their respective compute dies. That maintains affinity between the cores, cache, and memory controllers on each die, reducing latency as long as cores access the physical memory backed by memory controllers on the same die. Xeon 6 also supports a unified mode where all of the attached DRAM is exposed in one NUMA node, with accesses striped across all of the chip’s memory controllers and L3 slices. However, SNC3 mode is what AWS chose, and is also Intel’s default mode.

Cache and Memory Latency

Xeon 6’s Redwood Cove cores have the same 5 cycle L1D and 16 cycle L2 latencies as their Meteor Lake counterparts, though the server part’s lower clock speeds mean higher actual latencies. Memory subsystem characteristics diverge at the L3 cache, which has a latency of just over 33 ns (~130 cycles). In exchange for higher latency, each core is able to access a huge 160 MB pool of L3 cache on its tile.
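
For reference, that cycle figure follows directly from the 3.9 GHz core clock:

    33.25 ns × 3.9 cycles per ns ≈ 130 core cycles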

L3 latency regresses slightly compared to Emerald Rapids. It’s concerning because the Emerald Rapids Xeon Platinum 8559C chips in AWS’s i7i instances do not use SNC. Thus each core in Emerald Rapids sees the full 320 MB of L3 as a logically monolithic cache, and L3 accesses are distributed across both chiplets. The same applies to DRAM accesses. Emerald Rapids achieves lower latency to its DDR5-5600 Hynix memory (HMCGM4MGBRB240N).

Compared to Sapphire Rapids, Xeon 6’s L3 is a straightforward upgrade with higher capacity at similar latency. DRAM latency is still a bit worse on Xeon 6 though. Note: I originally mentioned MCRDIMM support here. From the DIMM part number above, AWS isn’t using MCRDIMMs so they’re not relevant to results in this article.

Next to AMD’s Zen 5 server platform, codenamed Turin, Xeon 6 continues to use larger but slower caches. Intel’s large L2 helps mitigate L3 latency, letting Intel use a larger L3 shared across more cores. AMD’s L3 is higher performance, but is also smaller and only shared across eight cores. That translates to less efficient SRAM capacity usage across the chip, because data may be duplicated in different L3 cache instances. For low threaded workloads, AMD may also be at a disadvantage because a single core can only allocate into its local 32 MB L3 block, even though the chip may have hundreds of megabytes of total L3. Even if AMD extends each CCD’s L3 with 3D stacking (VCache), 96 MB is still less than Intel’s 160 MB.

Memory latency on AMD’s platform is 125.6 ns, higher than Emerald Rapids but better than Xeon 6. For this test, the AMD system was set up in NPS1 mode, which distributes accesses across all of the chip’s 12 memory controllers. It’s not the most latency optimized mode, and it’s concerning that AMD achieves better memory latency without having to contain memory accesses in smaller divisions.

Bandwidth

Xeon 6’s core-private caches provide high bandwidth with AVX-512 loads. A single core has about 30 GB/s of L3 bandwidth, just under what Emerald Rapids could achieve. A read-modify-write pattern nearly doubles L3 bandwidth, much like it does on AMD’s Zen line. On Zen, that’s due to implementation details in the L3’s victim cache operation. A L3 access request can come with a L2 victim, and thus perform both a read and a write on the L3. That doubles achievable bandwidth with the same request rate. Intel may have adopted a similar scheme, though figuring that out would require a more thorough investigation. Or, Intel may have independent queues for pending L3 lookups and writebacks. Zooming back up, Zen 5 achieves high L3 bandwidth at the cost of smaller L3 instances shared across fewer cores. It’s the same story as with latency.

DRAM bandwidth from a single Xeon 6 core comes in at just below 20 GB/s. It’s lower than on Zen 5, but also less of a concern thanks to massive L3 capacity. As with L3 bandwidth, a read-modify-write pattern dramatically increases DRAM bandwidth achievable from a single core.

At the chip level, Xeon 6’s high core count and fast L1/L2 caches can provide massive bandwidth for code with good memory access locality. Total cache read bandwidth is a huge step above Emerald Rapids and smaller Zen 5 SKUs that prioritize per-core performance over throughput. Xeon 6’s L3 bandwidth improves to align with its core count increase, though L3 bandwidth in total remains behind AMD.

With each thread reading from memory backed by its local NUMA node, I got 691.62 GB/s of DRAM bandwidth. It’s a step ahead of Emerald Rapids’s 323.45 GB/s. Intel has really scaled up their server platform. Xeon 6’s 12 memory controllers and faster memory give it a huge leg up over Emerald Rapids’s eight memory controllers. AMD also uses 12 memory controllers, but the EPYC 9355P I had access to runs DDR5-5200, and achieved 478.98 GB/s at the largest test size using a unified memory configuration (single NUMA node per socket).

NUMA/Chiplet Characteristics

Intel uses consistent hashing to route physical addresses to L3 cache slices (CHAs), which then orchestrate coherent memory access. Xeon 6 appears to carry out this consistent hashing by distributing each NUMA node’s address space across L3 slices on the associated die, as opposed to covering the entire DRAM-backed physical address space with the CHAs on each die. Thus accesses to a remote NUMA node are only cached by the remote die’s L3. Accessing the L3 on an adjacent die increases latency by about 24 ns. Crossing two die boundaries adds a similar penalty, increasing latency to nearly 80 ns for a L3 hit.
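
Intel doesn’t document the actual hash function, but the general shape of routing addresses to slices can be shown with a toy example. The XOR-fold below is purely illustrative, and the 40-slices-per-die figure is an assumption from dividing the 120 CHAs evenly across three compute dies.

    #include <stdint.h>

    /* Toy address-to-slice hash: XOR-fold the cacheline address and reduce it
       modulo the number of CHA slices on a die. Real hardware uses a carefully
       chosen hash to spread traffic evenly; this only shows the shape. */
    #define SLICES_PER_DIE 40   /* assumption: 120 CHAs / 3 compute dies */

    unsigned address_to_slice(uint64_t paddr) {
        uint64_t line = paddr >> 6;             /* drop the 64B offset bits */
        uint64_t h = line;
        h ^= h >> 12;                           /* fold high bits into low ones */
        h ^= h >> 24;
        return (unsigned)(h % SLICES_PER_DIE);  /* pick one of the die's L3 slices */
    }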

Similar penalties apply to remote DRAM accesses, though that’s to be expected for any NUMA setup. Getting data from memory controllers across one die boundary takes 157.44 ns, for about 26 ns extra over hitting a local one. Crossing two die boundaries adds another 25 ns, bringing latency to 181.54 ns.

Under load, latency gradually increases until it reaches a plateau, before seeing a sharper rise as the test approaches DRAM bandwidth limits. It’s not a perfect hockey stick pattern like on some other chips. Latency remains reasonably well controlled, at just under 300 ns for a within-NUMA-node access under high bandwidth load. Crossing tile boundaries increases latency, but leaves bandwidth largely intact. Repeating the exercise with 96 MB used for bandwidth test threads and 16 MB for the latency test gives an idea of what cross-die L3 bandwidth is like. It’s very high. Cross-die bandwidth nearly reaches 500 GB/s for a read-only pattern.

I was able to exceed 800 GB/s of cross-die bandwidth by respecting Intel’s conga line topology, for lack of a better term. Specifically, I had cores from node 0 accessing L3 on node 1, and cores on node 1 accessing L3 on node 2. I used a read-modify-write pattern as well to maximize L3 bandwidth. It’s a rather contrived test, but it underscores the massive cross-die bandwidth present in Intel’s Xeon 6 platform. A single Xeon 6 compute die has more off-die bandwidth than an AMD CCD, even when considering GMI-Wide AMD setups that use two Infinity Fabric links between each CCD and the central IO die. However, each AMD L3 instance covers the physical address space across all NUMA nodes, so the situations aren’t directly comparable.

Coherency Latency

The memory subsystem may have to orchestrate transfers between peer caches, which I test by bouncing a value between two cores. That “core to core latency” shows similar characteristics to prior Xeon generations. Transfers within a compute die land in the 50-80 ns range, with only a slight increase when crossing die boundaries. NUMA characteristics above apply here too. I ran this test with memory allocated out of the first NUMA node.
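
A simplified version of that kind of test pins two threads and bounces a value through a shared cacheline with C11 atomics. Core numbers are placeholders, and this is a sketch rather than the exact code behind the results here.

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <time.h>

    #define ITERS 1000000L
    static _Alignas(64) _Atomic long shared;   /* one cacheline bounced between cores */

    static void pin(int cpu) {
        cpu_set_t s; CPU_ZERO(&s); CPU_SET(cpu, &s);
        pthread_setaffinity_np(pthread_self(), sizeof(s), &s);
    }

    /* Responder: waits for each odd value and answers with the next even one. */
    static void *responder(void *arg) {
        pin(*(int *)arg);
        for (long v = 1; v < 2 * ITERS; v += 2) {
            while (atomic_load(&shared) != v) ;   /* spin until the other core writes v */
            atomic_store(&shared, v + 1);
        }
        return NULL;
    }

    int main(void) {
        int cpu_a = 0, cpu_b = 1;                 /* placeholder core numbers */
        pthread_t t;
        pthread_create(&t, NULL, responder, &cpu_b);
        pin(cpu_a);

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long v = 0; v < 2 * ITERS; v += 2) { /* initiator writes odd, waits for even */
            atomic_store(&shared, v + 1);
            while (atomic_load(&shared) != v + 2) ;
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        pthread_join(t, NULL);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("round trip: %.1f ns\n", ns / ITERS);
        return 0;
    }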

A pair of cores on NUMA node 2 would need a CHA on node 0 to orchestrate the transfer, which is why nodes farther away from the first compute die show higher latency in this test. The worst case of 100-120 ns with cores going through a CHA on the farthest NUMA node is still better than the 150-180 ns on AMD server platforms when crossing cluster (CCX) boundaries. In both cases, the round trip goes across two die boundaries. But Intel’s logically monolithic topology and advanced packaging technologies likely give it a lead.

Single Core Performance: SPEC CPU2017

SPEC CPU2017 is an industry standard benchmark, and I’ve been running the rate suite with a single copy to summarize single-thread performance. This 96 core Xeon 6 part falls behind lower core count chips that are better optimized for per-core performance. Its score in the integer suite is closely matched to AWS’s Graviton 4, which similarly implements 96 cores per socket. Xeon 6 enjoys a slight 8.4% lead over Graviton 4 in the floating point suite.

Final Words

Intel’s latest server chiplet strategy is superficially similar to AMD’s. Both use compute and IO dies. Both strive for a scalable and modular design that can scale across a range of core counts. But Intel and AMD have very different system level design philosophies, driving huge architectural differences under the hood. Intel wants to present a logically monolithic chip, and wants their mesh to accommodate an ever increasing number of agents. It’s not an easy task, because it demands very high performance from cross-die transfers.

Slide shown with Emerald Rapids, highlighting Intel’s goal of maintaining a logically monolithic design via a large mesh interconnect spanning die boundaries. Xeon 6 has similar goals

AMD in contrast doesn’t try to make a big system feel monolithic, and instead goes with two interconnect levels. A fast intra-cluster network handles high speed L3 traffic and presents the cluster as one client to Infinity Fabric. Infinity Fabric only handles DRAM and slower IO traffic, and thus faces a less demanding job than Intel’s mesh. Infinity Fabric does cross die boundaries, but gets by with on-package traces thanks to its lower performance requirements.

Intel’s approach has advantages, even when subdivided using SNC3 mode. These include a more unified and higher capacity L3 cache, better core to core latency, and no core cluster boundaries with bandwidth pinch points. And theoretically, placing memory controllers on compute dies allows SNC3 mode to deliver lower DRAM latency by avoiding cross-die hops.

But every design comes with tradeoffs. The DRAM latency advantage doesn’t materialize against AMD’s Turin. L3 performance isn’t great. Latency is high, and a Redwood Cove core has less L3 bandwidth than a Zen 5 core does from DRAM. I have to wonder how well the L3 would perform in a Xeon 6 chip configured to run as a single NUMA node too. L3 accesses would start with a baseline latency much higher than 33 ns. Getting hard numbers would be impossible without access to a Xeon 6 chip in such a configuration, but having 2/3 of accesses cross a die boundary would likely send latency to just under 50 ns. That is, (2 * 57.63 + 33.25) / 3 = 49.5 ns. Scalability is another question too. Extending the conga line with additional compute dies may erode Intel’s advantage of crossing fewer die boundaries on average than AMD. The best way to keep scaling within such a topology would likely be increasing core counts within each die, which could make it harder to scale to smaller configurations using the same dies.

I wonder if keeping a logically monolithic design is worth the tradeoffs. I’m not sure if anyone has built a mesh this big, and other large meshes have run into latency issues even with techniques like having core pairs share a mesh stop. Nvidia’s Grace for example uses a large mesh and sees L3 latency reach 38.2 ns. Subdividing the chip mitigates latency issues but starts to erode the advantages of a logically monolithic setup. But of course, no one can answer the question of whether the tradeoffs are worthwhile except the engineers at Intel. Those engineers have done a lot of impressive things with Xeon 6. Going forward, Intel has a newer Lion Cove core that’s meant to compete more directly with Zen 5. Redwood Cove was more contemporary with Zen 4. I look forward to seeing what Intel does at the system level with their next server design. Those engineers are no doubt busy cooking it up.

AMD's Inaugural Tech Day ft. ROCm 7, Modular, and AMD Lab Tour

2025-09-23 04:07:01

Hello you fine Internet folks,

I was invited to AMD's Austin Headquarters for AMD's inaugural Tech Day, where AMD announced ROCm 7 and Modular showed off their results with MI355X, all topped off by a tour of AMD's Labs.

Hope y'all enjoy!

If you like the content then consider heading over to the Patreon or PayPal if you want to toss a few bucks to Chips and Cheese. Also consider joining the Discord.

Everactive’s Self-Powered SoC at Hot Chips 2025

2025-09-18 01:25:23

Hot Chips 2025 was filled with presentations on high power devices. AI can demand hundreds of watts per chip, requiring sophisticated power delivery and equally complex cooling. It’s a far cry from consumer devices that must make do with standard wall outlets, or batteries for mobile devices. But if working within a wall outlet or battery’s constraints is hard, it’s nothing compared to working with, well, nothing. That’s Everactive’s focus, and the subject of their presentation at Hot Chips 2025.

Everactive makes SoCs that power themselves off energy harvested from the environment, letting them operate without a steady power source. Self-powered SoCs are well suited for large IoT deployments. Wiring power to devices or replacing batteries complicates scaling device count, especially when those IoT devices are sensors in difficult to reach places. Generating energy off the environment is also attractive from a sustainability standpoint. Reliability can benefit too, because the device won’t suffer from electrical grid failures.

But relying on energy harvesting brings its own challenges. Harvested energy is often in the milliwatt or microwatt range, depending on energy source. Power levels fluctuate depending on environmental conditions too. All that means the SoC has to run at very low power while taking measures to maximize harvested power and survive through suboptimal conditions.

SoC Overview

Everactive’s SoC is designated PKS3000 and occupies 6.7 mm² on a 55 nm ULP process node. It’s designed around collecting and transmitting data within the tight constraints of self-powered operation. The SoC can run at 5 MHz with 12 microwatt power draw, and has a 2.19 microwatt power floor. It interfaces with a variety of sensors to gather data, and has a power-optimized Wifi/Bluetooth/5G radio for data transmission. An expansion port can connect to an external microcontroller and storage, which can also be powered off harvested energy.

On the processing side, the SoC includes an Arm Cortex M0+ microcontroller. Cortex M0+ is one of Arm's lowest power cores, and comes with a simple two-stage pipeline. Its memory subsystem is similarly bare-bones, with no built-in caches and a memory port shared between instruction and data accesses. The chip includes 128 KB of SRAM and 256 KB of flash. Thus the SoC has less storage and less system memory than the original IBM PC, but can clock higher, which goes to show just how far silicon manufacturing has come in nearly 50 years.

EH-PMU: Orchestrating Power Delivery

Energy harvesting is central to the chip’s goals, and is controlled by the Energy Harvesting Power Management Unit (EH-PMU). The EH-PMU uses a multiple input, single inductor, multiple output (MISIMO) topology, letting it simultaneously harvest energy from two energy sources and output power on four rails. For energy harvesting sources, Everactive gives light and temperature differences as examples because they’re often available. Light can be harvested from a photovoltaic cell (PV), and a thermoelectric generator (TEG) can do so from temperature differences. Depending on expected environmental conditions, the SoC can be set up with other energy sources like radio emissions, mechanical vibrations, or airflow.

Maximum power point tracking (MPPT) helps the EH-PMU improve energy harvesting efficiency by getting energy from each harvesting source at its optimal voltage. Because harvested energy can often be unstable despite using two sources and optimization techniques, the PKS3000 can store energy in a pair of capacitors. A supercapacitor provides deep energy storage, and aims to let the chip keep going under adverse conditions. A smaller capacitor charges faster, and can quickly collect power to speed up cold starts after brownouts. Everactive’s SoC can cold-start using the PV/TEG combo at 60 lux and 8 °C, which is about equivalent to indoor lighting in a chilly room.
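
Everactive didn’t go into their MPPT implementation, but the classic perturb-and-observe approach conveys the idea: nudge the operating voltage, check whether harvested power went up, and keep moving in whichever direction helps. The read/set functions below are hypothetical stand-ins for ADC reads and converter control, not Everactive’s firmware.

    /* Hypothetical hardware hooks: in real firmware these would read the
       harvester's voltage/current ADCs and program the converter setpoint. */
    extern float read_harvester_voltage(void);
    extern float read_harvester_current(void);
    extern void  set_operating_voltage(float v);

    /* Perturb-and-observe MPPT: one step per call, run periodically. */
    void mppt_step(void) {
        static float v_set = 0.5f;       /* current setpoint in volts (placeholder) */
        static float step  = 0.01f;      /* perturbation size and direction */
        static float last_power = 0.0f;

        float power = read_harvester_voltage() * read_harvester_current();
        if (power < last_power)
            step = -step;                /* power dropped: reverse direction */
        v_set += step;                   /* otherwise keep perturbing the same way */
        set_operating_voltage(v_set);
        last_power = power;
    }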

The EH-PMU powers four output rails, named 1p8, 1p2, 0p9, and adj, which can be fed off a harvesting source under good conditions. Under bad conditions, the EH-PMU can feed them off stored energy in the capacitors. A set of pulse counters monitor energy flow throughout the chip. Collected energy statistics are fed to the Energy Aware Subsystem.

Energy Aware Subsystem

The Energy Aware Subsystem (EAS) monitors energy harvesting, storage, and consumption to make power management decisions. It plays a role analogous to Intel’s Power Control Unit or AMD’s System Management Unit (SMU). Like Intel’s PCU or AMD’s SMU, the EAS manages frequency and voltage scaling with a set of different policies. Firmware can hook into the EAS to set maximum frequencies and power management policies, just as an OS would do on higher power platforms. Energy statistics can be used to decide when to enable components or perform OTA updates.

The EAS also controls a load switch that can be used to cut off external components. Everactive found that some external components have a high power floor. Shutting them off prevents them from monopolizing the chip’s very limited power budget. Components can be notified and allowed to perform a graceful shutdown. But the EAS also has the power to shut them off in a non-cooperative fashion. That gives the EAS more power than Intel’s PCU or AMD’s SMU, which can’t independently decide that some components should be shut off. But the EAS needs those powers because it has heavier responsibilities and constraints. For Everactive’s SoC, poor power management isn’t a question of slightly higher power bills or shorter battery life. Instead, it could cause frequent brownouts that prevent the SoC from providing timely sensor data updates. That’s highly non-ideal for industrial monitoring applications, where missed data could prevent operators from noticing a problem.

The EAS supports a Wait-for-Wakeup mode, which lets the device enter a low power mode where it can still respond very quickly to activity from sensors or the environment. Generally, Everactive’s SoC is geared towards trying to idle as often as possible.

Wake-Up Radio

Idle optimizations extend to the radio. Communication is a challenge for self powered SoCs because radios can have a high power floor even if all they’re doing is staying connected to a network. Disconnecting is an option, but it’s a poor one because reconnecting introduces delays, and some access points don’t reliably handle frequent reconnects. Everactive attacks this problem with a Wake-Up Radio (WRX), which uses an always-on receiver capable of receiving a subset of messages at very low power. Compared to a conventional radio that uses duty cycling, wakeup receivers seek to reduce power wasted on idle monitoring while achieving lower latency.

Everactive’s WRX shares an antenna with an external communication transceiver, which the customer picks with their network design in mind. A RF switch lets the WRX share the antenna, and on-board matching networks pick the frequency. The passive path can operate from 300 MHz to 3 GHz, letting it support a wide range of standards with the appropriate on-board matching networks. A broadband signal is input to the wakeup receiver on the passive path, which does energy detection. Then, the WRX can do baseband gain or intermediate frequency (IF) gain depending on the protocol.

WRX power varies with configuration. Using the passive path without RF gain provides a very low baseline power of under a microwatt while providing respectable -63 dBm sensitivity for a sub-GHz application. That mode provides about 200m range, making it useful for industrial monitoring environments. Longer range applications (over 1000m) require more power, because RF boost has to get involved to increase sensitivity. To prevent high active power from becoming a problem, Everactive uses multi-stage wakeup and very fine grained duty cycling, letting the radio sample at times when it might get a bit of a wakeup message. With those techniques, the chip is able to achieve -92 dBm sensitivity while keeping power below 6 microwatts on average.
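
Everactive didn’t publish firmware details, but the multi-stage, duty-cycled scheme described above might look roughly like the sketch below. Every function here is a hypothetical placeholder rather than part of Everactive’s actual software.

    #include <stdbool.h>

    /* Hypothetical low-level hooks, standing in for the real radio and timer drivers. */
    extern bool passive_path_detects_energy(void);  /* cheap always-on energy detection */
    extern bool decode_wakeup_preamble(void);       /* brief, higher power sampling window */
    extern bool decode_full_wakeup_message(void);   /* full RF-gain decode, most expensive */
    extern void sleep_until_next_sample_slot(void);
    extern void wake_main_radio(void);

    /* Multi-stage wakeup: each stage costs more power, so it only runs if the
       cheaper stage before it saw something plausible. The chip sleeps between slots. */
    void wakeup_radio_loop(void) {
        for (;;) {
            sleep_until_next_sample_slot();        /* fine grained duty cycling */
            if (!passive_path_detects_energy())
                continue;                          /* stage 1: nothing on the air */
            if (!decode_wakeup_preamble())
                continue;                          /* stage 2: energy, but not a wakeup pattern */
            if (decode_full_wakeup_message())      /* stage 3: confirm the full message */
                wake_main_radio();
        }
    }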

For comparison, Intel’s Wi-Fi 6 AX201 can achieve similar sensitivities, but idles at 1.6 mW in Core Power Down. Power goes up to 3.4 mW when associated with a 2.4 GHz access point. That sort of power draw is still very low in an absolute sense. But Everactive’s WRX setup pushes even lower, which is both impressive and reiterates the challenges associated with self powered operation. Everactive for their part sees standards moving in the direction of using wake-up radios. Wake-up receivers have been researched for decades, and continue to improve over time. There’s undeniably a benefit to idle power, and it’ll be interesting to see if WRX-es get more widely adopted.

Final Words

Everactive’s PKS3000 is a showcase of extreme power saving measures. The goal of improving power efficiency is thrown around quite a lot. Datacenter GPUs strive for more FLOPS/watt. Laptop chips get power optimizations with each generation. But self powered SoCs really take things to the next level because their power budget is so limited. Many of Everactive’s optimizations shave off power on a microwatt scale, an order of magnitude below the milliwatts that mobile devices care about. PKS3000 can idle at 2.19 microwatts with some limitations, or under 4 microwatts with the wakeup receiver always on. Even under load, Everactive’s SoC draws orders of magnitude less power than the chips PC enthusiasts are familiar with, let alone the AI-oriented monster chips capable of drawing a kilowatt or more.

The PKS3000 improves over Everactive’s own PKS2001 SoC as well, reducing power while running at higher clocks and achieving better radio sensitivity. Pushing idle power down from 30 to 2.19 microwatts is impressive. Decreasing active power from 89.1 to 12 microwatts is commendable too. PKS3000 does move to a more advanced 55nm ULP node, compared to the 65nm node used by PKS2001. But a lot of improvements no doubt come from architectural techniques too.

And it’s important to not lose sight of the big picture. Neither SoC uses a cutting edge FinFET node, yet they’re able to accomplish their task with miniscule power requirements. Self powered SoCs have limitations of course, and it’s easy to see why Everactive focuses on industrial monitoring applications. But I do wonder if low power SoCs could cover a broader range of use cases as the technology develops, or if more funding lets them use modern process nodes. Everactive’s presentation was juxtaposed alongside talks on high power AI setups. Google talked about managing power swings on the megawatt scale with massive AI training deployments. Meta discussed how they increased per-rack power to 93.5 kW. Power consumption raises sustainability concerns with the current AI boom, and battery-less self-powered SoCs are so sweet from a sustainability perspective. I would love to see energy harvesting SoCs take on more tasks.

If you like the content then consider heading over to the Patreon or PayPal if you want to toss a few bucks to Chips and Cheese. Also consider joining the Discord.

References

  1. Cortex M0+ from Arm: https://developer.arm.com/Processors/Cortex-M0-Plus

  2. Jesús ARGOTE-AGUILAR's thesis, "Powering Low-Power Wake-up Radios with RF Energy Harvesting": https://theses.hal.science/tel-05100987v1/file/ARGOTE_Jesus.pdf - gives an example of a wakeup receiver with 0.2 microwatt power consumption and -53 dBm sensitivity, and other examples with power going into the nanowatt range when no data is received.

  3. R. van Langevelde et al, "An Ultra-Low-Power 868/915 MHz RF Transceiver for Wireless Sensor Network Applications" at https://picture.iczhiku.com/resource/ieee/shkGpPyIRjUikxcX.pdf - ~1.2 microwatt sleep, 2.4 milliwatts receive, 2.7 milliwatts transmit, -89 dBm sensitivity

  4. Nathan M. Pletcher et al, "A 52 μW Wake-Up Receiver With -72 dBm Sensitivity Using an Uncertain-IF Architecture" - https://people.eecs.berkeley.edu/~pister/290Q/Papers/Radios/Pletcher%20WakeUp%20jssc09.pdf - old WRX from 2009

  5. Jonathan K. Brown et al, "A 65nm Energy-Harvesting ULP SoC with 256kB Cortex-M0 Enabling an 89.1µW Continuous Machine Health Monitoring Wireless Self-Powered System" - paper on Everactive's older SoC

  6. Intel® Wi-Fi 6 AX201 (Harrison Peak 2) and Wi-Fi 6 AX101 (Harrison Peak 1) External Product Specifications (EPS)

AMD’s RDNA4 GPU Architecture at Hot Chips 2025

2025-09-14 05:01:39

RDNA4 is AMD’s latest graphics-focused architecture, and fills out their RX 9000 line of discrete GPUs. AMD noted that creating a good gaming GPU requires understanding both current workloads, as well as taking into account what workloads might look like five years in the future. Thus AMD has been trying to improve efficiency across rasterization, compute, and raytracing. Machine learning has gained importance including in games, so AMD’s new GPU architecture caters to ML workloads as well.

From AMD’s perspective, RDNA4 represents a large efficiency leap in raytracing and machine learning, while also improving on the rasterization front. Improved compression helps keep the graphics architecture fed. Outside of the GPU’s core graphics acceleration responsibility, RDNA4 brings improved media and display capabilities to round out the package.

Media Engine

The Media Engine provides hardware accelerated video encode and decode for a wide range of codecs. High end RDNA4 parts like the RX 9070XT have two media engines. RDNA4’s media engines feature faster decoding speed, helping save power during video playback by racing to idle. For video encoding, AMD targeted better quality in H.264, H.265, and AV1, especially in low latency encoding.

Low latency encoder modes are mostly beneficial for streaming, where delays caused by the media engine ultimately translate to a delayed stream. Reducing latency can make quality optimizations more challenging. Video codecs strive to encode differences between frames to economize storage. Buffering up more frames gives the encoder more opportunities to look for similar content across frames, and lets it allocate more bitrate budget for difficult sequences. But buffering up frames introduces latency. Another challenge is some popular streaming platforms mainly use H.264, an older codec that’s less efficient than AV1. Newer codecs are being tested, so the situation may start to change as the next few decades fly by. But for now, H.264 remains important due to its wide support.

Testing with an old gameplay clip from Elder Scrolls Online shows a clear advantage for RDNA4’s media engine when testing with the latency-constrained VBR mode and encoder tuned for low latency encoding (-usage lowlatency -rc vbr_latency). Netflix’s VMAF video quality metric gives higher scores for RDNA4 throughout the bitrate range. Closer inspection generally agrees with the VMAF metric.

RDNA4 does a better job preserving high contrast outlines. Differences are especially visible around text, which RDNA4 handles better than its predecessor while using a lower bitrate. Neither result looks great with such a close look, with blurred text on both examples and fine detail crushed in video encoding artifacts. But it’s worth remembering that the latency-constrained VBR mode uses a VBV buffer of up to three frames, while higher latency modes can use VBV buffer sizes covering multiple seconds of video. Encoding speed has improved slightly as well, jumping from ~190 to ~200 FPS from RDNA3.5 to RDNA4.

Display Engine

The display engine fetches on-screen frame data from memory, composites it into a final image, and drives it to the display outputs. It’s a basic task that most people take for granted, but the display engine is also a good place to perform various image enhancements. A traditional example is using a lookup table to apply color correction. Enhancements at the display engine are invisible to user software, and are typically carried out in hardware with minimal power cost. On RDNA4, AMD added a “Radeon Image Sharpening” filter, letting the display engine sharpen the final image. Using dedicated hardware at the display engine instead of the GPU’s programmable shaders means that the sharpening filter won’t impact performance and can be carried out with better power efficiency. And, AMD doesn’t need to rely on game developers to implement the effect. Sharpening can even apply to the desktop, though I’m not sure why anyone would want that.
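
AMD didn’t describe the filter’s internals, but a generic 3x3 sharpening kernel gives an idea of the kind of math such fixed-function hardware performs. This is an illustration, not AMD’s actual Radeon Image Sharpening algorithm.

    #include <stdint.h>

    /* Generic 3x3 sharpen over a grayscale image: center weighted up, the four
       neighbors subtracted. Border pixels are left untouched for brevity. */
    void sharpen_3x3(const uint8_t *src, uint8_t *dst, int w, int h) {
        for (int y = 1; y < h - 1; y++) {
            for (int x = 1; x < w - 1; x++) {
                int sum = 5 * src[y * w + x]
                        - src[(y - 1) * w + x] - src[(y + 1) * w + x]
                        - src[y * w + (x - 1)] - src[y * w + (x + 1)];
                if (sum < 0) sum = 0;
                if (sum > 255) sum = 255;
                dst[y * w + x] = (uint8_t)sum;
            }
        }
    }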

Power consumption is another important optimization area for display engines. Traditionally that’s been more of a concern for mobile products, where maximizing battery life under low load is a top priority. But RDNA4 has taken aim at multi-monitor idle power with its newer display engine. AMD’s presentation stated that they took advantage of variable refresh rates on FreeSync displays. They didn’t go into more detail, but it’s easy to imagine what AMD might be doing. High resolution and high refresh rate displays translate to high pixel rates. That in turn drives higher memory bandwidth demands. Dynamically lowering refresh rates could let RDNA4’s memory subsystem enter a low power state while still meeting refresh deadlines.

Power and GDDR6 data rates for various refresh rate combinations. AMD’s monitoring software (and others) read out extremely low memory clocks when the memory bus is able to idle, so those readings aren’t listed.

I have a RX 9070 hooked up to a Viotek GN24CW 1080P display via HDMI, and a MSI MAG271QX 1440P capable of refresh rates up to 360 Hz. The latter is connected via DisplayPort. The RX 9070 manages to keep memory at idle clocks even at high refresh rate settings. Moving the mouse causes the card to ramp up memory clocks and consume more power, hinting that RDNA4 is lowering refresh rates when screen contents don’t change. Additionally, RDNA4 gets an intermediate GDDR6 power state that lets it handle the 1080P 60 Hz + 1440P 240 Hz combination without going to maximum memory clocks. On RDNA2, it’s more of an all or nothing situation. The older card is more prone to ramping up memory clocks to handle high pixel rates, and power consumption remains high even when screen contents don’t change.

Compute Changes

RDNA4’s Workgroup Processor retains the same high level layout as prior RDNA generations. However, it gets major improvements targeted towards raytracing, like improved raytracing units and wider BVH nodes, a dynamic register allocation mode, and a scheduler that no longer suffers false memory dependencies between waves. I covered those in previous articles. Besides those improvements, AMD’s presentation went over a couple other details worth discussing.

Scalar Floating Point Instructions

AMD has a long history of using a scalar unit to offload operations that are constant across a wave. Scalar offload saves power by avoiding redundant computation, and frees up the vector unit to increase performance in compute-bound sequences. RDNA4’s scalar unit gains a few floating point instructions, expanding scalar offload opportunities. This capability debuted on RDNA3.5, but RDNA4 brings it to discrete GPUs.

While not discussed in AMD’s presentation, scalar offload can bring additional performance benefits because scalar instructions sometimes have lower latency than their vector counterparts. Most basic vector instructions on RDNA4 have 5 cycle latency. FP32 adds and multiplies on the scalar unit have 4 cycle latency. The biggest latency benefits still come from offloading integer operations though.

Split Barriers

GPUs use barriers to synchronize threads and enforce memory ordering. For example, a s_barrier instruction on older AMD GPUs would cause a thread to wait until all of its peers in the workgroup also reached the s_barrier instruction. Barriers degrade performance because any thread that happened to reach the barrier faster would have to stall until its peers catch up.

RDNA4 splits the barrier into separate “signal” and “wait” actions. Instead of s_barrier, RDNA4 has s_barrier_signal and s_barrier_wait. A thread can “signal” the barrier once it produces data that other threads might need. It can then do independent work, and only wait on the barrier once it needs to use data produced by other threads. The s_barrier_wait will then stall the thread until all other threads in the workgroup have signalled the barrier.
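
In rough pseudocode form, with s_barrier_signal()/s_barrier_wait() as hypothetical intrinsic wrappers around the new instructions and the other functions standing in for arbitrary shader work, the pattern looks like this:

    /* Hypothetical placeholders standing in for real shader work and for the
       RDNA4 s_barrier_signal / s_barrier_wait instructions. */
    extern void write_shared_data(void);
    extern void do_independent_work(void);
    extern void use_shared_data(void);
    extern void s_barrier_signal(void);
    extern void s_barrier_wait(void);

    void split_barrier_example(void) {
        /* On older GPUs a single s_barrier would sit between producing and
           consuming, stalling every thread until the whole workgroup arrived. */
        write_shared_data();      /* produce data other threads will read      */
        s_barrier_signal();       /* tell peers this thread's part is ready    */
        do_independent_work();    /* overlap work that needs no shared data    */
        s_barrier_wait();         /* block only until all peers have signalled */
        use_shared_data();        /* now safe to read what peers produced      */
    }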

Memory Subsystem

The largest RDNA4 variants have an 8 MB L2 cache, representing a substantial L2 capacity increase compared to prior RDNA generations. RDNA3 and RDNA2 maxed out at 6 MB and 4 MB L2 capacities, respectively. AMD found that difficult workloads like raytracing benefit from the larger L2. Raytracing involves pointer chasing during BVH traversal, and it’s not surprising that it’s more sensitive to accesses getting serviced from the slower Infinity Cache as opposed to L2. In the initial scene in 3DMark’s DXR feature test, run in Explorer Mode, RDNA4 dramatically cuts down the amount of data that has to be fetched from beyond L2.

RDNA2 still does a good job of keeping data in L2 in absolute terms. But it’s worth noting that hitting Infinity Cache on both platforms adds more than 50 ns of extra latency over a L2 hit. That’s well north of 100 cycles because both RDNA2 and RDNA4 run above 2 GHz. While AMD’s graphics strategy has shifted towards making the faster caches bigger, it still contrasts with Nvidia’s strategy of putting way more eggs in the L2 basket. Blackwell’s L2 cache serves the functions of both AMD’s L2 and Infinity Cache, and has latency between those two cache levels. Nvidia also has a flexible L1/shared memory allocation scheme that can give them more low latency caching capacity in front of L2, depending on a workload’s requested local storage (shared memory) capacity.

A mid-level L1 cache was a familiar fixture on prior RDNA generations. It’s conspicuously missing from RDNA4, as well as AMD’s presentation. One possibility is that L1 cache hitrate wasn’t high enough to justify the complexity of an extra cache level. Perhaps AMD felt its area and transistor budget was better allocated towards increasing L2 capacity. To support this theory, L1 hitrate on RDNA1 was often below 50%. At the same time, the RDNA series always enjoyed a high bandwidth and low latency L2. Putting more pressure on L2 in exchange for reducing L2 misses may have been an enticing tradeoff. Another possibility is that AMD ran into validation issues with the L1 cache and decided to skip it for this generation. There’s no way to verify either possibility of course, but I think the former reasons make more sense.

Beyond tweaking the cache hierarchy, RDNA4 brings improvements to transparent compression. AMD emphasized that they’re using compression throughout the SoC, including at points like the display engine and media engine. Compressed data can be stored in caches, and decompressed before being written back to memory. Compression cuts down on data transfer, which reduces bandwidth requirements and improves power efficiency.

Transparent compression is not a new feature. It has a long history of being one tool in the GPU toolbox for reducing memory bandwidth usage, and it would be difficult to find any modern GPU without compression features of some sort. Even compression in other blocks like the display engine has precedent. Intel’s display engines for example use Framebuffer Compression (FBC), which can write a compressed copy of frame data and keep fetching the compressed copy to reduce data transfer power usage as long as the data doesn’t change. Prior RDNA generations had compression features too, and AMD’s documentation summarizes some compression targets. While AMD didn’t talk about compression efficiency, I tried to take similar frame captures using RGP on both RDNA1 and RDNA4 to see if there’s a large difference in memory access per frame. It didn’t quite work out the way I expected, but I’ll put them here anyway and discuss why evaluating compression efficacy is challenging.

The first challenge is that both architectures satisfy most memory requests from L0 or L1. AMD slides on RDNA1 suggest the L0 and L1 only hold decompressed data, at least for delta color compression. Compression does apply to L2. For RDNA4, AMD’s slides indicate it applies to the Infinity Cache too. However, focusing on data transfer to and from the L2 wouldn’t work due to the large cache hierarchy differences between those RDNA generations.

DCC, or delta color compression, is not the only form of compression. But this slide shows one example of compression/decompression happening in front of L2

Another issue is that it’s easy to imagine a compression scheme that doesn’t change the number of cache requests involved. For example, data might be compressed to only take up part of a cacheline. A request only causes a subset of the cacheline to be read out, which a decompressor module expands to the full 128B. Older RDNA1 slides are ambiguous about this, indicating that DCC operates on 256B granularity (two cachelines) without providing further details.
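
As a toy illustration of that idea, and not AMD’s actual DCC format: keep one base value plus small deltas, and fall back to raw storage when the deltas don’t fit. Compressed data then occupies only a fraction of the line.

    #include <stdbool.h>
    #include <stdint.h>

    /* Toy delta compression for 16 32-bit pixels (64 bytes): keep the first
       pixel as a base and store the rest as signed 8-bit deltas. If every delta
       fits, 64 bytes shrink to 4 + 15 = 19 bytes; otherwise the line stays raw. */
    bool compress_line(const uint32_t pixels[16], uint32_t *base, int8_t deltas[15]) {
        *base = pixels[0];
        for (int i = 1; i < 16; i++) {
            int64_t d = (int64_t)pixels[i] - (int64_t)pixels[0];
            if (d < -128 || d > 127)
                return false;                 /* delta too large: store the line raw */
            deltas[i - 1] = (int8_t)d;
        }
        return true;
    }

    void decompress_line(uint32_t base, const int8_t deltas[15], uint32_t pixels[16]) {
        pixels[0] = base;
        for (int i = 1; i < 16; i++)
            pixels[i] = (uint32_t)((int64_t)base + deltas[i - 1]);
    }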

In any case, compression may be a contributing factor in RDNA4 being able to achieve better performance while using a smaller Infinity Cache than prior generations, despite only having a 256-bit GDDR6 DRAM setup.

SoC Features

AMD went over RAS, or reliability, availability, and serviceability features in RDNA4. Modern chips use parity and ECC to detect and correct errors, and evidently RDNA4 does the same. Unrecoverable errors are handled with driver intervention, by “re-initializing the relevant portion of the SoC, thus preventing the platform from shutting down”. There are two ways to interpret that statement. One is that the GPU can be re-initialized to recover from hardware errors, obviously affecting any software relying on GPU acceleration. Another is that some parts of the GPU can be re-initialized while the GPU continues handling work. I think the former is more likely, though I can imagine the latter being possible in limited forms too. For example, an unrecoverable error reading from GDDR6 could hypothetically be fixed if that data is backed by a duplicate in system memory. The driver could transfer known-good data from the host to replace the corrupted copy. But errors affecting modified data would be difficult to recover from, because there might not be an up-to-date copy elsewhere in the system.

On the security front, microprocessors get private buses to “critical blocks” and protected register access mechanisms. Security here targets HDCP and other DRM features, which I don’t find particularly amusing. But the terminology shown on the slide is interesting, because MP0 and MP1 are also covered in AMD’s CPU-side documentation. On the CPU side, MP0 (microprocessor 0) handles some Secure Encrypted Virtualization (SEV) features, and is sometimes called the Platform Security Processor (PSP). MP1 on CPUs is called the System Management Unit (SMU), which covers power control functions. Curiously, AMD’s slide labels MP1 and the SMU separately on RDNA4. MP0 and MP1 could have completely different functions on GPUs of course, but the common terminology raises the possibility that a lot of work is shared between CPU and GPU SoC design. RAS is also a very traditional CPU feature, though GPUs have gained RAS features over time as GPU compute picked up steam.

Infinity Fabric

One of the most obvious examples of shared effort between the CPU and GPU sides is Infinity Fabric making its way into graphics designs. That started years ago with Vega, though back then Infinity Fabric was more of an implementation detail. Since then, Infinity Fabric components have provided an elegant way to implement a large last level cache, or multi-socket coherent systems with gigantic iGPUs (like MI300A).

Slide from Hot Chips 29, covering Infinity Fabric used in AMD’s older Vega GPU

The Infinity Fabric memory-side subsystem on RDNA4 consists of 16 CS (Coherent Station) blocks, each paired with a Unified Memory Controller (UMC). Coherent Stations receive requests coming off the graphics L2 and other clients. They ensure coherent memory access by either getting data from a UMC, or by sending a probe if another block has a more up-to-date copy of the requested cacheline. The CS is a logical place to implement a memory side cache, and each CS instance has 4 MB of cache in RDNA4, so the 16 stations together account for the chip’s 64 MB of Infinity Cache.

To save power, Infinity Fabric supports DVFS (dynamic voltage and frequency scaling), clocking between 1.5 and 2.5 GHz. Infinity Fabric bandwidth is 1024 bytes per clock, which suggests the Infinity Cache can provide roughly 2.5 TB/s of theoretical bandwidth at the top clock. That roughly lines up with results from Nemes’s Vulkan-based GPU cache and memory bandwidth microbenchmark.
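As a quick sanity check, the quoted figures multiply out as follows, assuming the 1024 bytes per clock applies at the 2.5 GHz top fabric clock:

```python
# Back-of-the-envelope Infinity Fabric bandwidth check.
# Assumes the 1024 B/clk figure applies at the top 2.5 GHz fabric clock.
bytes_per_clock = 1024
fabric_clock_hz = 2.5e9
peak_bw = bytes_per_clock * fabric_clock_hz
print(f"{peak_bw / 1e12:.2f} TB/s")   # ~2.56 TB/s theoretical
```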

AMD also went over their ability to disable various SoC components to harvest dies and create different SKUs. Shader Engines, WGPs, and memory controller channels can be disabled. AMD and other manufacturers have used similar harvesting capabilities in the past. I’m not sure what’s new here. Likely, AMD wants to re-emphasize their harvesting options.

Finally, AMD mentioned that they chose a monolithic design for RDNA4 because it made sense for a graphics engine of its size. They looked at performance goals, package assembly and turnaround time, and cost, and decided a monolithic design was the right option. It’s not a surprise. After all, AMD used monolithic designs for lower end RDNA3 products with smaller graphics engines, and only used chiplets for the largest SKUs. Rather, it’s a reminder that there’s no one-size-fits-all solution. Whether a monolithic or chiplet-based design makes more sense depends heavily on design goals.

Final Words

RDNA4 brings a lot of exciting improvements to the table, while breaking away from any attempt to tackle the top end performance segment. Rather than going for maximum performance, RDNA4 looks optimized to improve efficiency over prior generations. The RX 9070 offers similar performance to the RX 7900 XT in rasterization workloads despite having a lower power budget, less memory bandwidth, and a smaller last level cache. Techspot also shows the RX 9070 leading in raytracing workloads, which aligns with AMD's goal of improving raytracing performance.

Slide from RDNA4’s launch presentation, not Hot Chips 2025

AMD achieves this efficiency using compression, better raytracing structures, and a larger L2 cache. As a result, RDNA4 can pack its performance into a relatively small 356.5 mm² die and use a modest 256-bit GDDR6 memory setup. Display and media engine improvements are welcome too. Multi-monitor idle power feels like a neglected area for discrete GPUs, even though I know many people use multiple monitors for productivity. Lowering idle power in those setups is much appreciated. On the media engine side, AMD’s video encoding capabilities have often lagged behind the competition. RDNA4’s progress at least prevents AMD from falling as far behind as they have before.

If you like the content then consider heading over to the Patreon or PayPal if you want to toss a few bucks to Chips and Cheese. Also consider joining the Discord.

Intel’s E2200 “Mount Morgan” IPU at Hot Chips 2025

2025-09-11 05:35:12

Intel’s IPUs, or Infrastructure Processing Units, evolved as network adapters developed increasingly sophisticated offload capabilities. IPUs take things a step further, aiming to take on a wide variety of infrastructure services in a cloud environment in addition to traditional software defined networking functions. Infrastructure services are run by the cloud operator and orchestrate tasks like provisioning VMs or collecting metrics. They won’t stress a modern server CPU, but every CPU core set aside for those tasks is one that can’t be rented out to customers. Offloading infrastructure workloads also provides an extra layer of isolation between a cloud provider’s code and customer workloads. If a cloud provider rents out bare metal servers, running infrastructure services within the server may not even be an option.

Intel’s incoming “Mount Morgan” IPU packs a variety of highly configurable accelerators alongside general purpose CPU cores, and aims to capture as many infrastructure tasks as possible. It shares those characteristics with its predecessor, “Mount Evans”. Flexibility is the name of the game with these IPUs, which can appear as a particularly capable network card to up to four host servers, or run standalone to act as a small server. Compared to Mount Evans, Mount Morgan packs more general purpose compute power, improved accelerators, and more off-chip bandwidth to support the whole package.

Compute Complex

Intel includes a set of Arm cores in their IPU, because CPUs are the ultimate word in programmability. They run Linux, let the IPU handle a wide range of infrastructure services, and ensure the IPU stays relevant as infrastructure requirements change. Mount Morgan’s compute complex gets an upgrade to 24 Arm Neoverse N2 cores, up from 16 Neoverse N1 cores in Mount Evans. Intel didn’t disclose the exact core configuration, but Mount Evans set its Neoverse N1 cores up with 512 KB L2 caches and ran them at 2.5 GHz. That’s not the fastest Neoverse N1 configuration around, but it’s still nothing to sneeze at. Mount Morgan of course takes things further. Neoverse N2 is a 5-wide out-of-order core with a 160 entry ROB, ample execution resources, and a very capable branch predictor, making each core a substantial upgrade over Neoverse N1. 24 Neoverse N2 cores would be enough to handle some production server workloads, let alone a collection of infrastructure services.

Mount Morgan gets a memory subsystem upgrade to quad channel LPDDR5-6400 to feed the more powerful compute complex. Mount Evans had a triple channel LPDDR4X-4267 setup, connected to 48 GB of onboard memory capacity. If Intel keeps the same memory capacity per channel, Mount Morgan would have 64 GB of onboard memory. Assuming Intel’s presentation refers to 16-bit LPDDR4/5(X) channels, Mount Morgan would have 51.2 GB/s of DRAM bandwidth compared to 25.6 GB/s in Mount Evans. Those figures would be doubled if Intel refers to 32-bit data buses to LPDDR chips, rather than channels. A 32 MB System Level Cache helps reduce pressure on the memory controllers. Intel didn’t increase the cache’s capacity compared to the last generation, so 32 MB likely strikes a good balance between hitrate and die area requirements. The System Level Cache is truly system level, meaning it services the IPU’s various hardware acceleration blocks in addition to the CPU cores.
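For reference, here’s how those bandwidth figures fall out of the quoted configurations, assuming 16-bit LPDDR channels as discussed above; double the results if Intel means 32-bit buses:

```python
# Rough DRAM bandwidth estimates for the two IPU generations.
def lpddr_bw_gbs(transfer_rate_mts, channels, channel_bits=16):
    # transfer rate (MT/s) * channel count * bytes per transfer
    return transfer_rate_mts * 1e6 * channels * (channel_bits / 8) / 1e9

mount_evans  = lpddr_bw_gbs(4267, channels=3)   # ~25.6 GB/s
mount_morgan = lpddr_bw_gbs(6400, channels=4)   # ~51.2 GB/s
print(f"Mount Evans:  {mount_evans:.1f} GB/s")
print(f"Mount Morgan: {mount_morgan:.1f} GB/s")
```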

A Lookaside Crypto and Compression Engine (LCE) sits within the compute complex, and shares lineage with Intel’s Quickassist (QAT) accelerator line. Intel says the LCE features a number of upgrades over QAT targeted towards IPU use cases. But perhaps the most notable upgrade is getting asymmetric crypto support, which was conspicuously missing from Mount Evans’s LCE block. Asymmetric cryptography algorithms like RSA and ECDHE are used in TLS handshakes, and aren’t accelerated by special instructions on many server CPUs. Therefore, asymmetric crypto can consume significant CPU power when a server handles many connections per second. It was a compelling use case for QAT, and it’s great to see Mount Morgan get that as well. The LCE block also supports symmetric crypto and compression algorithms, capabilities inherited from QAT.
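To get a feel for why that matters, the sketch below times the RSA private-key operation a server performs during a TLS handshake, using the generic Python cryptography package. It’s only meant to illustrate per-handshake CPU cost; it has nothing to do with the LCE’s actual interface, and the message contents are a stand-in.

```python
# Rough illustration of per-handshake asymmetric crypto cost on a CPU.
# Uses the 'cryptography' package; this is generic code, not an LCE API.
import time
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding

key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
message = b"transcript hash stand-in"

iterations = 200
start = time.perf_counter()
for _ in range(iterations):
    key.sign(message, padding.PKCS1v15(), hashes.SHA256())
elapsed = time.perf_counter() - start

per_op_ms = elapsed / iterations * 1e3
print(f"~{per_op_ms:.2f} ms per RSA-2048 sign, "
      f"~{1 / (per_op_ms / 1e3):.0f} signs/s per core at 100% load")
```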

A programmable DMA engine in the LCE lets cloud providers move data as part of hardware accelerated workflows. Intel gives an example workflow for accessing remote storage, where the LCE helps move, compress, and encrypt data. Other accelerator blocks located in the IPU’s network subsystem help complete the process.
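A rough software stand-in for that kind of chained workflow might look like the following, with zlib and AES-GCM filling in for the hardware compression and crypto engines. The function and buffer names are made up for illustration, and this is a conceptual sketch rather than Intel’s programming model.

```python
# Conceptual stand-in for a remote-storage write path of the kind the LCE
# can offload: move a buffer, compress it, encrypt it, then hand it to the
# network subsystem. zlib and AES-GCM are software stand-ins for the
# hardware compression and crypto engines, not Intel's actual interfaces.
import os
import zlib
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def storage_write_path(buffer: bytes, key: bytes) -> bytes:
    compressed = zlib.compress(buffer)                          # compression engine
    nonce = os.urandom(12)
    ciphertext = AESGCM(key).encrypt(nonce, compressed, None)   # crypto engine
    return nonce + ciphertext                                   # handed off to networking

key = AESGCM.generate_key(bit_length=256)
payload = storage_write_path(b"4KB block contents" * 200, key)
print(f"{len(payload)} bytes queued for the network subsystem")
```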

Network Subsystem

Networking bandwidth and offloads are a core function of the IPU, and their importance can’t be overstated. Cloud servers need high network and storage bandwidth, and the two are often two sides of the same coin because cloud providers might use separate storage servers accessed over datacenter networking. Mount Morgan has 400 Gbps of Ethernet throughput, double Mount Evans’s 200 Gbps.

True to its smart NIC lineage, Mount Morgan uses a large number of inline accelerators to handle cloud networking tasks. A programmable P4-based packet processing pipeline, called the FXP, sits at the heart of the network subsystem. P4 is a packet processing language that lets developers express how they want packets handled. Hardware blocks within the FXP pipeline closely match P4 demands. A parser decodes packet headers and translates the packet into a representation understood by downstream stages. Downstream stages can check for exact or wildcard matches. Longest prefix matches can be carried out in hardware too, which is useful for routing.
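As a rough mental model of those match stages, the sketch below implements an exact-match table and a longest prefix match lookup in plain Python. It illustrates the concepts a P4 program would express, not Intel’s FXP interface, and the table entries are invented.

```python
# Rough software model of the match stages described above: an exact-match
# table and a longest-prefix-match lookup. Illustrative only.
import ipaddress

exact_table = {("10.0.0.5", 443): "allow"}          # e.g. ACL entries

lpm_table = {                                        # routing entries
    ipaddress.ip_network("10.0.0.0/8"):  "vport_1",
    ipaddress.ip_network("10.1.0.0/16"): "vport_2",
    ipaddress.ip_network("0.0.0.0/0"):   "uplink",
}

def lookup(dst_ip: str, dst_port: int):
    action = exact_table.get((dst_ip, dst_port), "default")
    addr = ipaddress.ip_address(dst_ip)
    # Longest prefix match: pick the most specific matching route
    best = max((net for net in lpm_table if addr in net),
               key=lambda net: net.prefixlen)
    return action, lpm_table[best]

print(lookup("10.1.2.3", 80))    # ('default', 'vport_2')
print(lookup("10.0.0.5", 443))   # ('allow', 'vport_1')
```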

The FXP can handle a packet every cycle, and can be configured to perform multiple passes per packet. Intel gives an example where one pass processes outer packet layers to perform decapsulation and checks against access control lists. A second pass can look at the inner packet, and carry out connection tracking or implement firewall rules.

An inline crypto block sits within the network subsystem as well. Unlike the LCE in the compute complex, this crypto block is dedicated to packet processing and focuses on symmetric cryptography. It includes its own packet parsers, letting it terminate IPSec and PSP connections and carry out IPSec/PSP functions like anti-replay window protection, sequence number generation, and error checking in hardware. IPSec is used for VPN connections, which are vital for letting customers connect to cloud services. PSP is Google’s protocol for encrypting data transfers internal to Google’s cloud. Compared to Mount Evans, the crypto block’s throughput has been doubled to support 400 Gbps, and it supports 64 million flows.
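Anti-replay window protection is a good example of the per-packet bookkeeping being offloaded here. The sketch below shows the classic sliding-window bitmap check in the style of RFC 4303; it’s a software illustration of the idea, not Intel’s implementation.

```python
# Minimal sliding-window anti-replay check of the kind IPSec requires
# (RFC 4303 style). A software sketch, not Intel's implementation.
class AntiReplayWindow:
    def __init__(self, window_size: int = 64):
        self.size = window_size
        self.highest = 0          # highest sequence number seen
        self.bitmap = 0           # bit i set => (highest - i) was seen

    def check_and_update(self, seq: int) -> bool:
        """Return True if the packet is fresh, False if replayed/too old."""
        if seq > self.highest:                     # advances the window
            self.bitmap = (self.bitmap << (seq - self.highest)) | 1
            self.bitmap &= (1 << self.size) - 1
            self.highest = seq
            return True
        offset = self.highest - seq
        if offset >= self.size:                    # older than the window
            return False
        if self.bitmap & (1 << offset):            # already seen: replay
            return False
        self.bitmap |= 1 << offset
        return True

w = AntiReplayWindow()
print([w.check_and_update(s) for s in (1, 2, 2, 70, 10, 5)])
# [True, True, False, True, True, False]
```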

Cloud providers have to handle customer network traffic while ensuring fairness. Customers only pay for a provisioned amount of network bandwidth. Furthermore, customer traffic can’t be allowed to monopolize the network and cause problems with infrastructure services. The IPU has a traffic shaper block, letting it carry out quality of service measures completely in hardware. One mode uses a multi-level hierarchical scheduler to arbitrate between packets based on source port, destination port, and traffic class. Another “timing wheel” mode does per-flow packet pacing, which can be controlled by classification rules set up at the FXP. Intel says the timing wheel mode gives a pacing resolution of 512 nanoseconds per slot.
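To show what per-flow pacing with a timing wheel looks like, here’s a simplified software model. The 512 ns slot size comes from Intel’s figure; everything else (slot count, flow names, rates) is invented for illustration.

```python
# Simplified timing-wheel pacer: packets are placed into future slots based
# on each flow's configured rate, and slots are drained in time order.
SLOT_NS = 512

class TimingWheel:
    def __init__(self, num_slots: int = 4096):
        self.slots = [[] for _ in range(num_slots)]
        self.num_slots = num_slots
        self.current = 0                     # current slot index
        self.next_send_ns = {}               # per-flow earliest departure time

    def enqueue(self, flow: str, pkt_bytes: int, rate_gbps: float, now_ns: int):
        # Departure time = max(now, flow's next slot), then push the flow's
        # next slot out by the time this packet occupies at its rate.
        depart = max(now_ns, self.next_send_ns.get(flow, now_ns))
        self.next_send_ns[flow] = depart + pkt_bytes * 8 / rate_gbps
        slot = int(depart // SLOT_NS) % self.num_slots
        self.slots[slot].append((flow, pkt_bytes))

    def advance(self):
        """Drain the current slot and move to the next one (one tick = 512 ns)."""
        ready, self.slots[self.current] = self.slots[self.current], []
        self.current = (self.current + 1) % self.num_slots
        return ready

wheel = TimingWheel()
wheel.enqueue("vm_a", pkt_bytes=1500, rate_gbps=10, now_ns=0)
wheel.enqueue("vm_a", pkt_bytes=1500, rate_gbps=10, now_ns=0)
print(wheel.advance())   # first packet departs now; the second lands ~1.2 us later
```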

RDMA traffic accounts for a significant portion of datacenter traffic. For example, Azure says RDMA accounts for 70% of intra-cloud network traffic, and is used for disk IO. Mount Morgan has a RDMA transport option to provide hardware offload for that traffic. It can support two million queue pairs across multiple hosts, and can expose 1K virtual functions per host. The latter should let a cloud provider directly expose RDMA acceleration capabilities to VMs. To ensure reliable transport, the RDMA transport engine supports the Falcon and Swift transport protocols. Both protocols offer improvements over TCP, and Intel implements congestion control for those protocols completely in hardware. To reduce latency, the RDMA block can bypass the packet processing pipeline and handle RDMA connections on its own.

All of the accelerator blocks above are clients of the system level cache. Some hardware acceleration use cases, like connection tracking with millions of flows, can have significant memory footprints. The system level cache should let the IPU keep frequently accessed portions of accelerator memory structures on-chip, reducing DRAM bandwidth needs.

Host Fabric and PCIe Switch

Mount Morgan’s PCIe capabilities have grown far beyond what a normal network card may offer. It has 32 PCIe Gen 5 lanes, providing more IO bandwidth than some recent desktop CPUs. It’s also a huge upgrade over the 16 PCIe Gen 4 lanes in Mount Evans.

Traditionally, a network card sits downstream of a host, and thus appears as a device attached to a server. The host fabric and PCIe subsystem is flexible enough to let the IPU wear many hats. It can appear as a downstream device to up to four server hosts, each of which sees the IPU as a separate, independent device. Mount Evans supported this “multi-host” mode as well, but Mount Morgan’s higher PCIe bandwidth is necessary to keep its 400 Gigabit networking fed.
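A quick back-of-the-envelope calculation shows why: 400 Gbps of Ethernet traffic works out to 50 GB/s, which exceeds what 16 PCIe Gen 4 lanes can carry, while 32 Gen 5 lanes leave plenty of headroom. The per-lane figures below approximate usable bandwidth after encoding overhead.

```python
# Quick check of why 400 Gb Ethernet needs more than 16 PCIe Gen 4 lanes.
GEN4_GBS_PER_LANE = 1.97    # ~2 GB/s per lane per direction (16 GT/s, 128b/130b)
GEN5_GBS_PER_LANE = 3.94    # ~4 GB/s per lane per direction (32 GT/s, 128b/130b)

mount_evans_pcie  = 16 * GEN4_GBS_PER_LANE   # ~31.5 GB/s
mount_morgan_pcie = 32 * GEN5_GBS_PER_LANE   # ~126 GB/s
network_400g = 400 / 8                       # 50 GB/s of Ethernet traffic

print(f"Mount Evans PCIe:  {mount_evans_pcie:.1f} GB/s per direction")
print(f"Mount Morgan PCIe: {mount_morgan_pcie:.1f} GB/s per direction")
print(f"400GbE line rate:  {network_400g:.1f} GB/s")
```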

Mount Morgan can run in a “headless” mode, where it acts as a standalone server and a lightweight alternative to dedicating a traditional server to infrastructure tasks. In this mode, Mount Morgan’s 32 PCIe lanes can let it connect to many SSDs and other devices. The IPU’s accelerators as well as the PCIe lanes appear downstream of the IPU’s CPU cores, which act as a host CPU.

A “converged” mode can use some PCIe lanes to connect to upstream server hosts, while other lanes connect to downstream devices. In this mode, the IPU shows up as a PCIe switch to connected hosts, with downstream devices visible behind it. A server could connect to SSDs and GPUs through the IPU. The IPU’s CPU cores can sit on top of the PCIe switch and access downstream devices, or can be exposed as a downstream device behind the PCIe switch.

The IPU’s multiple modes are a showcase of IO flexibility. It’s a bit like how AMD uses the same die as an IO die within the CPU and as part of the motherboard chipset on AM4 platforms. The IO die’s PCIe lanes can connect to downstream devices when it serves within the CPU, or be split between an upstream host and downstream devices when used in the chipset. Intel is also no stranger to PCIe configurability. Their early QAT PCIe cards reused the Lewisburg chipset, exposing it as a downstream device with three QAT devices appearing behind a PCIe switch.

Final Words

Cloud computing plays a huge role in the tech world today. It originally started with commodity hardware, with similar server configurations to what customers might deploy in on-premise environments. But as cloud computing expanded, cloud providers started to see use cases for cloud-specific hardware accelerators. Examples include "Nitro" cards in Amazon Web Services, or smart NICs with FPGAs in Microsoft Azure. Intel has no doubt seen this trend, and IPUs are the company's answer.

Mount Morgan tries to service all kinds of cloud acceleration needs by packing an incredible number of highly configurable accelerators, in recognition of cloud providers’ diverse and changing needs. Hardware acceleration always runs the danger of becoming obsolete as protocols change. Intel tries to avoid this by building very generalized accelerators, like the FXP, and by packing in CPU cores that can run just about anything under the sun. The latter may feel like overkill for infrastructure tasks alone, but that headroom could keep the IPU relevant even if some of its acceleration capabilities become obsolete.

At a higher level, IPUs like Mount Morgan show that Intel still has ambitions to stretch beyond its core CPU market. Developing Mount Morgan must have been a complex endeavor. It’s a showcase of Intel’s engineering capability even when their CPU side goes through a bit of a rough spot. It’ll be interesting to see whether Intel’s IPUs can gain ground in the cloud market, especially with providers that have already developed in-house hardware offload capabilities tailored to their requirements.

If you like the content then consider heading over to the Patreon or PayPal if you want to toss a few bucks to Chips and Cheese. Also consider joining the Discord.