
Arm’s Bifrost Architecture and the Mali-G52

2025-05-10 11:50:53

Arm (the company) is best known for its Cortex CPU line. But Arm today has expanded to offer a variety of licensable IP blocks, ranging from interconnects to IOMMUs to GPUs. GPUs make for an interesting discussion topic because they’ve evolved to become highly programmable and complex components, much like CPUs. Like CPU performance, GPU performance is highly visible to users, and forms an important part of a device’s specifications.

Arm Mali GPUs target low power and embedded devices, a characteristic shared with Arm’s Cortex CPUs. As a GPU, Mali tackles the same fundamental problems as high performance discrete GPUs that gamers and PC enthusiasts are familiar with. Graphics processing has plenty of inherent parallelism; it maps well to hardware that can track a lot of parallel work, and map it to a wide array of execution units. However, power and area constraints force a delicate approach to exploiting parallelism. A low-end laptop GPU may have at most a dozen watts to work with; anything above 6W will likely be unsustainable in a cell phone or tablet. Die area constraints are similarly tight, because an iGPU has to share a small die alongside a CPU and numerous accelerator blocks.

In another difference from AMD, Intel, and Nvidia GPUs, Mali is licensed out as a standalone IP block; Arm does not control the chip design process. Instead, implementers buy Arm IP and bring together a wide variety of other IP blocks to meet their chip-level goals. This business model makes Mali peculiar in the GPU space. Mali only handles 3D rendering and parallel compute, doesn’t provide hardware acceleration for video codecs, and can’t even drive a display by itself. A PC enthusiast expects this functionality to come with any GPU. However, excluding it from the Mali package lets implementers pick and choose video and display engines to meet their needs. Hypothetically, an implementer could even go without video and display engines, and use Mali purely as compute accelerator. Lack of control over the chip design process creates further challenges: Mali has to perform well across the widest possible range of use cases to increase Arm’s customer base, yet Arm has to do so with no control over the all-important chip-level memory subsystem.

Bifrost is Arm’s second generation unified shader architecture from around 2016. It comes after Midgard, which brought unified shaders to Mali well after AMD, Intel, and Nvidia did so for their GPU lines. For this article I’ll use data from the Mali-G52, as implemented in the Amlogic S922X. The Mali-G52 is a very small GPU, so I’ll also use comparison data from Qualcomm’s Adreno 615, as implemented in the Snapdragon 670.

Software APIs

GPU programming interfaces saw enormous change in the late 2000s and early 2010s. Graphics APIs moved to a unified shader model, where different shader stages run on the same execution pipelines. GPU compute rose quickly as those execution pipelines became increasingly flexible. Arm’s Midgard got onboard the programmability train with OpenGL ES 3.0, Vulkan, and OpenCL 1.1 support. While Midgard could handle modern APIs, its VLIW4 setup had brittle performance characteristics. Arm’s compiler could be hard pressed to extract enough instruction level parallelism to fill a VLIW4 bundle, especially with compute code. Even in graphics code, Arm noted that 3-wide vectors were very common and could leave one VLIW4 component unused.

Bifrost switches to a scalar, dual issue execution model to address Midgard’s shortcomings. From a single thread’s point of view, registers are now 32-bits wide rather than 4×32-bit vectors. Instead of having one thread issue operations to fill four FP32 lanes, Bifrost relies on multiple API threads to fill an Execution Engine’s four or eight lanes. A lane in Bifrost feeds both a FMA and FADD execution pipeline, so Bifrost still benefits from instruction level parallelism. However, packing two operations into an instruction should be easier than four. As a result, Arm hopes to achieve more consistent performance with a simpler compiler.

Arm’s move from SIMD to scalar execution within each API thread with Bifrost parallels AMD’s Terascale to GCN transition, aiming for more consistent performance across compute workloads.

GPU Organization

Midgard and Bifrost may be vastly different at the execution pipeline level, but the two share a similar high level organization. Bifrost Execution Engines (EEs) contain execution pipelines and register files, and act as a replacement for Midgard’s arithmetic pipelines. Looking outside Arm, EEs are the rough equivalent of Intel Execution Units (EUs) or AMD SIMDs.

Multiple EEs live in a Shader Core (SC). A messaging fabric internal to the Shader Core links EEs to memory pipelines and other shared fixed function hardware. A SC’s texture and load/store units include first level caches, making them a close equivalent to Intel’s subslices, AMD’s CUs or WGPs, or Nvidia’s SMs. One difference is that Bifrost places pixel backends (ROPs) at the Shader Core level, while desktop GPU architectures place them at a higher level subdivision. Bifrost doesn’t have another subdivision level beyond the Shader Core.

A defining Mali characteristic is an extraordinary number of levers for tuning GPU size. Besides adjusting Shader Core count, Arm can adjust a Shader Core’s EE count, cache sizes, and ROP/TMU throughput.

Arm’s Mali-G51 can have a single-EE SC with 1 TMU/ROP, or up to three 2 TMU/ROP SCs with a triple EE setup

Flexibility extends to the EEs, which can operate on 4- or 8-wide warps, with correspondingly wide execution pipelines. Arm can therefore finely adjust GPU size in multiple dimensions to precisely target performance, power, and area goals. By comparison, AMD and Nvidia generally use the same WGP/SM structure, from integrated GPUs all the way to 300W+ monsters.

Visualizing the Mali-G52 in the Amlogic S922X

Bifrost can theoretically scale to 32 Shader Cores. Doing so with triple-EE SCs would provide 1.23 TFLOPS of FP32 FMA performance at 800 MHz, which is in the same ballpark as Intel’s largest Skylake GT4e configuration. It’s not high power or high performance by discrete GPU standards, but well above what would fit in an average cell phone or tablet. The Mali-G52 in the Amlogic S922X is a small Bifrost configuration, with two triple-EE SCs running at 800 MHz, each EE 8-wide. Qualcomm’s Adreno can scale by varying Shader Processor count and uSPTP size. Adreno 6xx’s execution unit partitions are much larger at either 64- or 128-wide.
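
For reference, the arithmetic behind those throughput figures is straightforward, counting an FMA as two FLOPs (a quick calculation using the configurations described above):

```c
// FP32 FMA throughput arithmetic for Bifrost configurations (2 FLOPs per FMA)
#include <stdio.h>

int main(void) {
    const int lanes_per_ee = 8, ees_per_sc = 3;
    const double clk_ghz = 0.8;   // 800 MHz

    // Hypothetical maximum Bifrost configuration: 32 triple-EE Shader Cores
    double max_gflops = 32 * ees_per_sc * lanes_per_ee * 2 * clk_ghz;
    // Mali-G52 in the Amlogic S922X: 2 triple-EE Shader Cores
    double g52_gflops = 2 * ees_per_sc * lanes_per_ee * 2 * clk_ghz;

    printf("32 SC Bifrost: %.1f GFLOPS FP32 FMA\n", max_gflops); // ~1228.8, or 1.23 TFLOPS
    printf("Mali-G52:      %.1f GFLOPS FP32 FMA\n", g52_gflops); // 76.8
    return 0;
}
```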

Visualizing Adreno 615 in the S670. Not sure what ROP throughput is like, so I’ve drawn a single red bar

Shader Cores across a Bifrost GPU share a L2 cache. A standard ACE memory bus connects Bifrost to the rest of the system, and Arm’s influence ends at that point.

Shader Frontend

Bifrost’s instruction cache capacity hasn’t been disclosed, but instruction throughput is highest in a loop with 512 or fewer FP adds; as the loop body exceeds 1280 FP adds, instruction throughput takes another dip. Bifrost uses 78-bit instructions, which specify two operations corresponding to the FMA and FADD pipes. Arm’s compiler can issue FP adds down both pipes. Compiled binary size increases by 6-7 bytes for each FP add statement, with FMA+FADD packing decreasing size and clause/quadword headers adding overhead. Based on binary size increase over a baseline, instruction cache capacity is possibly around 8 KB, which would put it in line with Qualcomm’s Adreno 6xx.
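
A rough sanity check on that estimate, assuming ~6.5 bytes of code per FP add (the midpoint of the observed 6-7 byte growth):

```c
// Back-of-the-envelope instruction cache estimate from the loop size experiment
#include <stdio.h>

int main(void) {
    const double bytes_per_add = 6.5;   // midpoint of the observed 6-7 bytes per FP add
    const int adds_at_dip = 1280;       // loop size where throughput takes another dip
    double footprint = adds_at_dip * bytes_per_add;
    printf("Code footprint at the dip: %.0f bytes (~%.1f KB)\n",
           footprint, footprint / 1024.0);   // ~8320 bytes, roughly 8 KB
    return 0;
}
```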

Each Execution Engine tracks state for up to 16 warps, each of which corresponds to a vector of eight API threads executing in lockstep. Hardware switches between warps to hide latency, much like SMT on a CPU. Bifrost uses a clause-based ISA to simplify scheduling. Instructions packed into a clause execute atomically, and architectural state is only well-defined between clauses. Only one instruction in a clause can access long- and variable-latency units outside the EU, such as memory pipelines.

Memory dependencies are managed between clauses, so an instruction that needs data from a memory access must go into a separate clause. A 6-entry software managed scoreboard specifies cross-clause dependencies. Clauses reduce pressure on scheduling hardware, which only has to consult the scoreboard at clause boundaries rather than with each instruction. Bifrost has parallels to AMD’s Terascale, which also uses clauses, though their implementation details differ. Terascale groups instructions into clauses based on type; for example, math instructions go into an ALU clause, and memory accesses go into separate texture- or vertex-fetch clauses.

From a programmer’s perspective, Mali-G52 can theoretically have 768 active workitems across the GPU; that’s eight lanes per warp * 16 warps per EE * 6 EEs across the GPU. In practice, actual active thread count, or occupancy, can vary depending on available parallelism and register usage; Bifrost’s ISA provides up to 64 registers, but using more than 32 will halve theoretical occupancy (implying 16 KB of register file capacity). There are no intermediate allocation steps. For comparison, Qualcomm’s Adreno 6xx can only achieve maximum occupancy with 12 registers per thread.
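
The arithmetic behind those occupancy and register file figures, for reference:

```c
// Occupancy and register file arithmetic for Mali-G52
#include <stdio.h>

int main(void) {
    const int lanes = 8, warps_per_ee = 16, ees = 6;
    printf("Max active workitems: %d\n", lanes * warps_per_ee * ees);   // 768

    // Full occupancy with up to 32 registers per thread implies the register file size:
    // 16 warps * 8 lanes * 32 registers * 4 bytes
    const int regs = 32;
    printf("Implied register file per EE: %d KB\n",
           warps_per_ee * lanes * regs * 4 / 1024);   // 16 KB
    return 0;
}
```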

Register Access and Execution Pipelines

Instruction execution is split into register access, FMA, and FADD stages. During the register access stage, a Bifrost instruction takes direct control of the operand collector to read instruction inputs and write results from the prior instruction. Each EE’s register file has four ports; two can handle reads, one can handle writes, and one can handle either. Feeding the EE’s FMA and FADD pipes would nominally require six inputs, so register read bandwidth is very limited. If a prior instruction wants to write results from both the FMA and FADD pipes, register bandwidth constraints only become more severe.

To alleviate register file bandwidth demands, Bifrost can source inputs from a uniform/constant port, which provides 1×64 bits or 2×32 bits of data from immediate values embedded into a clause. Additionally, Bifrost provides “temporary registers”, which are really software-controlled forwarding paths that hold results from the prior instruction. Finally, because the FADD unit is placed at a later pipeline stage than the FMA unit, the FADD unit can use the FMA unit’s result as an input.

Bifrost’s temporary registers, software-managed operand collector, and register bandwidth constraints will be immediately familiar to enjoyers of AMD’s Terascale 2 architecture. Terascale 2 uses 12 register file inputs to feed five VLIW lanes, which could need up to 15 inputs. Just like Bifrost, AMD’s compiler uses a combination of register reuse, temporary registers (PV/PS), and constant reads to keep the execution units fed. Like Bifrost, PV/PS are only valid between consecutive instructions in the same clause, and reduce both register bandwidth and allocation requirements. One difference is that result writeback on Terascale 2 doesn’t share register file bandwidth with reads, so using temporary registers isn’t as critical.

Bifrost’s execution pipelines have impressive flexibility when handling different data types and essentially maintain 256-bit vector execution (or 128-bit on 4-wide EE variants) with 32/16/8-bit data types. Machine learning research was already well underway in the years leading up to Bifrost, and rather than going all-in as Nvidia did with Volta’s dedicated matrix multiplication units, Arm made sure the vector execution units could scale throughput with lower-precision types.

Bifrost stops taking divergence penalties when blocks of 8 API threads take the same branch direction, while Adreno 6xx wants 64 API threads to go the same way

Qualcomm’s Adreno 615 takes a different execution strategy, with 64-wide warps and correspondingly wide execution units. That makes Adreno 615 more prone to divergence penalties, but lets Qualcomm control more parallel execution units with one instruction. Adreno 615 has 128 FP32 lanes across the GPU, all capable of multiply-adds, and runs them at a very low 430 MHz. Mali-G52 can only do 48 FP32 FMA operations per clock, but can complete 96 FP32 operations per clock with FMA+FADD dual issue. Combined with a higher 800 MHz clock speed, Mali-G52 can provide similar FP add throughput to Adreno 615. However, Adreno 615 fares better with multiply-adds, and can reach just over 100 GFLOPS.

I’m being specific by saying multiply-adds, not fused multiply-adds; the latter rounds only once after the multiply and add (both computed with higher intermediate precision), which improves accuracy. Adreno apparently has no fast-path FMA hardware, and demanding FMA accuracy (via OpenCL’s fma function) requires over 600 cycles per FMA per warp. Bifrost handles FMA with no issues. Both mobile GPUs shine with FP16, which executes at double rate compared to FP32.
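
A minimal OpenCL C sketch shows the distinction; the kernel names and dependent-chain structure here are illustrative rather than the exact test used:

```c
// OpenCL C: plain multiply-add (compiler may emit a mad) versus an explicit fma().
// Dependent chains keep the compiler from collapsing the loops.
__kernel void mad_chain(__global float *out, float a, float b) {
    float acc = get_global_id(0) * 0.001f;
    for (int i = 0; i < 1024; i++)
        acc = acc * a + b;        // multiply-add: single rounding not required
    out[get_global_id(0)] = acc;
}

__kernel void fma_chain(__global float *out, float a, float b) {
    float acc = get_global_id(0) * 0.001f;
    for (int i = 0; i < 1024; i++)
        acc = fma(acc, a, b);     // fused multiply-add: rounds once, as described above
    out[get_global_id(0)] = acc;
}
```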

Special functions like inverse square roots execute down Bifrost’s FADD pipeline, at half rate compared to basic operations (or quarter rate if considering FMA+FADD dual issue). Arm has optimized handling for such complex operations on Bifrost compared to Midgard. Only the most common built-in functions exposed in APIs like OpenCL get handled with a single instruction. More complex special operations take multiple instructions. Adreno executes special functions at a lower 1/8 rate.

Integer operations see a similar split on Bifrost as FP ones. Integer adds can execute down the FADD pipe, while multiplies use the FMA pipe. Adreno’s more uniform setup gives it an advantage for adds. Both of Bifrost’s pipes handle their respective integer operations at full rate though, giving Bifrost an advantage for integer multiplies.

Lower precision INT8 operations see excellent throughput on Bifrost, but suffer glass jaw behavior on Adreno. Clearly Qualcomm didn’t implement fast-path INT8 hardware, even though an INT8 operation could be carried out on INT32 units with the result masked to 8 bits. Terascale 2 also lacks INT8 hardware, but can emulate INT8 operations at just under half rate. FP64 support is absent on both mobile GPUs.

Bifrost’s execution engines are built for maximum flexibility, and handle a wider range of operations with decent performance than Qualcomm’s Adreno does. Adreno, by comparison, appears tightly optimized for graphics. Graphics rasterization doesn’t need the higher precision of fused multiply-adds, nor does it need lower precision INT8 operations. Qualcomm packs a variety of other accelerators onto Snapdragon chips, which could explain why they don’t feel the need for consistently high GPU performance across such a wide range of use cases. Arm’s licensing business model means they can’t rely on the chip including other accelerators, and Bifrost’s design reflects that.

Memory Pipeline

Bifrost’s memory subsystem includes separate texture and load/store paths, each with their own caches. EEs access these memory pipelines through the Shader Core’s messaging network. Intel uses similar terminology, with EUs accessing memory by sending messages across the subslice’s internal messaging fabric. AMD’s CUs/WGPs and Nvidia’s SMs have some sort of interconnect to link execution unit partitions to shared memory pipelines; Arm and Intel’s intra-core networks may be more flexible still, since they allow for variable numbers of execution unit partitions:

Arm’s documentation states Mali-G52’s load/store cache and texture caches are both 16 KB. However, latency testing suggests the texture cache is 8 KB. These parameters may be configurable on implementer request, and indeed vary across different Bifrost SKUs. For example, Mali-G71, a first-generation Bifrost variant, has 16 KB and 8 KB load/store and texture caches respectively on the spec sheet.

Bifrost only supports 1D texture sizes up to 64 KB

Pointer chasing in the texture cache carries slightly higher latency than doing so in the load/store cache, which isn’t surprising. However it’s worth noting some GPUs, like AMD’s Terascale, have TMUs that can carry out indexed addressing faster than doing the same calculation in the programmable shader execution units.

Texture cache bandwidth is low on Bifrost compared to Adreno and Intel’s Gen 9. OpenCL’s read_imageui function returns a vector of four 32-bit integers, which can be counted as a sample of sorts. Mali-G52 can deliver 26.05 bytes per Shader Core cycle via read_imageui, consistent with Arm’s documentation which states a large Shader Core variant can do two samples per clock. Adreno 615 achieves 61.3 bytes per uSPTP cycle, or four samples. That’s enough to put Adreno 615 ahead despite its low clock speed. I suppose Qualcomm decided to optimize Adreno’s texture pipeline for throughput rather than caching capacity, because its 1 KB texture cache is small by any standard.
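
For illustration, a texture bandwidth kernel along these lines exercises that path; the loop structure below is a simplified sketch rather than the exact test:

```c
// OpenCL C: texture-path bandwidth sketch using read_imageui (each sample is a uint4, 16 bytes)
__constant sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                           CLK_ADDRESS_CLAMP_TO_EDGE |
                           CLK_FILTER_NEAREST;

__kernel void tex_bw(__read_only image2d_t img, __global uint4 *out, int width) {
    int gid = get_global_id(0);
    uint4 acc = (uint4)(0);
    for (int i = 0; i < 256; i++)
        acc += read_imageui(img, smp, (int2)((gid * 256 + i) % width, 0));
    out[gid] = acc;   // keep the samples live so the compiler doesn't drop the reads
}
```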

It’s amazing how little cache bandwidth these mobile GPUs have. Even on the same chip, the Amlogic S922X’s four A73 cores can hit 120 GB/s of L1D read bandwidth. Global memory bandwidth for compute applications is similarly limited. Adreno 615 and Mali-G52 are about evenly matched. A Bifrost SC delivered 16 bytes per cycle in a global memory bandwidth test, while an Adreno 615 uSPTP can load 32 bytes per cycle from L2.

Bifrost’s L1 can likely deliver 32 bytes per cycle, matching the texture cache, because I can get that from local memory using float4 loads. I don’t have a float4 version of my global memory bandwidth test written yet, but results from testing local memory should suffice:

Bifrost metaphorically drops local memory on the ground. GPU programming APIs provide a workgroup-local memory space, called local memory in OpenCL or Shared Memory in Vulkan. GPU hardware usually backs this with dedicated on-chip storage. Examples include AMD’s Local Data Share, and Nvidia/Intel reserving a portion of cache to use as local memory.

Mali GPUs do not implement dedicated on-chip shared memory for compute shaders; shared memory is simply system RAM backed by the load-store cache just like any other memory type

Arm Mali GPUs Best Practices Developer Guide

Bifrost doesn’t give local memory any special treatment. An OpenCL kernel can allocate up to 32 KB of local memory, but accesses to local memory aren’t guaranteed to remain on-chip. Worse, each Shader Core can only have one workgroup with local memory allocated, even if that workgroup doesn’t need all 32 KB.
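
For illustration, here’s a minimal OpenCL C kernel that stages data through local memory. On most GPUs the __local array maps to dedicated on-chip storage; on Bifrost it’s simply load/store-cache-backed memory like everything else.

```c
// OpenCL C: a workgroup staging data through local memory (launch with <=512 workitems per group)
__kernel void local_stage(__global const float4 *in, __global float4 *out) {
    __local float4 tile[512];              // 512 * 16 B = 8 KB of "local" memory
    int lid = get_local_id(0);
    int gid = get_global_id(0);

    tile[lid] = in[gid];                   // stage a block of data
    barrier(CLK_LOCAL_MEM_FENCE);          // make it visible to the whole workgroup

    int n = (lid + 1) % get_local_size(0); // neighbor exchange within the workgroup
    out[gid] = tile[lid] + tile[n];
}
```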

GPUs that back local memory with on-chip storage can achieve better latency than Bifrost; that includes Adreno. Qualcomm disclosed that Adreno X1 allocates local memory out of GMEM, and their prior Adreno architectures likely did the same. However, Qualcomm doesn’t necessarily enjoy a bandwidth advantage, because GMEM access similarly appears limited to 32 bytes per cycle.

L2 Cache and System Level

Bifrost’s L2 functionally works like the L2 cache in modern AMD and Nvidia GPUs. It’s a write-back cache built from multiple slices for scalability. Arm expects Bifrost implementations to have 64-128 KB of L2 per Shader Core, up from Midgard’s corresponding 32-64 KB figure. Amlogic has chosen the lower end option, so the Mali-G52 has 128 KB of L2. A hypothetical 32 Shader Core Bifrost GPU may have 2-4 MB of L2.

On the Amlogic S922X, L2 latency from the texture side is slightly better than on Qualcomm’s Adreno 615. However, Adreno 615 enjoys better L2 latency for global memory accesses, because it doesn’t need to check L1 on the way. L2 bandwidth appears somewhat lower than Adreno 615, though Mali-G52 has twice as much L2 capacity at 128 KB. However the Snapdragon 670 has a 1 MB system level cache, which likely mitigates the downsides of a smaller GPU-side L2.

Bifrost shows reasonably good latency when using atomic compare and exchange operations to pass data between threads. It’s faster than Adreno 615 when using global memory, though Qualcomm offers lower latency if you use atomics on local memory.

GPUs often handle atomic operations using dedicated ALUs close to L2 or backing storage for local memory. Throughput for INT32 atomic adds isn’t great compared to Intel, and is very low next to contemporary discrete GPUs from AMD and Nvidia.
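
A sketch of the kind of kernel involved is below. The ping-pong structure is my own illustration, and the two workitems need to land in separate workgroups so the divergent spin loops can’t deadlock within a warp.

```c
// OpenCL C: two workitems bounce a token through atomic compare-exchange in global memory.
// Round-trip time approximates how quickly atomics can pass data between threads.
__kernel void atomic_pingpong(volatile __global int *flag, int iterations) {
    int id = get_global_id(0);   // run with exactly two workitems, in separate workgroups
    for (int i = 0; i < iterations; i++) {
        if (id == 0) {
            while (atomic_cmpxchg(flag, 0, 1) != 0) ;   // hand the token to thread 1
        } else {
            while (atomic_cmpxchg(flag, 1, 0) != 1) ;   // hand it back to thread 0
        }
    }
}
```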

L2 misses head out to the on-chip network, and make their way to the DRAM controller. Amlogic has chosen a 32-bit DDR4-2640 interface for the S922X, which provides 10.56 GB/s of theoretical bandwidth. A chip’s system level architecture is ultimately up to the implementer, not Arm, which can affect Bifrost’s system level feature support.

In alignment with Bifrost’s compute aspirations, Arm designed a L2 that can accept incoming snoops. With a compatible on-chip interconnect and CPU complex, Bifrost can support OpenCL Shared Virtual Memory with fine-grained buffer sharing. That lets the CPU and GPU share data without explicit copying or map/unmap operations. Evidently Amlogic’s setup isn’t compatible, and Mali-G52 only supports coarse-grained buffer sharing. Worse, it appears to copy entire buffers under the hood with map/unmap operations.

Qualcomm on the other hand controls chip design from start to finish; Adreno 615 supports zero-copy behavior, and Qualcomm’s on-chip network has the features needed to make that happen.

While many modern GPUs can support zero-copy data sharing with the CPU, copy performance can still matter. Besides acting as a baseline way of getting data to the GPU and retrieving results, buffers can be de-allocated or reused once their data is copied to the GPU. Copy bandwidth between the host and GPU is low on the Amlogic S922X at just above 2 GB/s. A copy requires both a read and a write, so that would be 4 GB/s of bandwidth. That’s in line with measured global memory bandwidth above, and suggests the DMA units aren’t any better than the shader array when it comes to accessing DRAM bandwidth.
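
For reference, the host-side measurement boils down to timing standard OpenCL buffer writes. The helper below is a simplified sketch (context/queue setup and error checking omitted), not the exact benchmark used here.

```c
// C host code: time host-to-GPU copies with standard OpenCL calls
#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

double copy_bandwidth_gbps(cl_context ctx, cl_command_queue queue, size_t bytes) {
    void *src = malloc(bytes);
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, NULL);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < 10; i++)   // blocking writes, so each copy finishes before the next
        clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, bytes, src, 0, NULL, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    clReleaseMemObject(buf);
    free(src);
    // Each copy reads host memory and writes GPU-visible memory, so DRAM-level
    // traffic on an iGPU is roughly double the figure returned here.
    return (10.0 * bytes) / secs / 1e9;
}
```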

Adreno 615 has better copy performance, possibly helped by its faster LPDDR4X DRAM interface. However, copying data back from the GPU is agonizingly slow. Games largely transfer data to the GPU, not the other way around, so that’s another sign that Adreno is tightly optimized for gaming.

Rasterization and Tiled Rendering

Adreno and Mali share another common problem: mobile devices can’t afford high bandwidth DRAM interfaces compared to desktop CPUs, let alone GPUs with wide GDDR setups. However, graphics rasterization can be bandwidth-hungry, and ROPs can be a significant source of bandwidth pressure. When pixel/fragment shaders output pixel colors to the ROPs, the ROPs have to ensure those results are written in the correct order. That can involve depth testing or alpha blending, as specified by the application.

Both Adreno 6xx and Bifrost reduce ROP-side DRAM traffic using tiled rendering, which splits the screen into rectangular tiles, and renders them one at a time. Tiles are sized to fit within on-chip buffers, which contain intermediate accesses from alpha blending and depth testing. The tile is only written out to DRAM after it’s finished rendering. Tiled rendering requires building per-tile visible triangle lists as vertex shaders finish calculating vertex coordinates. Then, those triangle lists are read back as the GPU rasterizes tiles one by one. Handling triangle lists generates DRAM traffic, which could obviate the benefit of tiled rendering if not handled carefully.

Bifrost uses a hierarchical tiling strategy like Midgard. Tiles are nominally 16×16 pixels, but Arm can use larger power-of-two tile sizes to try containing triangles. Doing so reduces how often a triangle overlaps different tiles and thus is referenced in multiple triangle lists. Compared to Midgard, Arm also redesigned tiler memory structures with finer-grained allocations and no minimum buffer allocations. Finally, Bifrost can eliminate triangles too small to affect pixel output at the tiler stage, which also reduces wasted pixel/fragment shader work. These optimizations can reduce both bandwidth usage and memory footprint between the vertex and pixel shader stages. Arm also optimizes bandwidth usage at the tile writeback stage, where “Transaction Elimination” compares a tile’s CRC with the corresponding tile in the prior frame, and skips tile writeback if they match, an often efficient trade of logic for memory bus usage.
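
As a software analogy (my own illustration, not Arm’s hardware logic), Transaction Elimination boils down to something like the following, with a placeholder checksum standing in for the real CRC:

```c
// C: conceptual model of Transaction Elimination. A per-tile CRC from the previous frame
// is compared against the freshly rendered tile; a match means writeback can be skipped.
#include <stdint.h>
#include <stdbool.h>

#define TILE_PIXELS (16 * 16)

static uint32_t tile_checksum(const uint32_t *p) {   // placeholder, not Arm's actual CRC
    uint32_t h = 2166136261u;                         // FNV-1a style mix
    for (int i = 0; i < TILE_PIXELS; i++) { h ^= p[i]; h *= 16777619u; }
    return h;
}

bool writeback_tile(const uint32_t *tile, uint32_t *prev_crc, uint32_t *framebuffer_tile) {
    uint32_t crc = tile_checksum(tile);
    if (crc == *prev_crc)
        return false;                         // identical to last frame: skip the DRAM write
    for (int i = 0; i < TILE_PIXELS; i++)     // otherwise spend the memory bandwidth
        framebuffer_tile[i] = tile[i];
    *prev_crc = crc;
    return true;
}
```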

Because Bifrost uses 256 bits of tile storage per pixel, tile memory likely has at least 8 KB of capacity. Arm further implies tile memory is attached to each Shader Core, so Mali-G52 may have 16 KB of tile memory across its two Shader Cores. Adreno 615 also uses tiled rendering, and uses 512 KB of tile memory (called GMEM) to hold intermediate tile state.

Compute Performance: FluidX3D

FluidX3D is a GPGPU compute application that simulates fluid behavior. FP16S/FP16C modes help reduce DRAM bandwidth requirements by using 16-bit FP formats for storage. Calculations are still performed on FP32 values to maintain accuracy, with extra instructions used to convert between 16 and 32-bit FP formats.
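
The storage-versus-compute split can be illustrated with OpenCL’s half load/store helpers. This is a simplified sketch of the idea; FluidX3D’s actual kernels are far more involved.

```c
// OpenCL C: store state as FP16 to cut memory traffic, but do the math in FP32.
// vload_half/vstore_half handle the conversions; those extra instructions are the cost.
__kernel void fp16_storage_step(__global half *state, float relax) {
    int gid = get_global_id(0);
    float x = vload_half(gid, state);     // FP16 -> FP32 on load
    x = x + relax * (1.0f - x);           // compute entirely in FP32 for accuracy
    vstore_half(x, gid, state);           // FP32 -> FP16 on store
}
```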

Adreno 615 and Mali-G52 both appear more compute-bound than bandwidth-bound, so FP16 formats don’t help. FluidX3D uses FMA operations by default, which destroys Adreno 615’s performance because it doesn’t have fast-path FMA hardware. Qualcomm does better if FMA operations are replaced by multiply-adds. However, Adreno 615 still turns in an unimpressive result. Despite having more FP32 throughput and more memory bandwidth on paper, it falls behind Mali-G52.

Power Management

Mali-G52 is organized into four power domains. A “GL” domain is always on, and likely lets the GPU listen for power-on commands. Next, a “CG” (Common Graphics?) domain is powered on when the GPU needs to handle 3D work or parallel compute. Next, the Shader Cores (SC0, SC1) are powered on as necessary. Each Shader Core sits on a separate power domain, letting the driver partially power up Mali’s shader array for very light tasks.

Power savings can also come from adjusting clock speed. The Amlogic S922X appears to generate Mali-G52’s clocks from a 2 GHz “FCLK”, using various divider settings.

Final Words

Arm’s business model relies on making its IP blocks attractive to the widest possible variety of implementers. Bifrost fits into that business model thanks to its highly parameterized design and very small building blocks, which make it ideal for hitting very specific power, area, and performance levels within the low power iGPU segment.

Visualizing a hypothetical 32 Shader Core Bifrost GPU

Qualcomm’s Adreno targets similar devices, but uses much larger building blocks. Adreno’s strategy is better suited to scaling up the architecture, which aligns with Qualcomm’s ambition to break into the laptop segment. However, larger building blocks make it more difficult to create tiny GPUs. I feel like Qualcomm scaled down Adreno 6xx with all the knobs they had, then had to cut clock speed to an incredibly low 430 MHz to hit the Snapdragon 670’s design goals.

Visualizing Adreno 690, a large implementation of the Adreno 6xx architecture. Again not sure how many ROPs there are per SP, so there’s a single red bar there as a placeholder

Beyond scaling flexibility, Bifrost widens Arm’s market by aiming for consistent performance across a wide range of workloads. That applies both in comparison to Qualcomm’s Adreno, as well as Arm’s prior Midgard architecture. Bifrost’s execution units are pleasantly free of “glass jaw” performance characteristics, and GPU to CPU copy bandwidth is better than on Adreno. Adreno isn’t marketed as a standalone block, and is more sharply focused on graphics rasterization. Qualcomm might expect common mobile compute tasks to be offloaded to other blocks, such as their Hexagon DSP.

Overall, Bifrost is an interesting look into optimizing a low power GPU design to cover a wide range of applications. It has strong Terascale to GCN energy, though Bifrost really lands at a midpoint between those two extremes. It’s fascinating to see Terascale features like clause-based execution, a software-controlled operand collector, and temporary registers show up on a 2016 GPU architecture. Apparently features that Terascale used to pack teraflops of FP32 compute into 40nm class nodes continue to be useful for hitting very tight power targets on newer nodes. Since Bifrost, Arm has continued to modernize their GPU architecture with a focus on both graphics rasterization and general purpose compute. They’re a fascinating and unique GPU designer, and I look forward to seeing where they go in the future.

If you like the content then consider heading over to the Patreon or PayPal if you want to toss a few bucks to Chips and Cheese. Also consider joining the Discord.

Zhaoxin’s KX-7000

2025-05-01 04:20:16

Zhaoxin is a Chinese x86 CPU designer. The KaiXian KX-7000 is Zhaoxin’s latest CPU, and features a new architecture dubbed “世纪大道”. 世纪大道 is a road in Shanghai called “Century Avenue”, following Zhaoxin’s practice of naming architectures after Shanghai landmarks. Zhaoxin is notable because it’s a joint venture between VIA Technologies and the Shanghai municipal government. It inherits VIA’s x86-64 license, and also enjoys powerful government backing. That’s a potent combination, because Zhaoxin’s cores are positioned to take advantage of the strong x86-64 software ecosystem.

x86-64 compatibility is just one part of the picture, because performance matters too. Zhaoxin’s previous LuJiaZui, implemented in the KX-6640MA, was clearly inadequate for handling modern applications. LuJiaZui was a 2-wide core with sub-3 GHz clock speeds and barely more reordering capacity than Intel’s Pentium II from 1997. Century Avenue takes aim at that performance problem.

Core Overview

Century Avenue is a 4-wide, AVX2 capable core with an out-of-order execution window on par with Intel CPUs from the early 2010s. Besides making the core wider and more latency-tolerant, Zhaoxin targets higher clock speeds. The KX-7000 runs at 3.2 GHz, significantly faster than the KX-6640MA’s 2.6 GHz. Zhaoxin’s site claims the KX-7000 can reach 3.5-3.7 GHz, but I never saw the chip clock above 3.2 GHz.

The KX-7000 has eight Century Avenue cores, and uses a chiplet setup reminiscent of single-CCD AMD Ryzen desktop parts. All eight cores sit on a die and share 32 MB of L3 cache. A second IO die connects to DRAM and other IO. Zhaoxin did not specify what process node they’re using. Techpowerup and Wccftech suggest it uses an unspecified 16nm node.

Frontend

At the frontend, instructions are fetched from a 64 KB 16-way instruction cache. The instruction cache can deliver 16 bytes per cycle, and feeds a 4-wide decoder. Century Avenue uses a thoroughly conventional frontend setup, without a loop buffer or op cache. Instruction cache bandwidth can therefore constrain frontend throughput if average instruction length exceeds 4 bytes.

Frontend bandwidth drops sharply as code spills out of L1i, creating another contrast with 2010s era western designs. Skylake for example can run code from L2 at over 12 bytes per cycle, adequate for >3 IPC with 4 byte instructions. Century Avenue suffers further if code spills into L3, where frontend bandwidth drops to under 4 bytes per cycle.
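
Rough throughput ceilings implied by those fetch rates, assuming 4-byte instructions for the Skylake comparison and ~5 bytes for VEX-heavy AVX2 code:

```c
// Frontend IPC ceilings implied by code fetch bandwidth
#include <stdio.h>

int main(void) {
    // With a 4-byte average instruction length:
    printf("Century Avenue L1i, 16 B/cycle: %.1f IPC ceiling\n", 16.0 / 4.0); // 4.0
    printf("Skylake from L2,    12 B/cycle: %.1f IPC ceiling\n", 12.0 / 4.0); // 3.0
    printf("Century Avenue L3,  <4 B/cycle: %.1f IPC ceiling\n",  4.0 / 4.0); // under 1.0
    // With ~5-byte AVX2 instructions, even L1i fetch caps the 4-wide decoder:
    printf("Century Avenue L1i, 5 B average: %.1f IPC ceiling\n", 16.0 / 5.0); // 3.2
    return 0;
}
```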

A 4096 entry branch target buffer (BTB) provides branch targets, and creates two pipeline bubbles after a taken branch. Taken branch latency jumps as the test spills out of L1i, even with far fewer than 4K branches. The BTB is likely tied to the L1i, and thus can’t be used to do long-distance prefetch past a L1i miss.

Century Avenue’s branching performance is reminiscent of older cores like VIA’s Nano. Dropping zero-bubble branching capability is a regression compared to LuJiaZui, which could do so from a small 16 entry L0 BTB. Perhaps Zhaoxin felt they couldn’t do zero-bubble branching at Century Avenue’s 3 GHz+ clock speed targets. However Intel and AMD CPUs from over a decade ago have faster branch target caching at higher clock speeds.

In Century Avenue’s favor, the direction predictor has vastly improved pattern recognition capabilities compared to its predecessor. When given repeating patterns of taken and not-taken branches, the KX-7000 handles them a bit like Intel’s Sunny Cove.

Returns behave much like on LuJiaZui. Call+return pairs enjoy reasonable latency until they go more than four-deep. An inflection point further on suggests a second level return stack with approximately 32 entries. If there is a second level return stack, it’s rather slow with a cost of 14 cycles per call+return pair. Bulldozer shows more typical behavior. Call+return pairs are fast until they overflow a 24 entry return stack.

Century Avenue’s frontend aims to deliver up to four instructions per cycle with minimal sophistication. A conventional fetch and decode setup can be good if tuned properly, but Century Avenue’s frontend has a few clear weaknesses. Average instruction length can exceed 4 bytes in AVX2 code, thanks to VEX prefixes. AMD tackled this by increasing L1i bandwidth to 32B/cycle in 10h CPUs. Intel used loop buffers in Core 2 before introducing an op cache in Sandy Bridge (while keeping 16B/cycle L1i bandwidth). Either approach is fine, but Century Avenue does neither. Century Avenue also does not implement branch fusion, a technique that AMD and Intel have used for over a decade. An [add, add, cmp, jz] sequence executes at under 3 IPC.

Lack of sophistication extends to branch target caching. A single level BTB with effectively 3 cycle latency feels primitive today, especially when it’s tied to the instruction cache. As before, a decoupled BTB isn’t the only way to go. Apple’s M1 also appears to have a BTB coupled to the L1i, but it compensates with a massive 192 KB L1i. Century Avenue’s 64 KB L1i is larger than the 32 KB instruction caches found on many x86-64 cores, but it stops short of brute-forcing its way around large code footprints the way Apple does. To be fair to Zhaoxin, Bulldozer also combines a 64 KB L1i with poor L2 code bandwidth. However, I don’t think there’s a good excuse for 3 cycle taken branch latency on any post-2024 core, especially one running below 4 GHz.

Rename and Allocate

Micro-ops from the frontend are allocated into backend tracking structures, which carry out bookkeeping necessary for out-of-order execution. Register allocation goes hand-in-hand with register renaming, which breaks false dependencies by allocating a new physical register whenever an instruction writes to one. The rename/allocate stage is also a convenient place to carry out other optimizations and expose more parallelism to the backend.

Century Avenue recognizes zeroing idioms like XOR-ing a register with itself, and can tell the backend that such instructions are independent. However such XORs are still limited to three per cycle, suggesting they use an ALU port. The renamer also allocates a physical register to hold the result, even though it will always be zero. Move elimination works as well, though it’s also limited to three per cycle.

Out-of-Order Execution

Zhaoxin switches to a physical register file (PRF) based execution scheme, moving away from LuJiaZui’s ROB-based setup. Separate register files reduce data transfer within the core, and let designers scale ROB size independently of register file capacity. Both are significant advantages over LuJiaZui, and contribute to Century Avenue having several times as much reordering capacity. With a 192 entry ROB, Century Avenue has a theoretical out-of-order window on par with Intel’s Haswell, AMD’s Zen, and Centaur’s CNS. LuJiaZui’s 48 entry ROB is nowhere close.

Reorder buffer size only puts a cap on how far the backend can search ahead of a stalled instruction. Reordering capacity in practice is limited by whatever resource the core runs out of first, whether that be register files, memory ordering queues, or other structures. Century Avenue’s register files are smaller than Haswell or Zen’s, but the core can keep a reasonable number of branches and memory operations in flight.

Century Avenue has a semi-unified scheduler setup, shifting away from LuJiaZui’s distributed scheme. ALU, memory, and FP/vector operations each have a large scheduler with more than 40 entries. Branches appear to have their own scheduler, though maybe not a dedicated port. I wasn’t able to execute a not-taken jump alongside three integer adds in the same cycle. In any case, Century Avenue has fewer scheduling queues than its predecessor, despite having more execution ports. That makes tuning scheduler size easier, because there are fewer degrees of freedom.

Typically a unified scheduler can achieve similar performance to a distributed one with fewer total entries. An entry in a unified scheduling queue can hold a pending micro-op for any of the scheduler’s ports. That reduces the chance of an individual queue filling up and blocking further incoming instructions even though scheduler entries are available in other queues. With several large multi-ported schedulers, Century Avenue has more scheduler capacity than Haswell, Centaur CNS, or even Skylake.

Execution Units

Three ALU pipes generate results for scalar integer operations. Thus Century Avenue joins Arm’s Neoverse N1 and Intel’s Sandy Bridge in having three ALU ports in an overall four-wide core. Two of Century Avenue’s ALU pipes have integer multipliers. 64-bit integer multiplies have just two-cycle latency, giving the core excellent integer multiply performance.

Century Avenue’s FP/vector side is surprisingly powerful. The FP/vector unit appears to have four pipes, all of which can execute 128-bit vector integer adds. Floating point operations execute at two per cycle. Amazingly, that rate applies even for 256-bit vector FMA instructions. Century Avenue therefore matches Haswell’s per-cycle FLOP count. Floating point latency is normal at 3 cycles for FP adds and multiplies or 5 cycles for a fused multiply-add. Vector integer adds have single-cycle latency.
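
The per-cycle FLOP arithmetic behind the Haswell comparison, counting an FMA as two FLOPs:

```c
// FP32 FLOPs per cycle from two 256-bit FMA pipes
#include <stdio.h>

int main(void) {
    int fma_pipes = 2;
    int fp32_lanes_per_pipe = 256 / 32;                          // 8 lanes per 256-bit pipe
    int flops_per_cycle = fma_pipes * fp32_lanes_per_pipe * 2;   // FMA = 2 FLOPs
    printf("Century Avenue: %d FP32 FLOPs/cycle\n", flops_per_cycle);   // 32, matching Haswell
    return 0;
}
```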

However, the rest of Century Avenue’s execution engine isn’t so enthusiastic about AVX2. Instructions that operate on 256-bit vectors are broken into two 128-bit micro-ops for all the common cases I tested. A 256-bit FP add takes two ROB entries, two scheduler slots, and the result consumes two register file entries. On the memory side, 256-bit loads and stores take two load queue or two store queue entries, respectively. Zhaoxin’s AVX2 approach is the opposite of Zen 4’s AVX-512 strategy: AMD left execution throughput largely unchanged from the prior generation, but its 512-bit register file entries let it keep more work in flight and better feed those execution units. Century Avenue’s approach is to bring execution throughput first, and think about how to feed those units later.

Core Memory Subsystem

Memory accesses start with a pair of address generation units (AGUs), which calculate virtual addresses. The AGUs are fed by 48 scheduler entries, which could be a 48 entry unified scheduler or two 24 entry queues.

48-bit virtual addresses from the AGUs are then translated into 46-bit physical addresses. Data-side address translations are cached in a 96 entry, 6-way set associative data TLB. 2 MB pages use a separate 32 entry, 4-way DTLB. Century Avenue doesn’t report L2 TLB capacity through CPUID, and DTLB misses add ~20 cycles of latency. That’s higher than usual for cores with a second level TLB, except for Bulldozer.

Besides address translation, the load/store unit has to handle memory dependencies. Century Avenue appears to do an initial dependency check using the virtual address, because a load has a false dependency on a store offset by 4 KB. For real dependencies, Century Avenue can do store forwarding with 5 cycle latency. Like many other cores, partial overlaps cause fast forwarding to fail. Century Avenue takes a 22 cycle penalty in that case, which isn’t out of the ordinary. For independent accesses, Century Avenue can do Core 2 style memory disambiguation. That lets a load execute ahead of a store with an unknown address, improving memory pipeline utilization.
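
A minimal C sketch of the access pattern that trips a virtual-address-based dependency check; the buffer layout and iteration counts here are arbitrary choices for illustration:

```c
// C: store and load addresses that differ by exactly 4 KB. A core that checks memory
// dependencies using low virtual address bits will falsely predict a dependency here,
// delaying the load even though the accesses never actually overlap.
#include <stdint.h>
#include <stdlib.h>

int main(void) {
    volatile uint64_t *buf = malloc(64 * 1024 * sizeof(uint64_t));
    uint64_t sum = 0;

    for (int i = 0; i < 1000000; i++) {
        size_t idx = i & 1023;
        buf[idx] = i;              // store to one location
        sum += buf[idx + 512];     // load 512 * 8 B = 4 KB away: same low 12 address bits
    }
    free((void *)buf);
    return (int)(sum & 1);
}
```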

“Misaligned” loads and stores that straddle a cacheline boundary take 12-13 cycles, a heavy penalty compared to modern cores. Skylake for example barely takes any penalty for misaligned loads, and handles misaligned stores with just a single cycle penalty. Century Avenue faces the heaviest penalties (>42 cycles) if a load depends on a misaligned store.

Core-Private Caches

Century Avenue has a 32 KB, 8-way associative data cache with a pair of 128-bit ports and 4 cycle load-to-use latency. Only one port handles stores, so 256-bit stores execute over two cycles. Century Avenue’s L1D bandwidth is therefore similar to Sandy Bridge, even though its FMA capability can demand higher bandwidth. When Intel first rolled out 2×256-bit FMA execution with Haswell, their engineers increased L1D bandwidth to 2×256-bit loads and a 256-bit store per cycle.

L2 latency is unimpressive at 15 cycles. Skylake-X has a larger 2 MB L2 for example, and ran that with 14 cycle latency at higher clock speeds.

Shared Cache and System Architecture

Century Avenue’s system architecture has been overhauled to improve core count scalability. The KX-7000 adopts a triple-level cache setup, aligning with high performance designs from AMD, Arm, and Intel. Core-private L2 caches help insulate L1 misses from high L3 latency. Thus L3 latency becomes less critical, which enables a larger L3 shared across more cores. Compared to LuJiaZui, Century Avenue increases L3 capacity by a factor of eight, going from 4 MB to 32 MB. Eight Century Avenue cores share the L3, while four LuJiaZui cores shared a 4 MB L2. Combined with the chiplet setup, the KX-7000 is built much like a single-CCD Zen 3 desktop part.

Unlike AMD’s recent designs, L3 latency is poor at over 27 ns, or over 80 core cycles. Bandwidth isn’t great either at just over 8 bytes per cycle. A read-modify-write pattern increases bandwidth to 11.5 bytes per cycle. Neither figure is impressive. Skylake could average 15 bytes per cycle from L3 using a read-only pattern, and recent AMD designs can achieve twice that.

The KX-7000 does enjoy good bandwidth scaling, but low clock speeds combined with low per-core bandwidth to start with mean final figures aren’t too impressive. A read-only pattern gets to 215 GB/s, while a read-modify-write pattern can exceed 300 GB/s. For comparison, a Zen 2 CCD enjoys more than twice as much L3 bandwidth.

The KX-7000 does have more L3 bandwidth than Intel’s Skylake-X, at least when testing with matched thread counts. However, Skylake-X has a larger 1 MB L2 cache to insulate the cores from poor L3 performance. Skylake-X is also a server-oriented part, where single-threaded performance is less important. On the client side, Bulldozer has similar L3 latency, but uses an even larger 2 MB L2 to avoid hitting it.

DRAM Access

DRAM performance is poor, with over 200 ns latency even when using 2 MB pages to minimize address translation latency. Latency goes over 240 ns using 4 KB pages, using a 1 GB array in both cases. The KX-7000’s DRAM bandwidth situation is tricky. To start, the memory controller was only able to train to 1600 MT/s, despite using DIMMs with 2666 MT/s JEDEC and 4000 MT/s XMP profiles. Theoretical bandwidth is therefore limited to 25.6 GB/s. However measured read bandwidth gets nowhere close, struggling to get past even 12 GB/s.

Mixing in writes increases achievable bandwidth. A read-modify-write pattern gets over 20 GB/s, while non-temporal writes reach 23.35 GB/s. The latter figure is close to theoretical, and indicates Zhaoxin’s cross-die link has enough bandwidth to saturate the memory controller. Read bandwidth is likely limited by latency. Unlike writes, where data to be written gets handed off, reads can only complete when data returns. Maintaining high read bandwidth requires keeping enough memory requests in-flight to hide latency.
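
A Little’s law style calculation puts numbers on that, using the theoretical peak and the measured latency from above:

```c
// In-flight data needed to sustain DRAM bandwidth at the KX-7000's measured latency
#include <stdio.h>

int main(void) {
    double bw_gbps = 25.6;   // theoretical peak at DDR4-1600
    double lat_ns = 200.0;   // measured DRAM latency
    double inflight_bytes = bw_gbps * lat_ns;   // GB/s * ns = bytes
    printf("Required in flight: %.0f bytes (~%.0f cache lines)\n",
           inflight_bytes, inflight_bytes / 64.0);   // 5120 bytes, 80 lines
    return 0;
}
```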

Often loading more cores lets the memory subsystem keep more requests in flight, because each core has its own L1 and L2 miss queues. However the KX-7000’s read bandwidth abruptly stops scaling once a bandwidth test loads more than two cores. That suggests a queue shared by all the cores doesn’t have enough entries to hide latency, resulting in low read bandwidth.

Taking the best latency/bandwidth combinations across different thread counts

To make things worse, the KX-7000’s memory subsystem doesn’t do well at ensuring fairness between requests coming from different cores. A pointer chasing thread sees latency skyrocket when other cores generate high bandwidth load. In a worst case with one latency test thread and seven bandwidth threads, latency pushes past 1 microsecond. I suspect the bandwidth-hungry threads monopolize entries in whatever shared queue limits read bandwidth.

AMD’s Bulldozer maintains better control over latency under high bandwidth load. The FX-8150’s Northbridge has a complicated setup with two crossbar levels, but does an excellent job. Latency increases as the test pushes up to the memory controller’s bandwidth limits, but doesn’t rise to more than double its un-loaded latency. In absolute terms, even Bulldozer’s worst case latency is better than the KX-7000’s best case.

Sometimes, the memory subsystem has to satisfy a request by retrieving data from a peer core’s cache. These cases are rare in practice, but can give insight into system topology. The KX-7000 posts relatively high but even latency in a core-to-core latency test. Some core pairs see lower latency than others, likely depending on which L3 slice the tested address belongs to.

Single Threaded Performance: SPEC CPU2017

Compared to LuJiaZui, Century Avenue posts a huge 48.8% gain in SPEC CPU2017’s integer suite, and provides more than a 2x speedup in the floating point suite. Zhaoxin has been busy over the past few years, and that work has paid off. Against high performance western x86-64 chips, the KX-7000 falls just short of AMD’s Bulldozer in the integer suite. The FX-8150 leads by 13.6% there. Zhaoxin flips things around in the floating point suite, drawing 10.4% ahead of Bulldozer.

Newer cores like Broadwell or Skylake land on a different performance planet compared to Century Avenue, so Bulldozer is the best relative comparison. Against Bulldozer, Century Avenue tends to do best in higher-IPC tests like 500.perlbench, 548.exchange2, and 525.x264. I suspect Century Avenue’s additional execution resources give it an advantage in those tests. Meanwhile Bulldozer bulldozes the KX-7000 in low IPC tests like 505.mcf and 520.omnetpp. Those tests present a nasty cocktail of difficult-to-predict branches and large memory footprints. Bulldozer’s comparatively strong memory subsystem and faster branch predictor likely give it a win there.

SPEC CPU2017’s floating point suite generally consists of higher IPC workloads, which hands the advantage to the KX-7000. However, the FX-8150 snatches occasional victories. 549.fotonik3d is a challenging low IPC workload that sees even recent cores heavily limited by cache misses. Bulldozer walks away with an impressive 46.2% lead in that workload. At the other end, 538.imagick basically doesn’t see L2 misses.

Overall the SPEC CPU2017 results suggest the KX-7000 can deliver single-threaded performance roughly on par with AMD’s Bulldozer.

Multithreaded Performance

Having eight cores is one of the KX-7000’s relative strengths against the FX-8150 and Core i5-6600K. However, multithreaded results are a mixed bag. libx264 software video encoding can take advantage of AVX2, and uses more than four threads. However, the KX-7000 is soundly beaten even by Bulldozer. 7-Zip compression uses scalar integer instructions. With AVX2 not playing a role, Bulldozer and the Core i5-6600K score even larger wins.

The KX-7000 turns in a better performance in Y-Cruncher, possibly with AVX2 giving it a large advantage over Bulldozer. However, eight Century Avenue cores still fail to match four Skylake ones. For a final test, OpenSSL RSA2048 signs are a purely integer operation that focuses on core compute power rather than memory access. They’re particularly important for web servers, which have to validate their identity when clients establish SSL/TLS connections. Zhaoxin again beats Bulldozer in that workload, but falls behind Skylake.

Final Words

Zhaoxin inherits VIA’s x86 license, but plays a different ball game. VIA focused on low-power, low-cost applications. While Centaur CNS did branch into somewhat higher performance targets with a 4-wide design, the company never sought to tap into the wider general purpose compute market like AMD and Intel. Creating a high-clocking, high-IPC core that excels in everything from web browsing to gaming to video encoding is a massive engineering challenge. VIA reasonably decided to find a niche, rather than take AMD and Intel head-on without the engineering resources to match.

However Zhaoxin is part of China’s effort to build domestic chips in case western ones become unavailable. Doing so is a matter of national importance, so companies like Zhaoxin can expect massive government support, and survive even without being profitable. Zhaoxin’s chips don’t need to directly compete with AMD and Intel. But AMD and Intel’s chips have driven performance expectations from application developers. China needs chips with enough performance to substitute western chips without being disruptively slow.

Century Avenue is an obvious attempt to get into that position, stepping away from LuJiaZui’s low power and low performance design. At a high level, Century Avenue represents good progress. A 4-wide >3 GHz core with Bulldozer-level performance is a huge step up. At a lower level, it feels like Zhaoxin tried to make everything bigger without slowing down and making sure the whole picture makes sense. Century Avenue has 2×256-bit FMA units, which suggest Zhaoxin is trying to get the most out of AVX2. However Century Avenue has low cache bandwidth and internally tracks 256-bit instructions as a pair of micro-ops. Doing so suits a minimum-cost AVX2 implementation geared towards compatibility rather than high performance. Besides AVX2, Century Avenue has small register files relative to its ROB capacity, which hinders its ability to make use of its theoretical out-of-order window.

Zooming out to the system level shows the same pattern. Century Avenue’s L2 is too small considering it has to shield cores from 80+ cycle L3 latency. The KX-7000’s DRAM read bandwidth is inadequate for an octa-core setup, and the memory subsystem does a poor job of ensuring fairness under high bandwidth load. Besides unbalanced characteristics, Century Avenue’s high frontend latency and lack of branch fusion make it feel like a 2005-era core, not a 2025 one.

Ultimately performance is what matters to an end-user. In that respect, the KX-7000 sometimes falls behind Bulldozer in multithreaded workloads. It’s disappointing from the perspective that Bulldozer is a 2011-era design, with pairs of hardware threads sharing a frontend and floating point unit. Single-threaded performance is similarly unimpressive. It roughly matches Bulldozer there, but the FX-8150’s single-threaded performance was one of its greatest weaknesses even back in 2011. But of course, the KX-7000 isn’t trying to impress western consumers. It’s trying to provide a usable experience without relying on foreign companies. In that respect, Bulldozer-level single-threaded performance is plenty. And while Century Avenue lacks the balance and sophistication that a modern AMD, Arm, or Intel core is likely to display, it’s a good step in Zhaoxin’s effort to break into higher performance targets.

If you like the content then consider heading over to the Patreon or PayPal if you want to toss a few bucks to Chips and Cheese. Also consider joining the Discord.

RDNA 4’s Raytracing Improvements

2025-04-15 03:12:10

Raytraced effects have gained increasing adoption in AAA titles, adding an extra graphics quality tier beyond traditional “ultra” settings. AMD has continuously advanced their raytracing implementation in response. Often, this involved fitting the GPU’s general architecture to the characteristics of raytracing workloads. On RDNA 4, examples include “out-of-order” memory access and dynamic register allocation. Both are targeted at raytracing, but other applications can benefit too, though perhaps not to the same degree.

RX 9070 Hellhound, with focus on the minimal LED lighting

Compared to RDNA 3, RDNA 4’s RT IP 3.1 gets doubled intersection testing, oriented bounding boxes, primitive compression, and other goodies. Fixed function hardware is tightly tied to predefined data structures, and AMD has unsurprisingly updated those data structures to utilize those updated features.

Doubled Intersection Engines, Wider BVH

RDNA 4’s doubled intersection test throughput internally comes from putting two Intersection Engines in each Ray Accelerator. RDNA 2 and RDNA 3 Ray Accelerators presumably had a single Intersection Engine, capable of four box tests or one triangle test per cycle. RDNA 4’s two intersection engines together can do eight box tests or two triangle tests per cycle. A wider BVH is critical to utilizing that extra throughput.

Raytracing uses a Bounding Volume Hierarchy (BVH) that recursively subdivides scene geometry. Each node represents a 3D box-shaped space, and links to sub-boxes. Intersection tests determine which link (child) to follow, until traversal reaches the bottom of the tree where the node contains triangle data instead of links to sub-boxes. Each traversal step therefore scopes down the intersection search to a smaller box. More intersection test throughput can speed up this traversal process.

But speeding up traversal isn’t as simple as doubling up intersection test throughput. Each traversal step is a pointer chasing operation, which incurs memory latency. GPUs have high cache and DRAM latency compared to a CPU, but excel at parallel compute. RDNA 4 moves to 8-wide box nodes, up from 4-wide ones in RDNA 2 and 3. A wider box node presents more parallel work at each step. More importantly, it allows a “fatter” tree that requires fewer traversal steps to reach the bottom. Thus the 8-wide BVH shifts emphasis from latency to throughput, avoiding a key GPU weakness.
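
To make the trade-off concrete, here is a conceptual C sketch of stack-based BVH traversal. It illustrates the general algorithm only; the node layout, fixed-depth stack, and helper stubs are my own simplifications, not AMD’s hardware or driver implementation.

```c
// C: conceptual stack-based BVH traversal. Each pop is a pointer-chasing (latency-bound)
// step; with 8 children per node, each step does more box tests and the tree is shallower.
#include <stdbool.h>

#define BVH_WIDTH 8   // 4-wide box nodes on RDNA 2/3, 8-wide on RDNA 4

typedef struct BvhNode {
    float box[BVH_WIDTH][6];            // per-child bounds: min x,y,z then max x,y,z
    struct BvhNode *child[BVH_WIDTH];   // NULL = empty slot
    bool is_leaf;                       // leaf nodes hold triangle data (omitted here)
} BvhNode;

// Standard slab test: does the ray (origin o, inverse direction inv_d) hit box b?
static bool ray_hits_box(const float o[3], const float inv_d[3], const float b[6]) {
    float tmin = 0.0f, tmax = 1e30f;
    for (int a = 0; a < 3; a++) {
        float t0 = (b[a] - o[a]) * inv_d[a];
        float t1 = (b[a + 3] - o[a]) * inv_d[a];
        if (t0 > t1) { float t = t0; t0 = t1; t1 = t; }
        if (t0 > tmin) tmin = t0;
        if (t1 < tmax) tmax = t1;
    }
    return tmin <= tmax;
}

static bool ray_hits_triangles(const BvhNode *leaf) { (void)leaf; return false; } // stub

bool trace(const float o[3], const float inv_d[3], BvhNode *root) {
    BvhNode *stack[64];   // the hardware keeps its traversal stack in the LDS
    int sp = 0;
    stack[sp++] = root;
    while (sp > 0) {
        BvhNode *node = stack[--sp];          // one traversal step per pop
        if (node->is_leaf) {
            if (ray_hits_triangles(node)) return true;
            continue;
        }
        for (int i = 0; i < BVH_WIDTH && sp < 64; i++)   // up to 8 box tests per step
            if (node->child[i] && ray_hits_box(o, inv_d, node->box[i]))
                stack[sp++] = node->child[i];
    }
    return false;
}
```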

Since RDNA 2, AMD has used a conservative raytracing strategy where a shader program controls the raytracing process from ray generation to result shading. During BVH traversal, a raytracing shader accesses RDNA 4’s Ray Accelerator with a new IMAGE_BVH8_INTERSECT_RAY instruction. This instruction takes a ray and a pointer to an 8-wide BVH node, and uses both Intersection Engines together. Output from both Intersection Engines heads to a sorting unit, which can either sort the two 4-wide halves separately or sort across all eight results with a “wide sort” option. To speed up traversal, AMD has had hardware traversal stack management in the LDS since RDNA 3. LDS stack management gets updated in RDNA 4 with a new DS_BVH_STACK_PUSH8_POP1_RTN_B32 instruction1.

An 8-wide BVH isn’t the only way to use RDNA 4’s doubled intersection test throughput. RDNA 4 adds an IMAGE_BVH_DUAL_INTERSECT_RAY instruction, which takes a pair of 4-wide nodes and also uses both Intersection Engines. Like the BVH8 instruction, IMAGE_BVH_DUAL_INTERSECT_RAY produces two pairs of 4 intersection test results and can intermix the eight results with a “wide sort” option. The traversal side likewise gets a DS_BVH_STACK_PUSH8_POP2_RTN_B64 instruction. AMD doesn’t describe the “BVH4x2” traversal algorithm, but it’s not hard to imagine what it is, considering what the two instructions above do. A ray can intersect multiple bounding boxes, creating multiple traversal paths. BVH4x2 almost certainly takes two of those paths in parallel, with two paths popped from the LDS and tested in the Ray Accelerator with each traversal step.

So far I’ve only seen AMD generate 8-wide BVH-es for RDNA 4. That includes DirectX 12’s procedural geometry example, 3DMark tests, Cyberpunk 2077, Elden Ring, GTA V Enhanced Edition, and Quake 2 RTX. BVH4x2 traversal is less efficient than using an 8-wide BVH, because it requires more memory accesses and generates more LDS traffic. Furthermore, BVH4x2 relies on having at least two valid traversal paths to fully feed the Ray Accelerator, and how often that’s true may vary wildly depending on the ray in question. I’m not sure why AMD added a way to utilize both Intersection Engines with a 4-wide BVH.

Oriented Bounding Boxes

BVH-es traditionally use axis aligned bounding boxes (AABBs), meaning a box’s boundaries are aligned with the 3D world’s x, y, and z axes. Axis aligned boxes simplify intersection tests. However, game geometry is often not axis aligned. In those cases, an axis aligned box may end up much larger than the geometry it’s trying to contain, creating a lot of empty space. Rays that intersect the empty space take useless traversal steps into the box before further intersection tests reveal the path was a dead end.

Chains hanging from Elden Ring’s Debate Parlor ceiling (RDNA 4 on the left and RDNA 2 on the right). RDNA 4 OBBs fit geometry better than axis-aligned boxes on RDNA 2. Screenshots are from Radeon Raytracing Analyzer with matched traversal counts

RDNA 4 addresses this with Oriented Bounding Boxes (OBBs), where the bounding box is rotated to better approximate geometry within it. Storing a 3×3 3D rotation matrix for each box would dramatically increase memory usage. RDNA 4 therefore strikes a compromise. Each box node only specifies one OBB rotation for all of its children. For further storage savings, the OBB matrix is not stored in the node at all. Instead, the node stores an OBB matrix index, which is looked up in a table of 104 predefined matrices. In code, the matrix is stored as 9 packed 6-bit indices, which refer to entries in a second level lookup table with 26 unique FP32 values3.

AMD’s OBB strategy therefore takes a best-effort approach to reduce useless box intersections with minimal storage cost. It’s not trying to generate perfectly optimal bounding box rotations every time. Thanks to this strategy, 8-wide box nodes remain reasonably sized at 128 bytes, and continue to match RDNA’s cacheline length. While the lookup table specified in code doesn’t necessarily show the exact hardware implementation, a simplistic calculation indicates the lookup table would use about 800 bytes of storage. That would make it small enough to fit in a small ROM within the Ray Accelerator.
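The rough 800 byte figure is easy to sanity check. Here's a quick calculation, assuming the table layout the shader code describes: 104 matrices of nine packed 6-bit indices, plus a second-level table of 26 unique FP32 values.

```cpp
#include <cstdio>

int main() {
    // Sizes as described in AMD's GPU Ray Tracing Library (assumed layout).
    const int matrices      = 104; // predefined OBB rotation matrices
    const int indices_per   = 9;   // one 6-bit index per 3x3 matrix element
    const int index_bits    = 6;
    const int unique_values = 26;  // second-level table of unique FP32 values

    const int matrix_table_bytes = matrices * indices_per * index_bits / 8; // 702
    const int value_table_bytes  = unique_values * 4;                       // 104
    std::printf("OBB lookup tables: %d + %d = %d bytes\n",
                matrix_table_bytes, value_table_bytes,
                matrix_table_bytes + value_table_bytes);                    // ~806
    return 0;
}
```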

In Elden Ring, Y is the up axis, not Z

Compromises of course leave room for improvement. Chains hanging from the ceiling in Elden Ring’s Debate Parlor present an optimal case for RDNA 4’s OBBs, because one rotation can fit well across all of the chains on each side. The chandeliers are another story, with four support chains that each have different rotations. AMD selects an OBB that matches chain rotation in the Y (up) axis, but not X/Z. As a result the OBBs for each chain leave significant empty space that could produce false intersections. It’s still better than RDNA 2’s axis-aligned bounding boxes, but there’s clearly room for improvement.

Using one OBB rotation across the box is likely a necessary compromise. A simple way to handle OBB intersection testing is to rotate the incoming ray so that the box and ray become axis aligned (just with a different set of axes). RDNA 4 adds a ray transform block to the Ray Accelerator. AMD doesn’t explicitly say the transform block helps with OBBs. Instead, it’s used for transitions between top and bottom level acceleration structures (TLAS/BLAS). DirectX raytracing splits the BVH into two levels because a BLAS can be reused many times with different positions and rotations. That’s convenient for handling several copies of the same object, like the same chair type placed several times around a room.

…the addition of a dedicated ray transform block, and this is used to offload the transformation that occurs as you transition from the top level of the ray acceleration structure into the bottom level…

AMD Radeon Press Briefing Video

A TLAS to BLAS transition would involve rotating the ray to allow axis-aligned intersection tests within the BLAS, a similar operation to rotating a ray for OBB tests. A key difference is that traversal steps involving OBBs may happen more often than TLAS to BLAS transitions. AMD’s Ray Accelerator aims to handle a box node every cycle. Transforming a ray would involve multiplying both the origin and direction vector by a 3×3 rotation matrix, which naively requires 36 FLOPs (Floating Point Operations) per transform. At 2.5 GHz that would be 5.04 ray transformation TFLOP/s across the RX 9070’s ray accelerators. Providing different ray rotations for all eight boxes in a box node would multiply that figure by eight.
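Here's a quick sketch of that estimate, counting each 3x3 matrix-vector product as nine multiply-adds, and assuming the RX 9070's 56 Ray Accelerators (one per CU, a figure not stated above) each transform one ray per cycle at 2.5 GHz.

```cpp
#include <cstdio>

int main() {
    // One 3x3 rotation applied to both the ray origin and direction,
    // counted as 9 multiply-adds (18 FLOPs) per vector.
    const double flops_per_transform = 2 * 9 * 2;   // 36 FLOPs
    const double clock_hz            = 2.5e9;       // assumed 2.5 GHz
    const double ray_accelerators    = 56;          // RX 9070: one per CU

    const double tflops = flops_per_transform * clock_hz * ray_accelerators / 1e12;
    std::printf("%.2f TFLOP/s of ray transform work\n", tflops);   // ~5.04
    return 0;
}
```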

AMD could also implement OBB-related ray transforms within the Intersection Engines. RDNA 4 can use OBBs with RDNA 2/3 legacy 4-wide box nodes, slipping the OBB matrix ID into previously unused padding bits near the end of the 128B structure. BVH4x2 traversal may run into two nodes that specify different OBB rotations. That would require applying two ray transformations to fully feed both Intersection Engines. Even in that case, two ray transforms are far less expensive than eight, so AMD’s compromise makes sense.

Primitive Node Compression

Minimizing BVH footprint is critical. A smaller BVH reduces VRAM consumption, cuts bandwidth requirements, and makes more economical use of cache capacity. RDNA 2 and RDNA 3 took a basic step with compressed triangle pairs, where a 64 byte triangle node could store a pair of triangles that share a side. Intel also does this. RDNA 4 goes further by packing multiple triangle pairs into a new 128 byte compressed primitive node.

Intel calls single triangles “degenerates”, and I’m using that term for convenience.

RDNA 4’s compressed primitive node only stores unique vertices across its triangle pairs. Further gains come from finding the minimum number of trailing zeroes across the vertex coordinates’ bitwise FP32 representations, and dropping those trailing zeroes from storage5. Code suggests an RDNA 4 primitive node can describe up to eight triangle pairs or up to 16 unique vertices.

Shared vertices marked from the example above. Sharing is caring and allows for better primitive node compression

In practice compression efficiency varies wildly depending on the nature of game geometry, but RDNA 4 often represents more than two triangle pairs in a 128 byte primitive node. While not mentioned in AMD’s presentation, RDNA 4 represents box extents with quantized 12-bit integers instead of FP32 values6. That lets RDNA 4 keep its 8-wide box node at 128 bytes, just like RDNA 2/3’s 4-wide box nodes.

Not to scale across the two, but RDNA 2/3’s triangle node places elements at well defined offsets, which allows for cheaper hardware. RDNA 4 of course supports the RDNA 2/3 format too, since hardware retains support for the older IMAGE_BVH_INTERSECT_RAY instruction

Like OBBs, primitive compression increases hardware complexity. Compressed primitive nodes don’t require additional compute in the Ray Accelerator. However, they do force it to handle non-aligned, variable length data fields. An Intersection Engine would have to parse the 52-bit header before it knows the data section’s format. Then, trailing zero compression would require shifting the packed values to reconstruct the original FP32 values. Reducing memory footprint often comes with extra hardware complexity. In a latency critical application like raytracing, placing a higher burden on hardware is probably worth it.
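As an illustration of the trailing-zero idea, here's a minimal sketch: find the smallest trailing-zero count across the coordinates' FP32 bit patterns, and only the remaining high bits need to be stored. This is a simplification of what PrimitiveStructureEncoder3_1.hlsl actually does; the decompressor would shift the packed bits back up to reconstruct full FP32 values.

```cpp
#include <algorithm>
#include <bit>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>

// How many low bits can be dropped from every coordinate without losing data.
int droppable_bits(const std::vector<float>& coords) {
    int min_trailing = 32;
    for (float f : coords) {
        uint32_t bits;
        std::memcpy(&bits, &f, sizeof(bits));             // FP32 bit pattern
        min_trailing = std::min(min_trailing, std::countr_zero(bits));
    }
    return min_trailing;
}

int main() {
    // "Round" vertex coordinates tend to have many trailing zeroes in their
    // FP32 representations, so they compress well.
    std::vector<float> coords = {1.0f, 2.5f, -4.0f, 0.5f, 8.0f, 3.0f};
    int dropped = droppable_bits(coords);
    std::printf("store each coordinate in %d bits instead of 32\n", 32 - dropped);
    return 0;
}
```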

BVH Optimizations in Practice

In Elden Ring and 3DMark’s Port Royal benchmark, AMD’s Radeon Raytracing Analyzer indicates that RDNA 4 achieves a decent BVH size reduction. The same curiously doesn’t apply to Cyberpunk 2077. However, Cyberpunk 2077 has a more dynamic environment with unpredictable NPC counts and movement paths, so the margin of error is definitely higher.

RDNA 4’s biggest wins come from reducing traversal step count. Lining up corresponding DispatchRays calls shows RDNA 4 goes through fewer traversal steps per ray. Cyberpunk 2077 is a particularly good case. RDNA 4 is still doing more intersection tests overall, because each traversal step requires eight intersection tests compared to four on RDNA 2, and traversal step count isn’t halved or lower. The additional work is well worth it though. GPUs aren’t latency optimized, so trading latency-bound pointer chasing steps for more parallel compute requirements is a good strategy. Gains in Elden Ring are minor by comparison, but any reduction is welcome considering high GPU cache latency.

In a frame captured from 3DMark’s DXR feature test, which raytraces an entire scene with minimal rasterization, the Radeon RX 9070 sustained 111.76G and 19.61G box and triangle tests per second, respectively. For comparison the RDNA 2 based Radeon RX 6900XT did 38.8G and 10.76G box and triangle tests per second. Ballparking Ray Accelerator utilization is difficult due to variable clock speeds on both cards. But assuming 2.5 GHz gives 24% and 10.23% utilization figures for RDNA 4 and RDNA 2’s Ray Accelerators. RDNA 4 is therefore able to feed its bigger Ray Accelerator better than RDNA 2 could. AMD has done a lot since their first generation raytracing implementation, and the cumulative progress is impressive.

Final Words

RDNA 2 introduced AMD’s first hardware raytracing implementation in the PC scene. It took a conservative approach to raytracing by accelerating intersection testing but little else. AMD has made steady progress since then, shaping GPU hardware and the raytracing workload to match each other. RDNA 4 continues to use RDNA 2’s high level raytracing strategy, with a compute thread managing the raytracing process all the way from ray generation to traversal to result handling. But over several generations, AMD’s engineers have piled on improvements that put RDNA 4 leagues ahead.

AMD’s tools like the Radeon Raytracing Analyzer and Radeon GPU Profiler provide a fascinating look into how those improvements work together under the hood. Open source code further paints a picture where AMD is working nonstop on their raytracing hardware. Unused (possibly PlayStation related) RT IP 1.0 and 3.0 levels provide more snapshots into AMD’s hardware raytracing evolution.

Debate Parlor in Elden Ring viewed through Radeon Raytracing Analyzer, with coloring to reflect traversal counts with matched color scales. RDNA 4 on the left, RDNA 2 on the right

Still, RDNA 4 has room for improvement. OBBs could be more flexible, and first level caches could be larger. Intel and Nvidia are obvious competitors too. Intel has revealed a lot about their raytracing implementation, and no raytracing discussion would be complete without keeping them in context. Intel’s Raytracing Accelerator (RTA) takes ownership of the traversal process and is tightly optimized for it, with a dedicated BVH cache and short stack kept in internal registers. It’s a larger hardware investment that doesn’t benefit general workloads, but does let Intel even more closely fit fixed function hardware to raytracing demands. Besides the obvious advantage from using dedicated caches/registers instead of RDNA 4’s general purpose caches and local data share, Intel can keep traversal off Xe Core thread slots, leaving them free for ray generation or result handling.

AMD’s approach has advantages of its own. Avoiding thread launches between raytracing pipeline steps can reduce latency. And raytracing code running on the programmable shader pipelines naturally takes advantage of their ability to track massive thread-level parallelism. As RDNA 4 and Intel’s Battlemage have shown, there’s plenty of room to improve within both strategies. I’m excited to see how everything plays out as AMD, Intel, and Nvidia evolve their raytracing implementations.

If you like the content then consider heading over to the Patreon or PayPal if you want to toss a few bucks to Chips and Cheese. Also consider joining the Discord.

References

  1. “RDNA4” Instruction Set Architecture Reference Guide

  2. primitiveNode.hlsli in AMD’s GPU Ray Tracing Library

  3. OrientedBoundingBoxes.hlsl and ObbCommon.hlsl in AMD’s GPU Ray Tracing Library

  4. EncodeHwBVH3_1.hlsl in AMD’s GPU Ray Tracing Library

  5. PrimitiveStructureEncoder3_1.hlsl in AMD’s GPU Ray Tracing Library. Describes trailing zero compression, and ComputeCompressedRanges has each lane find its vertex in the LDS during BVH build, set corresponding bits in a bitmask, then count set bits to find the unique vertex count

  6. Multiple RT IP 3.1 BVH building functions call ComputeQuantizedBounds with numQuantBits=12, which then calls ComputeQuantizedMin/ComputeQuantizedMax, which appears to quantize to a 12-bit integer because the max value is (1 << numQuantBits -1) * 1.0f. It’s the maximum integer value that’ll fit in the specified quantization bits multiplied by a FP value of 1 to provide the max quantized value as a float.

  7. BoxNode1_0.hlsli, defines a 4-wide box node for RT IP 1.0

Dynamic Register Allocation on AMD's RDNA 4 GPU Architecture

2025-04-06 01:38:25

Modern GPUs often make a difficult tradeoff between occupancy (active thread count) and register count available to each thread. Higher occupancy provides more thread level parallelism to hide latency with, just as more SMT threads help hide latency on a CPU. But while a CPU can use all of its SMT threads regardless of what code it's running, the same doesn't apply to GPUs. GPU ISAs offer a large number of very wide vector registers. Storing all registers for all thread slots would be impractical because register files must balance capacity with speed and die area usage.

For example, RDNA 4's ISA lets instructions address up to 256 vector general purpose registers (VGPRs). Each register is 1024 bits wide in wave32 mode, and each RDNA 4 SIMD has 16 thread slots. The SIMD would need a 512 KB register file to hold 256 registers for all 16 threads. In practice register requirements vary across different GPU workloads. RDNA 4, like many other GPUs, uses a smaller register file and allocates depending on what threads require. Code that needs a lot of registers can have them at the cost of less thread-level parallelism, while code that uses fewer registers can run more active threads and be less sensitive to latency. RDNA 4 desktop GPUs have a 192 KB register file per SIMD, so a GPU kernel can use all thread slots (achieve maximum occupancy) if it uses 96 or fewer vector registers.
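Those numbers fall straight out of the register file geometry; a quick check:

```cpp
#include <cstdio>

int main() {
    const int vgpr_bytes    = 1024 / 8;   // one wave32 VGPR: 32 lanes x 32 bits
    const int max_vgprs     = 256;        // per-thread architectural limit
    const int thread_slots  = 16;         // per SIMD
    const int regfile_bytes = 192 * 1024; // desktop RDNA 4 SIMD register file

    // Holding 256 VGPRs for every thread slot would need:
    std::printf("Worst case: %d KB\n",
                vgpr_bytes * max_vgprs * thread_slots / 1024);       // 512 KB
    // VGPR budget per thread at full occupancy with the real 192 KB file:
    std::printf("Full occupancy budget: %d VGPRs per thread\n",
                regfile_bytes / thread_slots / vgpr_bytes);          // 96
    return 0;
}
```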

A bigger register file obviously improves the occupancy and register usage tradeoff situation. RDNA increased SIMD register file capacity to 128 KB, up from 64 KB on GCN. RDNA 3 introduced a 192 KB register file configuration for high end GPUs, where die area is likely less of a concern. But that strategy isn’t efficient for raytracing.

AMD notes that ray traversal and hit/miss handling have different VGPR requirements. AMD uses an inline raytracing model where all raytracing stages run within the same thread. A raytracing shader’s VGPR allocation has to be set to the maximum that any stage requires, because a thread’s register allocation remains static throughout its lifetime. Even if code that needs a lot of registers only accounts for a small part of execution time, that high VGPR allocation will limit active thread count for the duration of the workload. Raytracing is particularly latency sensitive, and AMD would like to run as many threads (rays) in parallel as possible to help absorb latency.

Dynamic Register Allocation

Therefore RDNA 4 introduces a new dynamic VGPR allocation mode. In this mode, a thread starts with a minimum VGPR allocation and changes it throughout its lifetime. Rather than specify how many VGPRs a shader will use, the driver tells the GPU to launch it in dynamic VGPR mode. A chip-wide SQ_DYN_VGPR register directly sets active thread count per SIMD, or occupancy, rather than having that inferred from shader VGPR usage. SQ_DYN_VGPR also controls other dynamic VGPR mode parameters, like VGPR allocation block size and deadlock avoidance mode.

As defined in Linux kernel code. I couldn’t find references/usages in either Linux or LLVM, so I’m guessing what each field does

Each enabled thread slot gets a single reserved VGPR block, and a newly launched thread starts with just that VGPR block active. When the thread needs more registers, it requests a new VGPR count using a s_alloc_vgpr instruction. s_alloc_vgpr attempts to allocate more registers if called with a value higher than the current allocation, or frees registers if called with a lower value. Changing VGPR allocation affects the upper end of the usable VGPR range, just like with non-dynamic VGPR allocation. Hardware hands out VGPRs in blocks of 16 or 32, depending on how the driver sets up SQ_DYN_VGPR. A thread can allocate up to eight blocks, so the driver must select the larger block size and give up some allocation granularity if a thread needs to use more than 128 VGPRs.

Allocation requests don’t always succeed. s_alloc_vgpr sets the Scalar Condition Code (SCC) to indicate success, or clears it on failure. SCC is analogous to a flag register on CPUs, and is used for branching and add-with-carry. Shader code has to check SCC to determine if an allocation request succeeded. If an allocation request fails, a shader could in theory try to find other useful work to do while periodically retrying the allocation. But doing so would be quite complex, so in practice a shader will busy-wait until allocation succeeds.
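Here's a toy model of that interface, meant to show the mechanics rather than the hardware: registers are handed out in fixed-size blocks, an s_alloc_vgpr-style request grows or shrinks the caller's allocation, a boolean return value stands in for SCC, and a failed grow leaves the caller busy-waiting. The block counts, pool size, and everything else in the example are made up for illustration.

```cpp
#include <cstdio>

// Toy model of a SIMD's dynamic VGPR pool. Real hardware hands out blocks of
// 16 or 32 registers; s_alloc_vgpr reports success or failure through SCC.
class VgprPool {
public:
    VgprPool(int total_blocks, int block_size)
        : free_blocks_(total_blocks), block_size_(block_size) {}

    // Models s_alloc_vgpr: request a new total VGPR count for one thread.
    // Returns true on success (SCC=1), false if the pool can't cover the growth.
    bool alloc(int& thread_blocks, int requested_vgprs) {
        int wanted = (requested_vgprs + block_size_ - 1) / block_size_;
        int delta  = wanted - thread_blocks;
        if (delta > free_blocks_) return false;  // grow failed
        free_blocks_ -= delta;                   // a shrink (delta < 0) frees blocks
        thread_blocks = wanted;
        return true;
    }

private:
    int free_blocks_;
    int block_size_;
};

int main() {
    VgprPool pool(/*total_blocks=*/48, /*block_size=*/16);
    int thread_blocks = 1;  // each enabled slot starts with one reserved block
    // In practice a shader busy-waits on failure:
    while (!pool.alloc(thread_blocks, 128)) { /* retry until registers free up */ }
    std::printf("Thread now holds %d blocks\n", thread_blocks);
    return 0;
}
```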

Example of dynamic register allocation used in the DirectX procedural geometry example

Therefore dynamic VGPR mode turns the occupancy question on its head. A SIMD can have as many active threads as the driver feels like, regardless of register allocation. But theoretical occupancy doesn’t tell the whole story. Threads can still get blocked waiting on VGPR allocation. A SIMD could have all thread slots filled, but some of those threads could be busy-waiting on VGPR allocation rather than making useful progress.

Deadlock Avoidance

Busy-waiting can become more than a performance inconvenience. Dynamic VGPR allocation can lead to deadlock. AMD knows this, and describes how that can happen in RDNA 4’s ISA manual.

I think a deadlock case can be more general than what AMD describes. If every thread in a SIMD needs to allocate more registers, but hardware doesn’t have enough free registers to satisfy any single request, every thread will get stuck forever. That’s a deadlock, even if some registers technically remain free.

AMD mitigates some deadlock scenarios with a deadlock avoidance mode. The ISA manual is light on details, only saying it reserves just enough VGPRs for one thread to reach maximum VGPR allocation at all times. Each thread can allocate up to eight VGPR blocks, and one block comes reserved with the thread slot, so deadlock avoidance mode would reserve 7 VGPR blocks. I believe deadlock avoidance mode works by only allowing one thread to allocate registers from the reserved pool at a time. In short:

  1. Base case: No reserved registers allocated. Any request can proceed

  2. From (1), any combination of allocation requests from all threads will allow at least one thread (say thread A) to succeed

  3. From (2), no other thread can allocate from the reserved pool, allowing thread A to increase its register allocation to the maximum should it need to.

  4. Eventually A will leave its high register usage code section, or terminate completely, and thus free up registers for other threads to do the same.

Needless to say, that situation isn’t great for performance because it can serialize useful work across threads. But getting to the finish line slowly is better than not getting there at all.
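To illustrate how that reasoning plays out, here's a toy model of the reserved pool under my interpretation above, where only one thread may hold reserved-pool blocks at a time. This is speculation about the mechanism dressed up as code, not a description of the actual hardware.

```cpp
#include <cstdio>

// Toy model of the speculated deadlock avoidance rule: 7 blocks are held in
// reserve, and only one thread may have reserved-pool blocks outstanding at a
// time, guaranteeing that thread can always grow to the 8-block maximum.
struct AvoidancePool {
    int free_blocks;                // normal pool
    int reserved_free  = 7;         // enough for one thread to reach 8 blocks
    int reserve_holder = -1;        // thread currently using the reserve, if any

    bool alloc(int thread, int blocks_needed) {
        if (blocks_needed <= free_blocks) {          // normal path
            free_blocks -= blocks_needed;
            return true;
        }
        if (reserve_holder != -1 && reserve_holder != thread)
            return false;                            // reserve busy: caller waits
        int from_reserve = blocks_needed - free_blocks;
        if (from_reserve > reserved_free) return false;
        reserved_free -= from_reserve;
        free_blocks    = 0;
        reserve_holder = thread;
        return true;
    }
};

int main() {
    AvoidancePool pool{/*free_blocks=*/3};
    std::printf("thread 0 grow by 5: %s\n", pool.alloc(0, 5) ? "ok" : "wait");
    std::printf("thread 1 grow by 2: %s\n", pool.alloc(1, 2) ? "ok" : "wait");
    return 0;
}
```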

Deadlock avoidance mode isn’t foolproof. If the programmer manages to meet three conditions:

  1. Two threads need to allocate registers

  2. The high register usage sections of both threads depend on each other, for example in a producer consumer model

  3. No other thread can give up their registers until the two threads above make progress

Then they can run into a deadlock even with deadlock avoidance mode enabled. Programmers should probably avoid cross-thread dependencies in dynamic VGPR mode, unless they’re confident threads only wait on each other in low VGPR usage sections.

Dynamic VGPR Mode Limitations

As with many new features, dynamic VGPR mode isn’t a one-size-fits-all solution. It’s narrowly targeted to start, and can only be used with wave32 compute shaders. Graphics shaders like pixel and vertex shaders can only use the regular non-dynamic launch mode. The same goes for wave64 shaders of any type.

A workgroup of threads launched in dynamic VGPR mode will “take over” the equivalent of a GPU core. That would be a Workgroup Processor (WGP) in WGP mode, or a Compute Unit (CU) in CU mode. Thus dynamic and non-dynamic threads can’t coexist on the same GPU core.

Registers used to specify various compute program launch parameters

Dynamic VGPR mode may be less efficient at using register file capacity. Each enabled thread slot gets a reserved VGPR block, regardless of whether a thread is actually running in that slot. A workload that doesn’t have enough parallelism to fill all enabled thread slots would waste those reserved registers. Deadlock avoidance mode would set aside more registers that could have been easily allocated in non-dynamic mode. Drivers can reduce reserved register count by disabling deadlock avoidance mode or reducing thread slot count. Both of those options come with obvious downsides. In wave32 mode, non-dynamic register mode can allocate up to 256 registers in 24 entry blocksa on current RDNA 4 GPUs. That offers finer granularity than the 32 entry blocks needed to give a thread 256 registers in dynamic VGPR mode.

Nvidia’s Dynamic Register Allocation

AMD isn’t the only GPU maker that lets a thread adjust register allocation mid-execution. Nvidia introduced a setmaxnreg PTX instruction in Hopper, and that’s carried forward to Blackwell consumer GPUs. setmaxnreg superficially acts like AMD’s s_alloc_vgpr, letting the calling thread request a different register allocation. However Nvidia’s dynamic register allocation works very differently from AMD’s, and is probably better called register reassignment. Nvidia for their part never gave this mechanism a name.

Nvidia doesn’t use a separate launch mode. Kernels always launch the regular way, with a specified register allocation that also determines how many threads they can run concurrently. For example a compute shader that uses 96 registers on Blackwell will only be able to run 5 concurrent threads in each SM sub-partition. After threads launch, they can call setmaxnreg to shift registers between threads in the same workgroup. Unlike s_alloc_vgpr, setmaxnreg‘s register pool is whatever the workgroup started out with. If every thread calls setmaxnreg and requested register count across threads is greater than what the workgroup started with, they will deadlock regardless of how much free space the register file may have.
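The five-thread figure follows from the register file size. Here's a quick check, assuming a 64 KB register file per SM sub-partition and treating the "threads" above as 32-wide warps with 32-bit registers.

```cpp
#include <cstdio>

int main() {
    const int regfile_bytes = 64 * 1024; // per SM sub-partition (assumed)
    const int lanes         = 32;        // warp width
    const int reg_bytes     = 4;         // 32-bit registers
    const int regs_per_warp = 96;        // the example shader's allocation

    const int total_regs = regfile_bytes / (lanes * reg_bytes);        // 512
    std::printf("Concurrent warps: %d\n", total_regs / regs_per_warp); // 5
    return 0;
}
```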

As an aside, setmaxnreg is a PTX instruction. PTX is an intermediate-level programming language for Nvidia GPUs with an assembly-like syntax. It isn’t assembly, which Nvidia calls SASS. However PTX is meant to give more control over emitted instructions than a C-like high level language. Therefore PTX instructions often have similarities with SASS instructions, and can offer hints about the underlying ISA.

The semantics around setmaxnreg suggest Nvidia’s mechanism is geared towards tightly orchestrated register swapping between threads. It’s not like AMD’s free-flowing dynamic allocation behavior where different threads can be out-of-phase with each other, so to speak. Nvidia’s “warpgroup” likely refers to threads sharing the same SM sub-partition, and thus the same register file.

The same setmaxnreg instruction must be executed by all warps in a warpgroup. After executing a setmaxnreg instruction, all warps in the warpgroup must synchronize explicitly before executing subsequent setmaxnreg instructions. If a setmaxnreg instruction is not executed by all warps in the warpgroup, then the behavior is undefined

Miscellaneous Instructions, Parallel Thread Execution ISA Version 8.7

A determined developer could emulate AMD’s initial dynamic VGPR state on Nvidia with a workgroup that allocates all register file capacity in a SM, then immediately has every thread trim its allocation down to the minimum. But after that, synchronization requirements on Nvidia would make it difficult to emulate AMD’s independent allocation behavior. setmaxnreg‘s scalar-only input makes it harder to look up a desired allocation value from memory. Of course difficult doesn’t mean impossible. A register input can be emulated with a sufficient application of conditional branches, but let’s not think about that too much.

Not Hopper or Blackwell, but have a Nvidia related image to spice things up. I’m sick of seeing AI generated images everywhere, so I’m going to start taking more pictures with my DSLR and post them

In exchange for less flexibility, Nvidia should have no problem mixing “dynamic” and regular threads on the same SM. Nvidia can also adjust register allocation with finer granularity than AMD. The latter can be especially important because Nvidia has smaller 64 KB register files, and waste from “slack” register file usage can be even more painful.

Nvidia’s register reassignment mechanism isn’t well suited for AMD’s raytracing use case. However, Nvidia’s raytracing design likely doesn’t need it. Nvidia hardware uses a DXR 1.0 raytracing model. If it works like Intel, raytracing stages execute as separate thread launches on the SMs. Regular vector register allocation that happens at each thread launch would already solve the problem AMD faces with all-in-one raytracing shaders.

And Intel?

Intel’s documentation explicitly states that raytracing stages execute as separate thread launches. But even if they didn’t, Intel would benefit less from dynamic register allocation than AMD. Intel GPUs used fixed register allocation until very recently. Each thread gets 128 registers whether it needs them or not. More recent GPUs like Battlemage add a “large GRF” mode that cuts occupancy in half to give each thread 256 registers. There’s no option in between.

Intel’s Arc B580

Therefore Intel can maintain full occupancy with a higher per-thread register count than either AMD or Nvidia. Dynamic VGPR allocation is only useful if it helps increase occupancy in the first place, that is, when the GPU can’t achieve full occupancy with non-dynamic VGPR allocation. If Intel were to dynamically allocate registers, the very coarse register allocation granularity may result in more threads getting blocked than on AMD.

Final Words

AMD’s dynamic VGPR allocation mode is an exciting new feature. It addresses a drawback with AMD’s inline raytracing technique, letting AMD keep more threads in flight without increasing register file capacity. That in turn makes RDNA 4 less latency sensitive in raytracing workloads, likely with minimal power and area cost. Raytracing shaders that use more than 96 VGPRs are attractive targets for the dynamic VGPR feature.

Profiling Quake 2 RTX under Radeon Graphics Profiler. AMD chose to use fully inlined raytracing shaders, and no dynamic VGPR allocation, despite the shader being limited to 9 threads (out of 16 slots) due to VGPR usage.

Raytracing shaders on AMD can either inline all raytracing stages, or use an “indirect” mode where different stages are executed in separate function calls. So far, I’ve only seen AMD use dynamic VGPR allocation in indirect mode. Raytracing stages all take place within the same thread in both modes, but perhaps function call sites provide a convenient place to adjust VGPR allocation. After all, a function has clearly defined entry and exit points. AMD often prefers to inline raytracing stages to avoid function call overhead. I have not seen dynamic VGPR mode used when raytracing stages are inlined, even when raytracing shader occupancy is VGPR limited.

The RX 9070 provided by AMD

Certainly s_alloc_vgpr isn’t limited to function call sites, so I wonder if future AMD drivers will be more trigger-happy with dynamic VGPR mode. Conversely, AMD uses dynamic VGPR allocation in indirect mode even when non-dynamic allocation could have achieved full occupancy. Doing so shouldn’t hurt performance, but it does suggest driver decisions aren’t so fine grained at the moment.

Setting “Disable raytracing shader inlining” using AMD’s tools makes the driver use raytracing shaders with function calls, which also use dynamic register allocation. Done here to illustrate effect on occupancy

Generic compute workloads could benefit from dynamic VGPR mode too, assuming AMD does work to expose the feature through various toolchains. Some of Nvidia’s GPGPU libraries take advantage of setmaxnreg, so there are probably compute applications for AMD’s dynamic VGPR feature too.

At a higher level, features like dynamic VGPR allocation paint a picture where AMD’s GPU efforts are progressing at a brisk pace. It doesn’t feel like an easy feature to implement. Thread register allocation could be non-contiguous in the physical register file, complicating register addressing under the hood. Features like deadlock avoidance would demand additional work. With regards to raytracing, dynamic VGPR allocation shows there’s plenty of progress to be made within AMD’s single-shader raytracing model. Along with breaking false cross-wave memory dependencies, AMD seems determined to keep stamping out performance limiters with each generation.

If you like the content then consider heading over to the Patreon or PayPal if you want to toss a few bucks to Chips and Cheese. Also consider joining the Discord.

References

  1. “RDNA 4” Instruction Set Architecture Reference Guide

  2. Nvidia setmaxnreg

Footnotes

a. RDNA 4’s ISA manual indicates the 24 register allocation granularity only applies to devices with 1536 VGPRs per SIMD, or 192 KB register files. Other RDNA 4 devices allocate VGPRs in blocks of 16 registers, and likely have a 128 KB register file. RDNA 3 used smaller 128 KB register files in lower end devices, reserving 192 KB register files for the highest end SKUs. As RDNA 4 SKUs with non-192 KB register files do not exist at the time of writing, there is no need to discuss them in the article proper. However, such devices may launch in the future and it’s something to be aware of.

Inside Nvidia's GeForce 6000 Series

2025-04-01 18:48:14

2025 has kicked off with a flurry of GPU activity. Intel's Arc B580 revealed that it's still possible to make a mid-range GPU with more than 8 GB of VRAM. AMD's RDNA 4 marked the continuation of a longstanding AMD practice where they reach for the top-end, before deciding it wasn't worth it after all. Nvidia too has a new generation to celebrate 2025, and their 5000 series launch has come and gone without cards on shelves. But bigger numbers are better, so it's time to talk about the GeForce 6 series.

Gamers demand higher quality graphics with each generation. No one knows this better than Nvidia, so GeForce 6000 series cards are built to deliver near-cinematic quality at framerates high enough to support interactive gameplay. GeForce 6000 GPUs, or the GeForce 6 series for short, are built with the highly parallel nature of graphics rendering in mind. At the same time, they take a great leap forward in terms of programmability, opening up exciting new possibilities for complex in-game effects.

Overview

Graphics rendering involves transforming vertex coordinates from 3D space to 2D screen space, then determining which pixels each primitive covers and calculating their final colors, the latter process being known as rasterization. Both stages are inherently parallel tasks, and map well to hardware with large arrays of execution units. Accordingly, a GeForce 6 series GPU is a massively parallel machine. It has a strong complement of fixed-function graphics hardware, but the real power of the GPU lies in a collection of vertex and pixel shader cores. These programmable components execute shader programs provided by the game instead of carrying out preset functions. They also serve as basic building blocks, letting Nvidia scale to different power, price, and performance targets. The highest end GeForce 6000 series chip, NV40, implements 6 vertex shader and 16 pixel shader cores.

From Nvidia’s IEEE paper

A highly parallel machine needs a high bandwidth memory subsystem to keep it fed. GeForce 6 series products can feature up to a 256-bit GDDR3 DRAM setup, giving it significantly more memory bus width than typical desktop CPUs. The GPU features a L2 texture cache shared across all pixel and vertex shader cores, which enables short-term reuse of fetched texture data. Nvidia did not disclose cache sizes at this time, but they aim for a 90% hitrate with many misses in flight, rather than the 99% hitrate one often sees with CPU caches. The GPU communicates with the host system via the popular AGP interface, but can also support the upcoming PCI Express standard.

Vertex Shader Core

Vertex shader programs transform coordinates from 3D to 2D screen space. It may sound like a trivial task that involves little more than a camera matrix multiplication and perspective division. But programmable vertex shaders open up new tricks. For example, a vertex shader can sample a texture and use it as a displacement map. Besides supporting texture accesses, GeForce 6000's vertex shader cores support branches, loops, and function calls. Much of this capability was previously unthinkable on anything outside a real CPU, demonstrating the exciting nature of GPU evolution.

Vertex shader execution starts with instruction fetch from a 512-entry instruction RAM. Nvidia's driver-facing format uses 128-bit vertex instructions, which are translated into a 123-bit internal format. Thus the instruction RAM has approximately 8 KB of capacity. DirectX 9's vertex shader 3.0 standard mandates a minimum of 512 instruction slots, and Nvidia's vertex shader core ISA is closely aligned to DirectX 9's HLSL instructions. Thanks to instruction limits, shader programs are immune to performance losses from instruction cache misses, which CPU programs can often suffer. Furthermore, accessing the instruction RAM doesn't involve tag comparisons like a cache would, saving power.
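A quick check of that capacity figure, using both the 123-bit internal format and the 128-bit driver-facing encoding:

```cpp
#include <cstdio>

int main() {
    const int entries = 512;
    // 512 entries of the 123-bit internal format versus the 128-bit driver format
    std::printf("Internal: %.1f KB, driver format: %.1f KB\n",
                entries * 123 / 8.0 / 1024.0,    // ~7.7 KB
                entries * 128 / 8.0 / 1024.0);   // 8 KB
    return 0;
}
```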

GeForce 6000 Vertex Shader instruction layout, inferred from Mesa code

DirectX 9 vertex shader HLSL instructions broadly fall into scalar and vector categories. Scalar instructions include special operations like inverse square roots. Vector instructions generally involve basic operations like multiply-add, and operate on 128-bit vectors of four 32-bit values. GeForce 6000's vertex shader pipeline is highly optimized for this arrangement, and features separate vector and scalar pipelines. Each ISA instruction specifies both a scalar and vector operation, letting the vertex shader core exploit parallelism in two dimensions within one instruction stream. Vectors specified by the DirectX 9 shader program provide vector-level parallelism. Any scalar+vector dual issue opportunities found by Nvidia's compiler provide additional parallelism.

A third source of parallelism comes from multithreading, and serves to hide latency. The vector operation slot can accept texture sampling instructions. Memory accesses from a vertex shader should still be relatively uncommon, so vertex shader cores don't have an L1 texture cache tied to their texture fetch unit. Nvidia expects a shader program will need 20-30 instructions to hide texture fetch latency, which can be hard to achieve from a single thread. Therefore, vertex shader cores can each track up to three threads and switch between them to hide latency.

Instruction inputs can come from registers or constant RAM. Both consist of 128-bit vector entries to match vector execution width. Register files are split into input, output, and temporary registers. The input and output registers each have 16 entries, and are read-only or write-only respectively from the shader program's point of view. The temporary register file supports both reads and writes, and has 32 entries. DirectX 9's vertex shader 3.0 specification lets a shader program address up to 32 registers, but Nvidia might share the register file between multiple threads. If so, a vertex shader program should use no more than 10 temporary registers to achieve maximum occupancy.

Pixel Shader Core

Pixel shaders, or fragment shaders, do much of the heavy lifting because rendering a scene typically involves processing far more pixels than vertices. Accordingly, a GeForce 6000 GPU can have up to 16 pixel shader cores. The pixel shader cores themselves are highly programmable just like the vertex shader cores, with branching support among other goodies. However, pixel shader cores are built very differently to exploit increased parallelism typically present at the pixel level.

GeForce 6000's pixel shaders use 128-bit instructions, though the encoding is substantially different from the one used in vertex shaders thanks to hardware differences. Nvidia has chosen to support up to 65536 pixel shader instructions, exceeding the DirectX 9 minimum specification of 512 instruction slots by a wide margin. Using all instruction slots would consume 1 MB of storage, so pixel shader cores might use an instruction cache.

The fragment processor has two fp32 shader units per pipeline, and fragments are routed through both shader units and the branch processor before recirculating through the entire pipeline to execute the next series of instructions

From The GeForce 6 Series GPU Architecture, Emmett Kilgariff and Randima Fernando

Where Nvidia's vertex shader core operates much like a CPU with 3-way SMT to hide latency, the pixel shader core uses a SIMD execution model across threads. That parallelism, often referred to as SIMT (Single Instruction Multiple Thread), applies on top of the SIMD you get within a thread from using multi-component vectors. Rather than tracking three separate threads, Nvidia groups many pixel shader invocations into a vector and effectively loops through the "threads" in hardware. This approach lets Nvidia keep thousands of "threads" in flight at low cost, because threads in the same vector must execute the same instruction and cannot take an independent execution path from other threads. Only the data being processed is different.

Programmers must pay attention to divergence penalties with this SIMT threading model. If different threads within a vector take different directions on a conditional branch, the pixel shader core will execute both sides of the branch with non-active threads masked off. That contrasts with the vertex shader core's MIMD execution model, which allows penalty free branching even if branch directions diverge across threads running in the same core. Nvidia suggests keeping branches coherent across regions of over 1000 pixels, or approximately 256 2x2 pixel quads, hinting at very long vector lengths.

Keeping that much work in flight is critical to hiding latency, but places pressure on internal chip storage. DirectX 9 lets pixel shaders address 32 temporary registers, which continue to be 128 bits wide. Keeping 256 threads in flight would require 128 KB of register file capacity per pixel shader core, which will not be achieved in GPUs for several years. GeForce 6000 uses smaller register files of unknown size. Nvidia says pixel shader programs can get the maximum number of threads in flight if they use four or fewer 128-bit registers. As a ballpark estimate, 256 threads with four registers per thread would require 16 KB of register file capacity.
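Both register file figures come from the same arithmetic, assuming roughly 256 threads in flight per pixel shader core and 16 bytes per 128-bit register:

```cpp
#include <cstdio>

int main() {
    const int threads   = 256;  // rough in-flight thread estimate per core
    const int reg_bytes = 16;   // one 128-bit (4 x FP32) register

    std::printf("32 regs/thread: %d KB\n", threads * 32 * reg_bytes / 1024); // 128 KB
    std::printf("4 regs/thread:  %d KB\n", threads * 4 * reg_bytes / 1024);  // 16 KB
    return 0;
}
```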

The pixel shader core's two 128-bit vector units are placed one after another in different pipeline stages. Both can execute four FP32 operations per cycle, though only the lower one can do multiply-add. The upper one can handle special functions and texture address calculation. Texture operations are issued between the two execution unit stages. Peak FP32 throughput is 12 operations per cycle. That can be achieved for example by issuing a vector FP32 multiply in the upper stage and a FP32 multiply-add in the lower one.
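As a quick check of the peak figure, counting a multiply as one operation per lane and a multiply-add as two:

```cpp
#include <cstdio>

int main() {
    const int vec_width = 4;              // 4 x FP32 per vector instruction
    const int upper_mul = vec_width * 1;  // multiply in the upper unit: 4 ops
    const int lower_mad = vec_width * 2;  // multiply-add in the lower unit: 8 ops
    std::printf("Peak FP32 ops/cycle: %d\n", upper_mul + lower_mad); // 12
    return 0;
}
```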

From a shader program's perspective, the upper and lower vector units together can complete two vector operations per cycle. Compared to the vertex shader cores, the pixel shader's sequential "dual issue" arrangement lets the upper unit forward its results to the lower one. Thus two dependent instructions can "dual issue". Besides interleaving instructions for the two vector units, Nvidia's compiler can pack operations that work on different subsets of vector elements into a single instruction, which improves vector unit utilization within a thread. FP16 execution can improve throughput even further. Full 32-bit precision is often not necessary for graphics rendering, especially when it comes to pixel colors. Both vector execution units in the pixel shader core can execute FP16 operations at double rate. Using FP16 also halves register file usage for those values, which in turn can also improve occupancy and therefore latency hiding.

Input registers not drawn, but all shader programs take input registers and write their results to output registers

Texture sampling is an important part of pixel shading, so pixel shader cores get an optimized texture sampling path compared to the vertex shader cores: each core has an L1 texture cache, backed by a chip-wide L2 texture cache.

Going Beyond Pixel Shading

Pixel shader programs normally output pixel colors, but colors are really just numbers. The bulk of GeForce 6000's massive parallel compute power is concentrated in its array of pixel shaders, and having lots of GFLOPs is great for non-pixel things too. Furthermore, the flexibility of the pixel processing pipelines can let creative programmers do just about anything.

For example, ray tracing is a fundamentally different approach to graphics rendering (compared to rasterization) that involves tracing light rays through a scene. Ray tracing was largely confined to offline applications because of its compute power requirements. However, GeForce 6's programmable pixel shaders are up to the task of tracing rays in real time, at least for simple scenes.

From Stanford's presentation on the Brook API

The possibilities go beyond different graphics rendering techniques. The power of programmable shaders has spurred the development of new GPU programming APIs not directly aimed at graphics. Stanford's Brook API targets general purpose compute on GPUs. Its programming model is tightly tied to the way GPUs are optimized for parallel work. Such a model can take some getting used to, especially as most programmers have been taught using a serial execution model. But researchers and other developers doing highly parallel and highly regular data processing should take note of these APIs.

Difficulties Remain

Significant barriers still stand in the way of running any parallel task on a GPU: shader programs access memory through textures bound to them; textures have limited size compared to CPU-side memory allocations; floating-point precision is often lacking compared to a full-spec IEEE 754 implementation; shaders can only execute for a short length of time without stalling the display; textures can't be modified during shader execution, and so on.

Developers also have to move data between CPU and GPU memory spaces to provide the GPU with data and get the results. The latter can be problematic because GPUs are optimized for presenting pixel shader output as a frame on-screen, before quickly overwriting it with a subsequent frame. Copying data back from the GPU can run into host interface limitations.

Nvidia is no doubt aware of these limitations, and is working to address them. GeForce 6 will support the incoming PCI Express standard alongside AGP. PCI Express's increased bandwidth moves the GPU one step closer to being an accessible parallel accelerator.

Final Words

GeForce 6's pixel and vertex shader pipelines are more flexible than ever, and show that Nvidia is taking programmable shaders seriously. Many of the capabilities introduced in GeForce 6 may seem excessive for current gaming workloads. It's hard to imagine anyone writing a shader hundreds of instructions long with loops, calls and branches mixed in. What GeForce 6's capabilities show is that Nvidia is competing on features beyond basic graphics rendering. It's part of a larger trend arising from the move away from fixed-function hardware, and has exciting implications for GPUs. Perhaps soon, we won't be calling these cards GPUs anymore, considering they can do much more than render graphics.

Despite their programmability, GeForce 6000 GPUs continue to focus strongly on graphics. Nvidia's shader ISA remains closely tied to DirectX 9 specifications, ensuring shader programs in games run well on the hardware. And the hardware is quite powerful; a high-end GeForce 6000 chip has over 200 million transistors. That's made possible by IBM's advanced 130nm process. Providing all that processing power demands serious power delivery too, so high end cards use a pair of Molex connectors. Molex connectors are a time-tested standard, with thick pins and wires that can reliably supply power to a variety of peripherals without melting.

In conclusion, GPUs are advancing at an incredible pace. 2005 is an exciting time to be alive. Graphics rendering technologies are moving in lock-step with the nation's economy towards 2008, and undoubtedly everyone is looking forward to that bright future.

Wait, what year is it again? Oh, and happy April Fools!

If you like the content then consider heading over to the Patreon or PayPal if you want to toss a few bucks to Chips and Cheese. Also consider joining the Discord.

References

  1. Nvidia GeForce 6800 Hot Chips presentation

  2. Emmett Kilgariff and Randima Fernando, The GeForce 6 Series GPU Architecture

  3. John Montrym and Henry Moreton, The GeForce 6800

  4. Ashu Rege, Shader Model 3.0

  5. Matthias Wloka, GeForce 6 Series Performance

  6. Ian Buck et al, Brook for GPUs: Stream Computing on Graphics Hardware

An Interview with Oxide's Bryan Cantrill

2025-03-28 08:15:21

Hello you fine Internet folks,

Today we have an interview with Bryan Cantrill from Oxide Computer Company.

Cloud computing has been a tour de force in the computing industry, with many businesses and even governments moving over to cloud services for their computing infrastructure. Thanks to the sheer size of many cloud providers, those customers are almost completely insulated from any hardware issues that crop up, because the provider can automatically migrate VMs to other systems while the hardware issue is resolved. But not everything can or even should be moved to the cloud, for reasons such as compliance, cost, and so on. That means some folks have to stick with on-premises compute, where the cloud model of automatic migration in the case of hardware failure is still relevant, and that is where Oxide fits in.

What Oxide is building is effectively an on-premises cloud. To start with, Oxide’s philosophy is to treat a rack of servers the same way a hyperscaler (AWS, GCP, Azure, etc) does. So, instead of directly accessing a single node in a server and running your workload, you start up a VM in the Oxide control panel just like in AWS or Azure. This allows for automatic fail-over of VMs in the event of a hardware failure. To facilitate this cloud-like behavior, Oxide is building its own racks, which resemble what you might find in a hyperscaler’s datacenter. I had the chance to interview their CTO, Bryan Cantrill, while I was in the Bay Area this past week to ask about their hardware and what they are doing.

Hope y'all enjoy!

Transcript below has been edited for conciseness and readability.

GEORGE: Hello you fine internet folks! Today, we're here at Oxide Computer Company, and I have with me, Bryan.

BRYAN: Heya!

GEORGE: Who are you, and what do you do at Oxide?

BRYAN: Hi, I'm Bryan Cantrill, I'm the CTO at Oxide Computer Company, we're here at our office in Emeryville- our office and lab, kind of our playhouse here in Emeryville.

We're a computer company! We're a modern computer company, we are a rack-scale computer company.

So, this is the Oxide rack behind you, and what we have done- our observation was, if you look at those folks deploying compute at scale- Amazon, Google... hyperscalers right? They've all built their own computers. And we (I along with my cofounder) were at a public cloud company, Joyent, that was purchased by Samsung-

GEORGE: Interesting!

BRYAN: It was interesting! And, we were... after Samsung bought the company, they were really trying to deploy at Samsung-scale. And we were deployed on commodity gear; we were deployed on Dell, Supermicro, some HP, some Arista... and uh...

GEORGE: Trying to match all that stuff can be very difficult.

BRYAN: When we hit scale, everything broke.

GEORGE: I can imagine.

BRYAN: And, to be fair, everything broke in hardware and software, but the difference is, with the software, we could actually go fix it. And we fixed a bunch of our software systems, but then the problems you're left with are those problems, that are at the hardware/software boundary. And... it was pretty frustrating, and you look at like, "how did these other folks do it?" And you realize, they've done their own machines.

GEORGE: So, what makes... (if you wanna show to the audience) one of these?

BRYAN: Right! So this is an Oxide sled, and this doesn't look like a normal server computer, right?

GEORGE: No, no it doesn't. It looks like a blade.

BRYAN: It looks like a blade, right. And actually, if you look at the back... it even looks more like a blade. Lemme take off the Kapton tape there... It blind mates in the power first of all, we - like everybody running at scale - we run a DC bus bar, up and down the rack. So you've got an actual power shelf, that contains [bridge] rectifiers, those rectifiers then take you from AC to DC, you run DC on the bus bar,

GEORGE: DC 48v?

BRYAN: Uh, yeah, 54v. That is the way that everybody at scale runs, with a DC bus bar.

... you can't buy a DC bus bar based machine, DC bus bar-based rack, from Dell, HP, Supermicro! And they'll tell you nobody wants it.

GEORGE: [Sounds of incredulity] Hmmm...

BRYAN: Right, exactly!

GEORGE: The fact that this exists tells me otherwise!

BRYAN: Exactly, it definitely does, and one of the things I've appreciated... we've kind of accreted the server architectures that we have.

This traditional [server] architecture has accreted over time, and until we took a clean sheet of paper, you really don't appreciate just how many things are broken with it! One of the things you commented on, is the noise; it's so much quieter.

GEORGE: Yeah, and, it's off-camera right here, but there's one rack running right now over to my right side... you can hear it, but it's not a tinny noise. It's a very... almost sort of wind-blowing noise. Which is exactly what it is.

BRYAN: Right, and you know what's funny? We didn't design this thing to be acoustically pleasant, ...

GEORGE: It just turned out like that?

BRYAN: It just turned out like that. And one of the things you'll appreciate is, when you look at the acoustic unpleasantness in a traditional server... yes, you've got a bunch that's coming from those small fans at the back; a bunch of it is also coming from those fans on the power supplies. Because you've got those AC power supplies...

GEORGE: And it's all like, 40mm fans.

BRYAN: Those are *tiny* fans, and, those AC power supplies, you take them apart... they're crammed. So there's a high static pressure that [the fans] have to overcome; those fans are workin' hard! And it's hot. And of course, that fan is a thing that blows on AC power supplies, so you have two of them. So now we've got two AC power supplies, in every one of these servers, all these power cords... and it's just like, the whole thing is a mess.

And... that's just the beginning; the DC bus bar is to me just the beginning.

GEORGE: So, speaking of... well, you say basic- the way a computer is booted, is you usually start with what's known as a Basic Input Output System, the BIOS,

BRYAN: The BIOS... yeah.

GEORGE: Now, in the early 2000s, this was replaced by UEFI,

BRYAN: UEFI, yes, Itanium's gift to the world!

GEORGE: Yeah, and while that works perfectly well for your average laptop or desktop, when you get to this scale,

BRYAN: It doesn't make sense.

GEORGE: Why is that?

BRYAN: Because it's giving you this kind of optionality you actually don't want. When you have something of this scale - and we have co-designed our host operating system with our hardware - you don't need that optionality, of booting kind of... I don't need to boot DOS on this thing!

GEORGE: [Laughing] You don't want DOS on these machines?!

BRYAN: Okay, that would be kind of entertaining, but... we actually don't need any of that. But we have preserved all of this kind of ancient optionality in the BIOS. A big problem that we have with the BIOS, is that the BIOS has to boot the system in order to boot the system.

So one of the things the BIOS has to do... it needs to find like, how do I boot this thing? I need to actually do I/O, to pull a boot image off of somewhere. I/O, as we know, everything's complicated... you can't just like "do I/O", like, we actually have to bring up PCIE engines, you have to bring up all the CPUs... so you're doing all this work to boot the system, and then you find the image you want to boot, and now you have to be like, "okay, now we have to pretend like we were never here".

So it then tries to- we call it "setting the machine backward", where it makes the machine *look like* it has not been booted, when it executes that first operating system instruction. But in reality an *entire city* has been constructed, and ploughed under; and the operating system can actually see the artifacts of that over time. There's something called System Management Mode, SMM…

GEORGE: [Laughing] Ahhh yes, what some people refer to as "ring -2" if I remember correctly?

BRYAN: That's right, ring -2, and that kind of platform initialization layer, can stuff whatever it wants in SMM.

GEORGE: I still find it hilarious - I think it was HP? - tried putting like a day-calendar in the SMM, which is like... why are you doing this?! [Laughing]

BRYAN: They're doing it, because they wanted to add value to their hardware, without controlling the system software.

So the way to do that, is to jam that software into the system software they do control, which is SMM, but from the perspective of actually running this thing as a server, that's just a problem for me. I don't want to have ring -2. So for us, SMM is empty. Because the other thing is... why do you end up in SMM? For any reason! If you look at the architecture manual, it can go into SMM for any reason, can stay there for any length of time... it's unspecified, that you have to wait.

GEORGE: So, how do you solve this?

BRYAN: So for us, we do have something in SMM mode: if you ever hit SMM mode, we panic the system. Because, under no condition should we enter SMM. So if we entered SMM mode, we bring the system down and take a crash dump.

That would be pretty wild, if we ever saw that happen, right? Because it would mean that... but we have not seen that happen, and we wanted to do that to make sure, if something were to errantly enter System Management Mode.

But we didn't use System Management Mode at all, we also didn't want to have a BIOS at all.

GEORGE: Yep, so how are you getting around that?

BRYAN: Yeah, so that was tough, and in fact, this is something that we didn't really appreciate at the time... AMD didn't think we could pull this off, apparently Google tried this and failed. And if Google has tried something and failed it must be impossible for humanity?!

GEORGE: [Laughing] Well, oftentimes what they do is succeed and then claim that they failed, and then just, cancel the product.

BRYAN: [Laughing] That's right, and it was tough, and it required us to work very closely with AMD. I think that AMD didn't really believe that we could pull it off,

GEORGE: It's... I wouldn't even say it's not a trivial- it's a very complicated problem.

BRYAN: It is, because you are doing that *lowest* layer of platform initialization.

GEORGE: And that platform initialization, people forget, is like, [AMD] memory training, bringing up the PCIE,

BRYAN: That's right.

GEORGE: And, remember, what's bringing up the system? Well, oftentimes, like if you try and access a BMC, that BMC is on a PCIE bus, it has to be brought up and initialized, so there's a lot of complex problems with the BMC,

BRYAN: Speaking of the BMC, we also threw that into the sea!

So the BMC - the Baseboard Management Controller - is the computer-within-the-computer... we felt that the BMC had grown far too large and far too complicated. The BMC should not be on PCIe, from our perspective. What you actually want is environmentals; you want power control; it needs to be on its own network... and that's basically it. Its job is really to hand the host CPU its cup of coffee.

GEORGE: I wish I had someone that hands me my cup of coffee!

BRYAN: So we eliminated the BMC and replaced it with what we call the Service Processor, the SP - kind of going back to an older model... so if you look at this compute sled here - and it may be hard to see in there - that's our Service Processor.

So this is an ST Microelectronics part, and it's a part that is kinda funny because it doesn't need a heatsink, right? This is a 400 MHz part! Which is faster than machines were when I was coming up; faster than the first workstation that I had at Sun Microsystems, by a long shot.

GEORGE: It's what, 80x faster than the original x86? [Ed: Original 8086 was 5 MHz, 400 MHz is exactly 80x faster - good memory and quick arithmetic!]

BRYAN: That's right. So it's like, why are we taking this kind of BMC and running this kind of multi-user operating system on it, when we actually have plenty of compute power there? So we did our own operating system - we took a clean sheet of paper there as well. We were looking around for kind of a best-of-breed option, but we weren't finding anything that we liked exactly.

One of the things we kept running into is that operating systems have this kind of multi-user heritage, where they know how to load programs... which makes sense, absolutely. The idea that an operating system can load a program that it has never seen before is what makes it valuable, makes it usable!

GEORGE: I mean if you think about it, every time you power a system off and reboot, the OS essentially goes, "I'm brand new" - and then you go to, let's say, Steam for example... it doesn't know what Steam is.

BRYAN: Right, yes, exactly.

GEORGE: So the OS has to figure out the program and load it.

BRYAN: So even microcontroller-based operating systems still had this idea of program loading. We wanted programs, but we don't want to load foreign programs on this; all of the things that are in this image, we want it to be aware of when it actually boots.

So Hubris is our operating system-

GEORGE: I love the names - Hubris, and then-

BRYAN: Humility is the debugger. So Cliff Biffle, the engineer who pioneered Hubris, named it that - hubris being, of course, one of the deadly sins of programmers - as a nod to "oh my god, you're doing your own operating system, the hubris of doing your own operating system!" And then of course the debugger for it is Humility.

What's been interesting is that that model - and Cliff has a great talk at OSFC on it, and my colleague Matt Keeter also did a terrific talk on some of the debugging infrastructure we've built on this thing - has allowed us to keep Hubris as a very tight image. Hubris knows about all the tasks that it's going to run when it actually boots; the image has every task in it. It does not load foreign programs, which is what you want in this kind of firmware.
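
[Ed: to make the "every task is baked into the image" model concrete, here is a minimal hypothetical sketch in Rust of a kernel whose task table is a fixed, build-time array - no loader, no foreign programs. This is illustrative only; the `TaskDescriptor` type and task names are invented and it is not Hubris's actual code or API.]

```rust
// Hypothetical sketch of a "no program loading" model: the complete set of
// tasks is a static table fixed at build time. Not Hubris's real API.

struct TaskDescriptor {
    name: &'static str,
    entry: fn(), // entry point compiled into the image
    priority: u8,
}

// Example task bodies; in real firmware these would be drivers, power and
// thermal loops, the SP network stack, and so on.
fn power_sequencer() { /* sequence rails for the host CPU */ }
fn thermal_loop() { /* read sensors, adjust fans */ }
fn sp_network() { /* talk to the other Service Processors */ }

// The entire task set, known when the image is built. There is no code path
// that adds to this list at runtime.
static TASKS: [TaskDescriptor; 3] = [
    TaskDescriptor { name: "power", entry: power_sequencer, priority: 1 },
    TaskDescriptor { name: "thermal", entry: thermal_loop, priority: 2 },
    TaskDescriptor { name: "sp-net", entry: sp_network, priority: 3 },
];

fn main() {
    // "Boot": the kernel already knows every task it will ever run.
    for task in TASKS.iter() {
        println!("starting task {} at priority {}", task.name, task.priority);
        (task.entry)();
    }
}
```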

GEORGE: Yeah, you don't want someone to be able to- even if I had physical access... could I change that?

BRYAN: Great question. So if you had physical access, you could load a new image on here. But then there's a Root of Trust on here, and that Root of Trust would know - unless it were Oxide doing it - that the image has not actually been signed by Oxide. So we actually verify the image.

GEORGE: Now, can you... so let's say I somehow get access to just a single node - I only have time to mess with a single node. You have a single node in a big rack: could you essentially download a new system image for that microcontroller?

BRYAN: You could create your own image, but it would know this is not an Oxide image.

GEORGE: No no, I mean, can it then pull an image - a known-good image - off a different sled?

BRYAN: Ohhh, yeah, well, no - you'd need enough functionality to actually get over our Service Processor network, so it depends on just how crippled that image is. If you put a brick on here, it's going to be a problem...

GEORGE: [Laughs]

BRYAN: As a practical matter, we're very careful about that: there are A and B sides to the microcontroller, so if you do put a bad image on, it can roll back to the other one, and so on. This thing is really designed - it is fit to purpose - for booting and operating a computer. Unlike a BMC, which is really designed to make a server look like a desktop.
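
[Ed: a hypothetical sketch of the boot policy described above - verify the active image against the vendor's key and roll back to the other slot if it fails. The `Slot` and `Image` types and the stubbed `signature_is_valid` check are placeholders for illustration, not Oxide's Root of Trust implementation.]

```rust
// Hypothetical sketch of the boot policy described above: check the active
// slot's signature, fall back to the other slot if it doesn't verify, and
// refuse to boot anything unsigned. Not Oxide's actual Root of Trust code.

#[derive(Clone, Copy, Debug)]
enum Slot {
    A,
    B,
}

struct Image<'a> {
    payload: &'a [u8],
    signature: &'a [u8],
}

/// Placeholder for a real signature check (e.g. verifying a signature over
/// `payload` with the vendor's public key). This stub just compares against
/// a canned value so the sketch runs.
fn signature_is_valid(image: &Image, _vendor_pubkey: &[u8]) -> bool {
    const EXPECTED: &[u8] = b"signed-by-vendor";
    !image.payload.is_empty() && image.signature == EXPECTED
}

/// Prefer the active slot, but roll back to the other side if its image
/// doesn't verify; boot nothing if neither side is signed.
fn select_boot_slot(active: Slot, a: &Image, b: &Image, key: &[u8]) -> Option<Slot> {
    let (first, second, first_img, second_img) = match active {
        Slot::A => (Slot::A, Slot::B, a, b),
        Slot::B => (Slot::B, Slot::A, b, a),
    };
    if signature_is_valid(first_img, key) {
        Some(first)
    } else if signature_is_valid(second_img, key) {
        Some(second) // the rollback path
    } else {
        None
    }
}

fn main() {
    let key = b"vendor-public-key";
    let good = Image { payload: b"firmware", signature: b"signed-by-vendor" };
    let bad = Image { payload: b"tampered", signature: b"not-signed" };
    // A bad image in the active A slot falls back to the B side.
    println!("boot slot: {:?}", select_boot_slot(Slot::A, &bad, &good, key));
}
```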

GEORGE: Yep. I think, well, we're running quite long here, but... one last question - always my last question - what's your favorite type of cheese?

BRYAN: Ooooh... that's a good question you know. I... um, God I mean I love a good sharp cheddar.

GEORGE: I agree with you on that one.

BRYAN: Actually, I also really like Swiss, not just Swiss cheese but Swiss cheeses, so a good Emmentaler, something like that, but my kids are less into that. I guess I'm pretty traditional in that regard.

GEORGE: I 100% agree. Well, thank you so much, Bryan, for this - and thank you for watching! Like, hit subscribe, do all that, comment... it really does help the algorithm, and we must appease the algo gods.

So, thank you so much!

BRYAN: Thank you!

If you like the content, then consider heading over to the Patreon or PayPal if you want to toss a few bucks to Chips and Cheese. Also consider joining the Discord.