Zen 5: AMD’s biggest innovation since first Zen [expanded deep dive]

Higher performance and, finally, a larger L1 cache; plus notes on SMT

It’s roughly two weeks until AMD releases processors with the new Zen 5 architecture. This week, we finally got proper details on these CPUs’ architecture, which AMD revealed at the Tech Day event. So, we can now break down the changes the company has made to the core, compared to Zen 4 – and they’re pretty extensive, probably more so than they seemed in June. And AMD also reiterated its promise of a 16% increase in IPC for these CPUs.

We originally published our analysis of the AMD Zen 5 architecture on July 19th. However, AMD unusually disclosed the technical details in two steps: first during the Tech Day event, then with a more thorough deep dive in a further press briefing last week, which we had the opportunity to attend. That’s why we are republishing the Zen 5 write-up, this time expanded and revised with the newly available data on the new core. It doesn’t change our conclusions; in fact, the new disclosures show yet more areas that AMD reworked when developing the architecture.

Big upgrades to the core without changing the chip(let) area?

AMD has confirmed that Zen 5 uses TSMC’s 4nm N4P node for both the desktop Ryzen 9000 (specifically for its CPU chiplets) and the mobile Ryzen AI 300, or “Strix Point” APU. Strix Point is a relatively large chip with a die area of 232.5 mm² (the Phoenix and Hawk Point predecessors based on the Zen 4 architecture have a die area of just 178 mm²), which implies increased manufacturing costs.

Surprisingly, this isn’t true for the desktop version: the CPU chiplet with eight Zen 5 cores for the Ryzen 9000s hasn’t gotten any bigger compared to the Ryzen 7000 chiplet. It reportedly has a die area of 70.6 mm², while the 5nm Zen 4 chiplet is listed at 71 mm². The IO chiplet is still the same, a 6nm die with a die area of 122 mm². So it seems that the Zen 5 core itself hasn’t gotten much bigger, or AMD has only let it bloat by as much as the architects managed to save elsewhere by optimizing the design and using the slightly better 4nm node. This will probably further widen the already notable gap between the core sizes of AMD and Intel, whose P-Cores (performance cores) have a much larger die area than the cores of the AMD Zen lineage.

AMD Ryzen 9000 processor without a heat spreader, illustration (Author: AMD)

In interviews at Tech Day, one of the lead engineers (or now engineering team managers), Mike Clark, confirmed that Zen 5 is a new core in the Zen line of architectures that largely establishes a new foundation. So was Zen 3, but Zen 5 is probably a case of a more profound change, as the long-maintained foundation built around a core with four decoders and four ALUs, which held from Zen 1 through Zen 4, has been abandoned. Instead, Zen 5 brings a new, broader foundation for future development (though it might not necessarily be used as long as the previous one).

Some of the investment in this core reorganization may only pay dividends in the follow-up generations. It is possible that in some respects Zen 5 is mainly setting the stage, since the company’s resources were focused on successfully creating this foundation and building a functional first generation on it, rather than on including all the potential enhancements that could be added right away at this starting point. As an example, Mike Clark revealed that Zen 5 currently lacks the NOP instruction fusion capability that previous cores contain (allowing up to four NOPs to occupy only one “spot” in the processor pipeline and in the RoB). Allegedly this feature would have slightly less benefit than it had in the older, narrower cores, so the decision was made to leave its reintroduction for later.

Frontend: The biggest new feature of the core

What was known in advance about Zen 5 was that it would bring a significant boost to the SIMD units, whose width is doubled to 512 bits, and also a widening of the execution backend of the core. In the end, however, there are equally big changes to the frontend, which ensures that the execution units themselves have work to do, so the core is significantly upgraded in virtually every part.

The most significant change in the frontend is that the core has duplicated the blocks performing fetch and instruction decoding, not in the sense of simply doubling the bandwidth of the block, but in that the core has these parts present in two parallel instances. To support this, the L1 instruction cache supports fetching program instructions from two locations (two instruction streams) simultaneously (the fetch handles 2× 32 bytes per cycle). Similarly, it is also possible to simultaneously fetch instructions from two locations in the uOP cache, which stores already decoded instructions.

These two instruction streams are consumed by two clusters of decoders. While all four previous generations of Zen cores had four decoders (i.e., the ability to decode four instructions per cycle), Zen 5 has two clusters of four decoders each, with each cluster capable of processing one of the two instruction streams.

(Author: AMD, via: ComputerBase)

This configuration can behave in one of two ways. The Zen 5 core still provides SMT, unlike Intel’s latest architecture, so it can process two threads at once. In this situation, each thread gets one decoder cluster, and thus has as many decoding resources as the previous cores had in total. However, AMD’s Mike Clark confirmed that both decoder clusters can be used by the core even when only a single thread is running (he also said that Zen cores are typically designed so that all the core’s resources are always available to a single running thread as well).

However, this capability likely has its limitations, and the core is not able to involve both clusters as often as Intel’s E-Cores like Gracemont and Skymont, which use a similar technique. It is probably the case that the second cluster of decoders can get involved when there is a branch in the program and a jump to another address is detected (predicted as a taken branch). The second cluster of decoders then does not have to wait for the decoders of the first cluster to finish their work, because it knows the place where it can start decoding, since the branch target address is known to be the start of an instruction.

It is sometimes claimed that typical x86 code contains one branch per six instructions on average, which would allow both clusters to be used relatively often, so this addition may increase the overall IPC of the core by a bit. However, this probably means that with the Zen 5 core, it is beneficial to compile code such that taken branches are the more frequently used path during execution, as opposed to non-taken branches. This is something that may require profiling and recompiling software to take advantage of (see the sketch below), since on older architectures it is usually preferable for branches to be mostly not taken, as earlier cores could handle fewer taken than non-taken branches.
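As a rough illustration of how a developer can influence which side of a branch becomes the frequent path, here is a minimal C sketch using the GCC/Clang __builtin_expect hint; the function and data names are purely illustrative, and profile-guided optimization (e.g. GCC’s -fprofile-generate/-fprofile-use) makes the same per-branch layout decisions automatically from measured behavior.

```c
#include <stddef.h>

/* Hypothetical example: the hint tells the compiler which side of the
 * branch is the common case, which influences whether the hot path is
 * laid out as the fall-through (not-taken) or as a taken branch. */
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

long sum_positive(const long *v, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (LIKELY(v[i] > 0))   /* declared as the frequent path */
            sum += v[i];
    }
    return sum;
}
```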

However, the ability to use these two clusters of decoders simultaneously is likely to be improved in the future, so eventually the contribution to IPC is likely to grow. This may be one of those things where Zen 5 is building the infrastructure that will fully pay dividends in Zen 6 or 7 and so on.

(Author: AMD, via: ComputerBase)

The Zen 5 core has improved branch prediction to achieve a better (lower) misprediction rate, an area where improvements are made every generation. The prediction is supposed to have not only lower latencies, but also higher bandwidth, that is, the ability to process more branches per cycle.

2-ahead branch prediction

The branch predictor is actually a new, next-generation design based on the so-called 2-ahead predictor concept; it can handle conditional branches in zero-bubble mode and can also handle two taken branches per cycle, which is a major upgrade. With two taken branches, the prediction can essentially go through three different branch prediction windows in the code in one cycle. The L1 BTB (Branch Target Buffer) has been enlarged from 1.5K entries to a massive 16K entries, and the TAGE predictor is also larger. The core also received a larger, 52-entry return address stack to improve performance during returns from function calls.

The decode phase is followed by the rename and dispatch phases, which have also been widened to handle eight instructions (operations) per cycle instead of six, so this is another part of the overall core widening. The very final retire phase handles 8 operations per cycle as well, so unlike Intel, AMD doesn’t give it more capacity than dispatch has. However, the width of dispatch and rename (the register renaming phase) is the same for AMD’s and Intel’s cores.

Smaller uOP cache might not actually have reduced capacity

Dispatch can be fed from the uOP cache in addition to the two decoder clusters. The uOP cache should have a capacity of 6000 entries, which is 50% more than Zen 2 and Zen 3, but slightly less than Zen 4 (which apparently had 6750 entries). Up to 12 (2× 6) already decoded instructions per cycle can be served from this uOP cache.

Despite the uOP cache shrinking, this may not be a regression. There’s an important change: whereas in Zen 4 one entry held one decoded uOP, in Zen 5 one entry holds the equivalent of an instruction, which can be a fused set of uOPs. The uOPs are only split up later, for execution. This conserves the actual useful capacity of the uOP cache and also dispatch bandwidth, and it quite possibly makes up for the drop in the nominal entry capacity of the uOP cache.

The associativity of the uOP cache was increased from 12-way to 16-way, too.

Still conservative depth of the Reorder Buffer

AMD has enlarged the Reorder Buffer (RoB), the main queue in which the Out-of-Order principle of instruction execution takes place and which forms an “instruction window” within which the processor can optimize execution by effectively reordering instructions. AMD has always had a relatively small RoB compared to Intel, let alone compared to Apple’s cores. Zen 3 had a 256-entry RoB (the same as Intel’s Gracemont E-Core, and slightly more than Skylake’s 224), Zen 4 increased it to 320 entries, while Intel’s Golden Cove was already at 512. Zen 5 sees a slightly bolder increase to 448 entries (+40%), but AMD is still more conservative here than all its competitors (Apple, Intel and ARM).

Zen 5 is now somewhere between the RoB depth of Sunny Cove/Ice Lake (352) and Intel’s Golden Cove cores, but it does look like AMD can develop cores with significantly more performance per 1 MHz at lower RoB depths than what Intel cores with similar RoBs achieve. This may actually fit the philosophy of “balance” between performance and efficiency that was touted as a guideline for Zen back when the first generation was being announced in 2016 and 2017.

How many stages does the Zen 5 pipeline have?

AMD doesn’t actually share the pipeline (stage) depth of the Zen 5 core, something that wasn’t directly stated for most of the prior Zen cores either. So this will have to be derived through microbenchmarking analysis.

However, it seems that the pipeline may have been deepened by adding an extra stage somewhere. AMD confirmed that the misprediction penalty, which is usually an indirect guide to the number of pipeline stages in a processor, should on average be 1 cycle longer due to the changes made to the core.

More ALUs, stronger backend, unified ALU scheduler

As previously leaked, Zen 5 adds ALUs for the first time since the first Zen; there are now six ALUs instead of just four. Intel’s new core, Lion Cove, has the same number, but ARM and Apple are already at 9 or 10 (note, however, that the law of diminishing returns applies as the number of units increases). So the Zen 5 core can handle up to six common arithmetic-logic instructions in a single cycle.

However, for some operations, the increase in resources is even greater. In previous cores, three of the four ALUs were simple, so they could not execute more complex operations like multiplication, which only the one “complex” ALU could do. The two newly added ALUs in Zen 5 are apparently both complex units, as the new core can perform three integer multiplications per clock, a three-fold improvement. AMD has also increased the number of branch units from two to three (presumably present on the same ports as the simple ALUs). Thus, up to three jumps, or branches, per cycle can be processed.
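A common way to observe this kind of change is a small throughput microbenchmark; the sketch below (illustrative, assuming x86-64 and GCC/Clang, timed externally with a tool such as perf stat) runs three independent integer multiply chains, which a core with three complex ALUs should be able to execute in parallel, whereas a single-multiplier core serializes them.

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Three independent dependency chains: each multiply depends only
     * on the previous value of its own chain, so a core with three
     * integer multipliers can in principle run them concurrently. */
    uint64_t a = 3, b = 5, c = 7;
    const long iters = 100000000;

    for (long i = 0; i < iters; i++) {
        a *= 0x9E3779B97F4A7C15ULL;
        b *= 0xC2B2AE3D27D4EB4FULL;
        c *= 0x165667B19E3779F9ULL;
    }

    /* Print the result so the compiler cannot optimize the loops away. */
    printf("%llu\n", (unsigned long long)(a ^ b ^ c));
    return 0;
}
```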

AMD has increased the number of registers in the physical register file used for general purpose registers (for ALU operations) to 240, a slight increase from 224 in Zen 4. There is a far larger increase in the physical register file for SIMD (FPU) registers, where Zen 4 had only 192 registers, but Zen 5 has 384 registers. These physical registers are used to rename architectural registers (which are exposed to running programs) during the out-of-order and parallel instruction processing, and more registers give the core broader optimizing options.

(Author: AMD, via: ComputerBase)

AMD also made changes to the schedulers from which instructions are fed to the units. Whereas Zen 3 and Zen 4 used four separate schedulers of 24 entries each, one for each ALU (and the AGUs or branch units associated with those ALUs), Zen 5 instead has an 88-entry unified scheduler for all six ALUs (and branch units) and a second, 56-entry unified scheduler serving all the AGUs (Zen 2 previously had a unified scheduler for the AGUs as well). This is intended to allow more flexible allocation of instructions to ALUs, which Mike Clark says has become more important with the increased number of ALUs than it was in previous cores with only four. According to AMD, the new scheduler would be more efficient even with the same number of entries.

There are four AGUs and thus four load/store pipelines, compared to three in Zen 4. The core can perform up to four reads (loads) from memory or the L1 data cache into registers per cycle. Only two writes (stores) can be performed per cycle, and a maximum of four memory operations formed from a combination of writes and reads can be performed each cycle.
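To make the load-pipe count tangible, here is a trivial C sketch (names illustrative): a reduction over four independent arrays naturally issues four independent loads per iteration, which is exactly the kind of code that can saturate four load pipelines.

```c
#include <stddef.h>

/* Four independent loads per iteration: none of the reads depends on
 * another, so they can all be issued in the same cycle on a core with
 * four load pipes. */
long sum4(const long *a, const long *b, const long *c, const long *d,
          size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i] + b[i] + c[i] + d[i];
    return s;
}
```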

FPU: The most powerful SIMD on the market and the best AVX-512 implementation to date

Note that the load/store pipeline throughput described above applies to reads and writes for the basic arithmetic-logic part of the core. The FPU and SIMD unit, which has its own registers, is somewhat separate (this “coprocessor” approach is now also used in Intel’s new Lion Cove core; Intel for a long time had unified ALU and FPU units behind the same execution ports) and has the ability to perform only two reads or two writes. For reads, up to two 512-bit reads are supported (512 bits being the width of an AVX-512 register), because the L1 data cache path widths have been doubled. Two 256-bit or 128-bit writes can be performed per cycle, or one 512-bit write operation.

This widening, and thus doubling, of the data bandwidth between the core and the L1 cache was of course done because AMD has upgraded the SIMD/FPU pipelines in Zen 5. These were 256 bits wide from Zen 2 through Zen 4, so they could execute an AVX or AVX2 instruction in one pass, but Zen 4, which already supported the 512-bit AVX-512 instructions, had to execute those in two passes. Zen 5 doubles the unit width to 512 bits. This means that while the number of SSEx and AVX/AVX2 operations the core can handle per cycle has not changed, the number of AVX-512 instructions the core can execute has doubled. So when software uses AVX-512 instructions, it will be possible to get more performance out of Zen 5.

The number of pipelines seems to have remained the same at six, but each can only handle a subset of all possible instructions, and some of them perform operations like load/store or float-integer conversion. FMA (floating-point multiply-add) instructions can be executed with a throughput of two per cycle (including 512-bit AVX-512), and floating-point adds (FADD) also run at two per cycle. However, simple integer SIMD operations such as addition can achieve a throughput of four per cycle. Another change for the better is that FADD instructions have a latency of two cycles instead of the previous three.

Up to six operations per clock can be performed when a combination of multiplies/FMAs, additions or logical ops, and load/store operations is involved. So it’s not as if Zen 5 (and similarly Zen 4 before it) has six pipelines that are each fully versatile.
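For context, this is the kind of kernel that exercises the FMA pipes; a minimal AVX-512 sketch in C intrinsics follows (function name and loop are illustrative; compile with something like -mavx512f). One _mm512_fmadd_ps processes sixteen single-precision elements, and Zen 5’s full-width datapaths should sustain two such FMAs per cycle, where Zen 4 needed two internal passes per 512-bit instruction.

```c
#include <immintrin.h>
#include <stddef.h>

/* y[i] = a * x[i] + y[i], vectorized 16 floats at a time with AVX-512. */
void saxpy512(float *y, const float *x, float a, size_t n)
{
    __m512 va = _mm512_set1_ps(a);
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m512 vx = _mm512_loadu_ps(x + i);
        __m512 vy = _mm512_loadu_ps(y + i);
        _mm512_storeu_ps(y + i, _mm512_fmadd_ps(va, vx, vy));
    }
    for (; i < n; i++)   /* scalar tail for the remaining elements */
        y[i] = a * x[i] + y[i];
}
```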

(Author: AMD, via: ComputerBase)

However, given that the core can process a 512-bit vector in all SIMD operations, the Zen 5 core should have the most powerful SIMD unit of all competitors. Intel has (or will have) four 256-bit pipelines in the Lion Cove core (which therefore offers only half the compute throughput), Apple has four pipelines of 128-bit width (and therefore only a quarter of the compute throughput), and ARM has six pipelines in the Cortex-X925 core that are more versatile, but each is only 128 bits wide.

Zen 5 should therefore be a very attractive core for SIMD-intensive tasks, and may well be the architecture with the best AVX-512 execution capabilities of any processor to date. But note that this will probably only be true for the server and desktop versions, i.e. Ryzen 9000. While AMD hasn’t officially confirmed this yet, it seems that the mobile Ryzen AI 300 processors have had their core modified to include fewer SIMD pipelines in the FPU; they are still 512 bits wide, but fewer in number, which will obviously somewhat limit performance in SIMD-heavy software. By how much will become clearer once benchmarks are run on both variants of the core after release.

Update: Apparently the difference between the desktop (server) and mobile versions of the core does not lie in the pipeline count; the width has actually been changed. AMD says that both the Zen 5 and the Zen 5c cores used in the Strix Point APU (Ryzen AI 300) actually use 256-bit AVX-512 units, in a fashion somewhat similar to Zen 4, instead of the full 512-bit units that desktop and server Zen 5 and Zen 5c cores have.

This is enabled by Zen 5/Zen 5c actually being designed with a selectable SIMD-unit width. AMD can select the width during the implementation of the core for each particular chip. So while we stated that Zen 5 introduces 512-bit wide SIMD units, this is only true for some of the processors based on the architecture. Note that even in the 256-bit implementation, only the width of the FPU execution units is 256 bits; the load/store datapaths to the L1 cache remain 512 bits wide in all cores.
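One practical consequence: since the same instruction set can sit on top of different datapath widths, software should detect support at runtime and treat per-cycle throughput as a tuning parameter rather than a given. A minimal detection sketch with the GCC/Clang builtin (the printed messages are just illustrative):

```c
#include <stdio.h>

int main(void)
{
    /* Dispatch on instruction-set support; note that AVX-512 support
     * says nothing about whether the units are physically 512 or
     * 256 bits wide, so throughput still differs between chips. */
    if (__builtin_cpu_supports("avx512f"))
        puts("AVX-512F code path (unit width may still vary)");
    else if (__builtin_cpu_supports("avx2"))
        puts("AVX2 code path");
    else
        puts("SSE/scalar fallback");
    return 0;
}
```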

New instructions supported in the Zen 5 core (Author: AMD, via: ComputerBase)

It should also be said that the FPU/SIMD part has its own scheduler(s). AMD continues to use split schedulers with 32 entries each, and there are now three of them (one per pair of pipelines). This is a change from Zen 4, where there were only two such 32-entry schedulers (each serving a trio of pipelines), so the combined capacity of these queues is 50% larger in Zen 5.

So here AMD went in the opposite direction, supposedly because code running on SIMD units is much more regular, looping and predictable compared to the more chaotic code that the arithmetic-logic part has to crunch, so it is easier to handle in split schedulers (for the same reason, it was also easier to increase the number of SIMD/FPU registers by 100%, while the general-purpose register file is harder to grow).

In front of the schedulers, the FPU also implements an NSQ (Non-Scheduling Queue). This queue has also been deepened by 50%, from 64 to 96 entries. The schedulers and queues that instructions pass through help hide instruction latencies, in particular the latencies that would be incurred by waiting for data that prefetching may bring into the cache in the meantime.

Memory subsystem: More powerful and finally larger L1 cache

Another perhaps quite significant improvement is in the L1 data cache, a critically important cache providing the memory space for a program’s working data that is the fastest and closest to the execution units (after the registers, whose capacity is very limited). All previous cores in the Zen line had a 32kB L1D cache, but Zen 5 increases it for the first time, to 48kB, bringing it up to par with Intel’s current big cores (which have had a 48kB L1D since Sunny Cove).

(Author: AMD, via: ComputerBase)

While it is not as large as the 128kB L1 cache in Apple’s processors, it seems that increasing this cache is more challenging in x86 processors, which use 4kB pages (or Intel and AMD don’t want to do it for some other reason). The L1 cache in Zen 5 still has a latency of 4 cycles, which is very good considering the clock speeds achieved, and it has 12-way associativity. As mentioned earlier, the data width for communication is 512 bits, so data can be filled into (or evicted from) the cache in 64-byte chunks. This data width means that the data bandwidth between the core and its L1 cache has doubled. If you see information somewhere about the L1 cache being “twice the speed”, it refers to this increase in bandwidth/width, not some other additional improvement.
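Cache latencies like the 4 cycles quoted here are typically verified with a pointer-chasing microbenchmark. A simplified C sketch follows (sizes and iteration counts are illustrative; the loop would be timed externally, e.g. with perf stat). Because each load depends on the previous one, the time per iteration approximates the load-to-use latency while the ring fits in the 48kB L1D; for larger working sets the ring would need to be randomly shuffled to defeat the prefetchers.

```c
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* A ring of pointers sized to fit in a 48 kB L1 data cache. */
    const size_t n = 48 * 1024 / sizeof(void *);
    void **ring = malloc(n * sizeof(void *));
    if (!ring)
        return 1;
    for (size_t i = 0; i < n - 1; i++)
        ring[i] = &ring[i + 1];
    ring[n - 1] = &ring[0];

    /* Serialized dependent loads: each iteration must wait for the
     * previous load to complete, exposing the load-to-use latency. */
    void **p = ring;
    for (long i = 0; i < 100000000; i++)
        p = (void **)*p;

    printf("%p\n", (void *)p);  /* keep the loop from being optimized out */
    free(ring);
    return 0;
}
```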

The L1 data TLB has 96 entries and is fully associative for all page sizes, while the 4K-entry L2 data TLB covers all page sizes except 1G pages. AMD says the L2 instruction TLB has grown four times, to 2048 entries.
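TLB capacity matters because each entry maps only one page; one common mitigation, shown in the hypothetical Linux-specific sketch below, is to request 2MB huge pages so that a single TLB entry covers 512 times more memory than a 4kB page (flag availability and huge-page configuration depend on the system).

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    /* Ask the kernel for one 2 MB huge page: the whole region is then
     * covered by a single TLB entry instead of 512 entries for 4 kB
     * pages. Requires huge pages to be configured on the system. */
    size_t len = 2 * 1024 * 1024;
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap with MAP_HUGETLB failed");
        return 1;
    }
    /* ... place a TLB-sensitive working set here ... */
    munmap(p, len);
    return 0;
}
```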

The L2 cache remains at 1MB capacity, but given that it was upgraded to this configuration only in the previous architecture, this is probably not a bad thing. The associativity was changed from 8-way to 16-way though, so there is a change here too, and the bandwidth between the L1 data cache and the L2 cache has also been doubled (64B/cycle), the same as the bandwidth between the load/store units and the L1 cache.

The L2 cache is private to each core and should have a latency of 14 cycles. In retrospect, it may make sense why AMD made the move to increase the cache in Zen 4, even though that core was supposed to be just a less radical “evolution” of Zen 3 – presumably the goal was to get this task out of the way and done before it was the turn of all the big changes in Zen 5, so as not to bite off more than they could proverbially chew.

AMD also states that Zen 5 has improved prefetching, which, similarly to branch prediction, is a complex area of continuous optimization where adjustments and tuning are made in each new generation, so it would be stranger if the core didn’t have improved prefetching. Prefetching is an important component of the IPC of all processors. Zen 5 has a new 2D stride prefetcher and an improved ability to recognize access patterns and prefetch data based on them.
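For illustration, the fixed-stride walk below is the classic pattern a stride prefetcher locks onto after a few iterations; the explicit __builtin_prefetch call is only shown as the software analogue of what the hardware does on its own (the prefetch distance and locality hint are illustrative, and the function is hypothetical).

```c
#include <stddef.h>

/* Walk one column of a row-major matrix: a constant stride between
 * consecutive accesses, which a hardware stride prefetcher detects. */
long sum_column(const long *matrix, size_t rows, size_t stride)
{
    long sum = 0;
    for (size_t r = 0; r < rows; r++) {
        /* Software prefetch a few rows ahead (read, low locality);
         * on a core with a good stride prefetcher this is redundant. */
        __builtin_prefetch(&matrix[(r + 8) * stride], 0, 1);
        sum += matrix[r * stride];
    }
    return sum;
}
```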

(Author: AMD, via: ComputerBase)

The L3 cache will have different capacities in the various Zen 5 processors. AMD says the core now supports up to 320 in-flight cache misses, meaning that the CPU cores can have up to 320 concurrent data requests active, which improves memory performance.

Static versus dynamic resource partitioning between threads in SMT

By the way, we mentioned SMT earlier. In the interview, Mike Clark described which parts of the core are split in which way when the core has both threads active. A large part of the resources has dynamic (competitive) partitioning, where the allocation between the two threads can change as needed, though usually some resources are always reserved so that one thread never consumes all of them. This competitive partitioning applies to the caches, the TLBs, and also to the execution units themselves (the pipelines in the SIMD unit or the ALUs).

However, some parts are partitioned statically, i.e. fixed to half and half when the core processes two threads. This is true for the RoB, for example, where competitive partitioning is supposedly extremely hard to implement, so it has not been done. Other parts with static partitioning of resources (i.e. capacity or queue entries, for example) are the store and retire queues, and also the micro-op queue preceding the dispatch phase. However, the moment the second thread stops being used, these resources are again fully allocated to the single remaining thread, so SMT does not degrade performance in a single-threaded workload.

Zen 5 and Zen 5c launched simultaneously

As with Zen 4c, AMD says the efficient Zen 5c cores should have the same IPC (i.e., the same performance per 1 MHz) as the big Zen 5 cores and support all the instructions and technologies of the big cores. They differ only in optimizations that lead to a smaller area on the chip (e.g. less use of custom logic macros, more density due to the core being divided into fewer partitions). These optimizations reduce the maximum clock speed of the core at the same time.

(Author: AMD, via: ComputerBase)

In the Strix Point chip, the Zen 5c cores are about 25% smaller than the large Zen 5 cores, according to AMD. So they are not as efficient in terms of silicon area as Intel’s E-Cores. This may, however, depend on the Strix die design and not be true universally.

But the company says the dense design is aimed at improving power efficiency as much as at reducing area. Zen 5c is optimized for scalability, while the classic big Zen 5 is optimized for maximum single-threaded performance. AMD says that the smaller die area of the Zen 5c core itself is one of the factors improving the power efficiency of the dense version of the architecture, since the signal paths within the core are shorter.

IPC increased by 16%

AMD already gave an indicative figure for the improvement in performance per unit of clock speed, the so-called IPC, back at Computex in June. It is supposed to improve by 16%, a value averaged over a certain selection of applications. It should be noted that this characteristic varies from application to application, so don’t take “IPC” or this “+16%” as some fixed property of the Zen 5 core.

The biggest improvements can be expected in tasks that can take advantage of the 2× higher compute throughput of the 512-bit AVX-512 instructions; AMD gives as examples the AES-XTS subtest in Geekbench 5.4 (+35%) and unnamed machine learning tasks (+32%).

Where does the IPC come from?

By the way, AMD has shown an interesting chart representing how much the various previously described core changes contribute to the resulting 16% improvement in IPC. The biggest impact seems to come from the added ALUs, with the SIMD expansion probably counted under the same item, shown as the lighter grey in the graph.

Right behind it is the new decoder design with two clusters and the related redesign of the uOP cache (the darker ochre). While the uOP cache has reduced capacity, its ability to deliver up to 12 instructions per cycle (and from two places in the code simultaneously, to meet the needs of the two decoder clusters) clearly has a decent performance benefit. Note: it is possible, however, that this reflects multithreaded applications, not just single-threaded ones.

A very big (practically equal?) role seems to be played by the doubling of the L1 cache data bandwidth (dark grey), which is a bit surprising. Judging by the size of the benefit, it seems this doesn’t just help code exploiting AVX-512 operations on the SIMD unit, which is the main beneficiary you’d think of. Only fourth in line (the bright ochre) is the impact of improved fetch and branch prediction, but it’s still a pretty big impact.

Really the biggest change in the core since the first Zen? It definitely is

As you could see, the extent of the changes in the core goes really deep. For example, the reorganization of the ALU and FPU schedulers shows that the engineers have been at work on things that are not as visible as the long-awaited 50% increase in the number of ALUs or the 2× compute throughput of SIMD instructions (and cache bandwidth).

The frontend overhaul (the dual decode clusters and the associated changes to the instruction cache and fetch) may end up being the most far-reaching thing in Zen 5, as development in this area will probably continue into the next generations and shape their design, and it’s probably the biggest conceptual change in AMD’s CPU cores we’ve seen since Zen 1 (Intel fans may be proud that clustered decoding first appeared in Intel’s Tremont core, but the idea may be older).

We certainly can’t consider Zen 5 to be some kind of “mere refresh” (you can actually find such comments on the internet). On the technical level, not only is it a new core, it is definitely a “major new” architecture too.

So why is the increase in IPC only as large as it is, you may ask. The question is rather whether the number given is actually small, which is probably what a lot of people think. As we wrote at the beginning, perhaps the priority was sometimes mainly to reach those “milestones” like wider datapaths, more ALUs and a redesigned frontend, rather than to squeeze all the potential out of them right away. That rarely happens anyway; one need only point out how relatively low the IPC of the Zen 1 core started and how far Zen 4 got with the same four-ALU base.

Further performance leaps are therefore likely to continue in future generations, and according to architect Mike Clark, many improvements in future cores will partly be things whose foundations were already laid in Zen 5.

It’s possible IPC increases will keep slowing down from now on

Beyond that, though, it’s worth pointing out that increasing performance per 1 MHz of a CPU core is not something that can be done forever at the same rate as scaling a GPU up to more and more blocks and shaders: GPUs process highly parallel tasks and thus can scale up somewhat easily. Scaling up performance per 1 MHz of clock speed is an area where the law of diminishing returns almost certainly applies, and if you’re already at the cutting edge, further progress becomes harder and harder. This is probably the same reason why rival Intel also claims “only” a similar IPC improvement for its own big CPU core upgrade (the Lion Cove architecture promises +14%; don’t compare this number directly to Zen 5’s, as it is measured in a different way on different applications, so Lion Cove may not actually have a smaller IPC increase than Zen 5).

Comparing structures and throughputs of various Zen 5 blocks against Zen 4 (Author: AMD, via: ComputerBase)

It’s true that the absolute IPC of Apple’s cores is still quite a bit higher, but in return they reach significantly lower clock speeds, and the two things are related. Had its design aimed for higher clock speeds, Apple’s core development would have arrived at a lower IPC, and similarly, AMD’s cores would have had better IPC if they kept being designed for lower clock speeds. Future development will likely see both the clock speeds and the IPC of the two competitors converge.

The law of diminishing returns, or the fact that further increases in IPC are becoming more and more difficult, is also evident at Apple. Although the company is rightly considered the “king” of high processor IPC, if you look at its incremental growth recently, there is a significant slowdown there as well. Measurements vary, but it looks like Apple’s core IPC has only increased by about a tenth since the M1 processors of 2020, if we compare them to the now-current M4 processor (much more performance was gained by increasing clock speeds). This is despite the fact that both the M3 processor (which was apparently launched more than a year late compared to the original plans) and the M4 processor have new core architectures (which didn’t seem to be the case with the M4 at first). So for that roughly 10% improvement in IPC, Apple actually needed the combined gains of these two new architectures.

We do not know for sure yet whether this is a temporary slowdown and whether new techniques and approaches in future architectures might bring (for a time…) higher generational advances. For now, it appears likely that we are underestimating the significance of Zen 5’s 16% gain (and the same can be said in Intel’s Lion Cove’s defense). It’s true that we’re still waiting for independent tests to verify or revise this number a bit, but all in all, it wouldn’t be bad progress.

Sources: Chips and Cheese, AnandTech, HardwareLuxx

English translation and edit by Jozef Dudáš