Zen 4 architecture: chip parameters and IPC of AMD’s new core

Zen 4's IPC is actually better than what AMD originally said

AMD has formally launched the Ryzen 7000 series (although availability is still four weeks away) and confirmed specs and pricing. The company has also disclosed a considerable number of technical details, and more have now leaked from unofficial sources. So let’s take a deeper look at the Zen 4 architecture, the processors’ 5nm chiplets and finally the core’s IPC, which has actually improved more than the previously promised +8 to 10 % over Zen 3.

5nm chiplet Durango

As has been known for some time, Zen 4 is manufactured on TSMC’s 5nm process (N5). More specifically, this technology is used to produce the CPU chiplets, which are relatively small silicon dies containing the Zen 4 cores and their L3 cache. These are then connected to a 6nm central IO die. Up until this point, we’ve known the entire Ryzen 7000 under the codename “Raphael”. But the CPU chiplet itself reportedly has its own codename at AMD (Durango), so you may see that referenced in the future. The Zen 4 CPU core architecture, in turn, has the codename “Persephone”.

The Durango chiplet is roughly 15 % smaller than the Zen 3 core chiplet in Ryzen 5000, occupying an area of just 71 mm², according to AMD. But it contains 6.57 billion transistors, 58 % more than the Zen 3 generation chiplet (which was claimed to have 4.15 billion transistors in 83.74 mm²). The density of transistors in the 5nm part of the processors is therefore on average over 90 million per mm². AMD is said to use 15 metal layers that are optimized for both high frequencies and the ability to achieve high transistor density.
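The density and growth figures above follow directly from the quoted transistor counts and areas. A minimal sketch of the arithmetic, using only the numbers given in this article (the variable and function names are illustrative):

```python
# Figures quoted in the article for the two CPU chiplets.
ZEN4_TRANSISTORS = 6.57e9
ZEN4_AREA_MM2 = 71.0
ZEN3_TRANSISTORS = 4.15e9
ZEN3_AREA_MM2 = 83.74


def density_mtr_per_mm2(transistors: float, area_mm2: float) -> float:
    """Average density in millions of transistors per square millimetre."""
    return transistors / area_mm2 / 1e6


zen4_density = density_mtr_per_mm2(ZEN4_TRANSISTORS, ZEN4_AREA_MM2)
zen3_density = density_mtr_per_mm2(ZEN3_TRANSISTORS, ZEN3_AREA_MM2)

# Relative changes between the two generations, in percent.
transistor_growth_pct = (ZEN4_TRANSISTORS / ZEN3_TRANSISTORS - 1) * 100
area_shrink_pct = (1 - ZEN4_AREA_MM2 / ZEN3_AREA_MM2) * 100

print(f"Zen 4 chiplet density: {zen4_density:.1f} MTr/mm²")
print(f"Zen 3 chiplet density: {zen3_density:.1f} MTr/mm²")
print(f"Transistor count growth: +{transistor_growth_pct:.0f} %")
print(f"Area reduction: -{area_shrink_pct:.0f} %")
```

Note that these are whole-die averages. Logic, cache and analog blocks have very different local densities, so the figure says little about any single structure on the chiplet.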

The increase in the number of transistors is very high when you consider that the chiplet still has the same number of cores (eight) and the same L3 cache capacity (32 MB).

AMD does not consider Zen 4 to be a “big”, all-new architecture like Zen 3 was – instead, Zen 4 is an “incremental update” to Zen 3, an evolution and enhancement. The next new architecture introduction isn’t supposed to be until Zen 5, which is reflected in the CPUID – Zen 5 will reportedly be Family 1Ah, while Zen 4 has the same Family 19h ID as Zen 3. The biggest change is in the processor frontend (Mark Papermaster even talked about a “new frontend design”), which is designed to ensure that the computing resources already present in the backend are utilised as efficiently as possible and the time they remain unused is minimised.

AVX-512 on 256-bit units, larger L2 cache

A significant number of transistors was likely used to support AVX-512 instructions and their 512-bit registers, which Zen 4 is the first AMD architecture to support. It is implemented by double-pumping, basically using two passes through the existing 256-bit SIMD units that the Zen 3 core already had. So you shouldn’t expect the dramatic performance increases that would be possible with native 512-bit units. However, performance may increase where the new features of the AVX-512 instruction set are utilized, as this extension has numerous advantages that make it easier to write and optimize many algorithms. According to AMD, AVX-512 execution on top of 256-bit units has one advantage: there will be no underclocking of cores when executing these instructions, which is something that has plagued Intel’s implementations a lot.
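To make the idea of double-pumping concrete, here is a purely conceptual sketch in Python. It is not a description of how AMD’s hardware is wired. It only illustrates the principle named above: one 512-bit operation (here an addition over sixteen 32-bit lanes) is carried out as two passes through a unit that handles eight lanes at a time. The function names and the lane layout are assumptions made for illustration.

```python
LANES_512 = 16  # a 512-bit register viewed as 16 x 32-bit lanes
LANES_256 = 8   # a 256-bit unit handles 8 x 32-bit lanes per pass
MASK_32 = 0xFFFFFFFF


def add_256(a, b):
    """Stand-in for one pass through a 256-bit SIMD adder (8 lanes)."""
    assert len(a) == len(b) == LANES_256
    # Each lane wraps around at 32 bits, as an integer SIMD add would.
    return [(x + y) & MASK_32 for x, y in zip(a, b)]


def add_512_double_pumped(a, b):
    """Perform a 512-bit add as two passes through the 256-bit unit.

    Returns the 16-lane result and the number of passes used.
    """
    assert len(a) == len(b) == LANES_512
    low = add_256(a[:LANES_256], b[:LANES_256])    # first pass: lower half
    high = add_256(a[LANES_256:], b[LANES_256:])   # second pass: upper half
    return low + high, 2


vec_a = list(range(LANES_512))
vec_b = [10] * LANES_512
result, passes = add_512_double_pumped(vec_a, vec_b)
print(result, "in", passes, "passes")
```

The result is identical to what a native 512-bit unit would produce. Only the throughput differs, which is why the gains are expected to come mainly from the instruction set’s features rather than from raw vector width.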

Snapshot of Zen 4 core with L2 cache, area comparison with Golden Cove core of Intel Alder Lake processors (source: AMD, via: AnandTech)

The L2 cache in the cores has increased from 512 kB to 1MB, which also increases the occupied area a bit, but the cores are still smaller overall than Zen 3 on 7nm thanks to the 5nm process. The area including L2 cache is 3.84mm², some 18 % smaller compared to about 4.11mm² area of Zen 3 on the 7nm process also including its 512 kB L2 cache.

Increased latency

The increase in L2 cache capacity reportedly (according to Retired Engineer) required an increase in latency by two cycles, from 12 to 14. Part of this may have been a trade-off made for the ability to hit higher clock speeds: by comparison, the L2 cache increase in Intel’s Raptor Lake processors (from 1.25 to 2.0 MB) required only one extra cycle of latency.

Similarly, Zen 4 also reportedly has longer latency on the L3 cache, which is shared by eight cores within the CPU chiplet. Here the increase was from 46 cycles to 50 cycles. In both cases, however, it is possible that the actual latency in nanoseconds is the same or lower because Zen 4 runs at much higher clock speeds where each cycle is shorter.
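The point about cycles versus nanoseconds can be checked with a quick conversion. The sketch below assumes, purely for illustration, that both caches run at the core clock and that the cores sit at the maximum boost clocks quoted in this article (4.9 GHz for the Ryzen 9 5950X, 5.7 GHz for the fastest Ryzen 7000 model). Real sustained clocks will be lower and will differ between workloads.

```python
def cycles_to_ns(cycles: int, clock_ghz: float) -> float:
    """Convert a latency in clock cycles to nanoseconds at a given clock."""
    # One cycle at f GHz lasts 1/f nanoseconds.
    return cycles / clock_ghz


# Assumed clocks: the peak boost figures quoted in the article.
ZEN3_CLOCK_GHZ = 4.9
ZEN4_CLOCK_GHZ = 5.7

# Reported latencies in cycles (unofficial figures from the article).
latencies = {
    "L2": {"zen3": 12, "zen4": 14},
    "L3": {"zen3": 46, "zen4": 50},
}

for cache, cyc in latencies.items():
    zen3_ns = cycles_to_ns(cyc["zen3"], ZEN3_CLOCK_GHZ)
    zen4_ns = cycles_to_ns(cyc["zen4"], ZEN4_CLOCK_GHZ)
    print(f"{cache}: Zen 3 {zen3_ns:.2f} ns, Zen 4 {zen4_ns:.2f} ns")
```

Under these assumptions the L2 latency comes out practically unchanged in absolute time and the L3 latency slightly lower, which is consistent with the caveat above.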

Two-thirds larger uOP cache

A relatively large transistor investment seems to have gone into the uOP cache (microOp cache), which is used to cache already decoded instructions so that the processor can process them again without having to use the power-hungry instruction decoders. AMD has had a very large uOP cache with 4,000 entries since the Zen 2 architecture (Zen 3 did not change this capacity). Zen 4 increases this capacity to over 6,000 entries.

According to data revealed by the Retired Engineer Twitter account, the capacity is even said to be 6,750 entries, which would be an increase of +68.75 %. The higher capacity of the uOP cache should mean that a larger percentage of executed instructions will be covered from its contents, or that larger loops will fit in the cache.

Zen 4 seems to have an unchanged configuration of the execution units themselves (still four ALUs in the core), but instead of expanding the “width” and compute resources in the backend, AMD has focused on making the frontend of the processor better able to utilise the potential of existing resources – and thus achieve higher performance. The larger uOP cache seems to have exactly this intention.

ROB has grown to 320 entries

AMD has also reportedly increased the size of the Reorder Buffer (ROB), a very important queue that tracks the operations in flight while the processor executes them out of their original order. It is therefore the heart of an out-of-order architecture. Zen 3 had a curiously small ROB with only 256 entries, which is on par with Intel’s small Gracemont core, while the large Golden Cove core uses a ROB 512 entries deep. Zen 4 is said to have finally enlarged this structure significantly, but it’s still an increase to a relatively modest 320 entries (less than Intel Sunny Cove/Ice Lake, where it was already 352 entries).

The larger the ROB, the larger the “window” of the code being processed the processor sees at a time and the more efficiently it can rearrange instructions and optimize the dispatching of instructions to the units so that the total number of cycles required for processing is minimised. So deepening of the ROB is precisely about making the most of the computational units available to the backend of the core. Thus, the larger ROB of Zen 4 should itself improve the IPC (performance per cycle) somewhat. However, this is still unofficial information, so take it with a grain of salt until it is directly confirmed.
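The effect of the window size can be illustrated with a toy model. The sketch below is a heavily simplified out-of-order core, not a model of Zen 4. Instructions enter a reorder buffer in program order, issue out of order once their inputs are ready, and retire strictly in order. The synthetic workload (one 200-cycle load followed by 299 independent single-cycle operations, repeated three times), the issue width of four and all names are assumptions chosen only to make the effect visible.

```python
from collections import deque


def simulate(program, rob_size, issue_width=4):
    """Return the number of cycles a toy out-of-order core needs.

    program: list of (latency, deps) tuples, where deps holds indices of
    earlier instructions whose results are required.
    """
    n = len(program)
    done_at = [None] * n   # cycle at which each result becomes available
    rob = deque()          # in-flight instruction indices, program order
    next_fetch = 0
    retired = 0
    cycle = 0
    while retired < n:
        # 1. Retire in order: only finished entries at the head may leave.
        while rob and done_at[rob[0]] is not None and done_at[rob[0]] <= cycle:
            rob.popleft()
            retired += 1
        # 2. Refill the ROB with the next instructions in program order.
        while next_fetch < n and len(rob) < rob_size:
            rob.append(next_fetch)
            next_fetch += 1
        # 3. Issue up to issue_width ready instructions, oldest first.
        issued = 0
        for i in rob:
            if issued == issue_width:
                break
            if done_at[i] is not None:
                continue  # already issued earlier
            latency, deps = program[i]
            if all(done_at[d] is not None and done_at[d] <= cycle for d in deps):
                done_at[i] = cycle + latency
                issued += 1
        cycle += 1
    return cycle


# Hypothetical workload: a long-latency load (think cache miss) followed
# by independent one-cycle operations, repeated three times.
block = [(200, ())] + [(1, ())] * 299
program = block * 3

cycles_256 = simulate(program, rob_size=256)
cycles_320 = simulate(program, rob_size=320)
print(f"ROB 256: {cycles_256} cycles, ROB 320: {cycles_320} cycles")
```

In this contrived case the 256-entry window fills up behind the first stalled load before the next load is even visible, while the 320-entry window reaches far enough to start the next load early and overlap the two misses. Real code benefits far less uniformly, which is why the article expects only a modest IPC contribution.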

Retired Engineer mentions one more change in Zen 4: the L1 BTB has increased from 1k to 1.5k entries. This should be part of the improvements achieved in the accuracy of the branch predictors.

IPC better by 13% in the end?

AMD previously announced that it expects about an 8 % to 10 % increase in core performance at the same clock speed (IPC) compared to Zen 3. However, back in May and June, AMD was probably not so sure about the final performance, so they probably shared more conservative numbers. It was supposed to be a number based on performance in SPEC CPU, Geekbench, Cinebench, for example, but it’s possible the used scores were measured on firmware that was still not fully optimized at that time, with preliminary or unoptimised cache, prefetcher and memory controller settings and so on.

Zen 4’s IPC is reportedly 13 percent better on average than Zen 3 (source: AMD, via: AnandTech)

AMD seems to be more optimistic now, as the company stated that Zen 4 has surpassed these previously promised performance levels and in the end has an IPC 13 % higher than Zen 3, i.e. 13 % better performance per 1 MHz.

This +13 % is the geometric mean of 22 different programs, the composition of which you can see on the slide below. As you probably know, the IPC of a processor is not constant and varies from program to program depending on the nature of the code, which is why different programs have different performance changes per 1 MHz.

Zen 4’s IPC increase in application software ranges from +4 % (Adobe Lightroom) to +15 % (V-Ray), but there are exceptions. wPrime 1024M performs up to 39 % better at the same clock speed, which could be due to the AVX-512 instructions. Conversely, the single-threaded CPU-Z test showed an IPC improvement of just one percent, so Ryzen 7000 scores in this test will look relatively poor given the typical performance of these CPUs.

IPC increase between Zen 3 and Zen 4 in various programs and games (source: AMD, via: AnandTech)

AMD has also included performance in games in the IPC calculation (which is a bit controversial, as games depend heavily on memory rather than just the core, but not entirely without justification). The variance in games is also considerable. For example, in Fortnite, IPC is only 3 % better, but in Watch Dogs: Legion it’s 24 % better. And again there is a significant outlier, with the Dolphin Benchmark (Dolphin is an emulator of Nintendo’s Wii and GameCube consoles) measuring a 32 % higher IPC. But these outliers should perhaps not affect the overall calculated IPC unduly, because AMD made sure to use the geometric mean to calculate the average.
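The difference the choice of mean makes can be shown with the handful of per-program results named in the text. This is only a subset of AMD’s 22 programs, so the averages printed below will not reproduce the official +13 %. The sketch merely shows that the geometric mean is pulled up less by large outliers than the arithmetic mean.

```python
from statistics import geometric_mean, mean

# IPC ratios (Zen 4 / Zen 3) for the results mentioned in the article.
# Only a subset of AMD's 22 programs, used here for illustration.
ipc_ratios = {
    "Adobe Lightroom": 1.04,
    "V-Ray": 1.15,
    "wPrime 1024M": 1.39,
    "CPU-Z (single-thread)": 1.01,
    "Fortnite": 1.03,
    "Watch Dogs: Legion": 1.24,
    "Dolphin Benchmark": 1.32,
}

ratios = list(ipc_ratios.values())
arith = mean(ratios)            # sensitive to large outliers
geo = geometric_mean(ratios)    # the usual choice for averaging ratios

print(f"Arithmetic mean: +{(arith - 1) * 100:.1f} %")
print(f"Geometric mean:  +{(geo - 1) * 100:.1f} %")
```

For any set of positive ratios that are not all equal, the geometric mean is strictly lower than the arithmetic mean, so a few very strong results inflate it less.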

These figures were measured on eight-core processors, the Ryzen 7 5800X and Ryzen 7 7700X, both locked at 4.0 GHz. They were using DDR4-3600 CL16 and DDR5-6000 CL30 memory, respectively (i.e. overclocked beyond the officially supported frequencies using XMP and EXPO profiles), Radeon RX 6950 XT graphics, and running Windows 11.

The sources of this IPC improvement are mostly just the modifications to the frontend – these, together with the effect of a better branch predictor, account (according to AMD) for almost 60 % of the 13 % improvement. These are followed by changes to the load/store pipeline, but AMD has not yet given any details about these anywhere (they should have a bigger impact than the branch predictor in isolation, which is itself the third biggest factor). The impact of the 1 MB L2 cache is only the fifth factor in order of significance (unspecified changes in execution units in the backend are ahead of it in fourth place).

Factors behind the IPC increase in Zen 4 (source: AMD, via: AnandTech)

High clock speeds

However, IPC is not the only performance enhancement strategy for Zen 4. This architecture (and probably partly the 5nm process) has massively increased the achieved clock speeds. Single-core boost clocks reach up to 5.7 GHz, an 800 MHz increase over the 7nm Ryzen 5000 processors with Zen 3 architecture (the Ryzen 9 5950X officially has a boost of 4.9 GHz). So the resulting increase in performance per core is quite significant. AMD’s presentation even promises that in the best case, the performance increase in single-threaded applications can be up to +29 %.

Ryzen 9 7950X to deliver up to +29 % better single-threaded performance than 5950X (source: AMD, via: AnandTech)

This specific increase should be the improvement achieved by the Ryzen 9 7950X with DDR5-6000 CL30 against the Ryzen 9 5950X with DDR4-3600 CL16 in the Geekbench 5.4.x single-threaded benchmark test. This test gives a relatively low score on the Ryzen 5000 relative to overall performance of their cores, which probably contributes to the fact that you can see such a big single-threaded improvement with Zen 4. So in other programs we may see less of an improvement.
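As a rough sanity check on the +29 % figure, the average IPC gain can be multiplied by the ratio of the peak boost clocks. This is a naive back-of-the-envelope estimate. It assumes the 13 % average IPC gain applies to the program in question and that both processors actually hold their maximum boost clock throughout the run, neither of which is guaranteed.

```python
# Figures quoted in the article.
IPC_GAIN = 1.13          # Zen 4 vs Zen 3, geometric mean per AMD
ZEN3_BOOST_GHZ = 4.9     # Ryzen 9 5950X peak boost
ZEN4_BOOST_GHZ = 5.7     # Ryzen 9 7950X peak boost

# Performance scales roughly with IPC x clock when nothing else limits it.
clock_gain = ZEN4_BOOST_GHZ / ZEN3_BOOST_GHZ
naive_speedup = IPC_GAIN * clock_gain

print(f"Clock gain: +{(clock_gain - 1) * 100:.1f} %")
print(f"Naive single-thread estimate: +{(naive_speedup - 1) * 100:.1f} %")
```

The estimate lands slightly above AMD’s quoted best case, which fits the expectation that real workloads do not scale perfectly with clock speed and that per-program IPC gains differ from the average.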

By the way, AMD listed single-threaded Geekbench 5 performance data for all four models of the Ryzen 7000 series in the presentation. The company showed quite a few different benchmarks, but it’s probably better to wait for independent reviews to come out, ideally in larger numbers, to get an idea of performance against older CPUs and current competing Intel Core (Alder Lake) processors. The scores from Geekbench 5 are interesting, however.

The Ryzen 9 7950X reportedly gets up to 2275 points in the single-threaded test, which would be well above the capabilities of both the Core i9-12900K and Core i9-12900KS. The 7900X, 7700X, and 7600X aren’t expected to be too far behind (2250, 2225, 2175 points, respectively). However, this is when tested with fast memory (again, DDR5-6000 CL30). So Zen 4 will be very strong in this test, which is also very popular, for example, for comparison with Apple processors (the M2 processor has a single-threaded score of 1919 points).

AMD releases official measured results for Geekbench 5.4.x ST benchmark (source: AMD, via: AnandTech)

In multithreaded applications, the intergenerational increases are likely to vary from model to model. For the top model Ryzen 9 7950X with 16 cores, AMD quotes performance increases for all-thread utilization ranging from 32–37 % (Corona Render, Autodesk Arnold Render) to 45–48 % (V-Ray, POV-Ray). In games, increases can reportedly range from as low as +6 % to as high as +32–35 %. Again, it will be better to wait for results from independent reviews, as these official numbers may be cherry-picked.

Official AMD Ryzen 9 7950X benchmarks (source: AMD, via: AnandTech)

High power and heat density in a small area

High clock speeds will also bring increased power draw compared to Zen 3. The increase is significant, and most of that power will probably be consumed in the CPU chiplets, so the resulting heat will have to be removed via a small 71 mm² contact area. The heat density per 1 mm² of CPU chiplet area seems to have increased considerably.

This means one thing – Ryzen 7000 won’t be easy to cool down and these processors will probably often reach high temperatures. CPU temperature under load depends not only on power draw itself, but also on the rate at which heat can be dissipated into the heat spreader. And it is the small contact area for the heat transfer that will probably be the most limiting factor with the Ryzen 7000.


So although Intel’s competing Raptor Lake processors seem to have worse absolute power draw than Zen 4, it will probably be the Ryzen 7000s that will reach worse temperatures and be harder to keep cool. The Raptor Lake die apparently has a large area of perhaps as much as 257 mm². So the ratio of watts per unit of area will be more favourable despite the higher power draw. This is something that won’t help you against an increased electricity bill, but it will make it easier for your cooler to keep the temperature of these CPUs lower (as long as its total thermal capacity is not overwhelmed by the wattage).
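The watts-per-area argument can be sketched numerically. The die areas below come from this article. The power figures are hypothetical placeholders chosen only to show the shape of the calculation, not measured or official values. The sketch also assumes a Ryzen model with two CPU chiplets and treats heat as evenly spread over the silicon, which real dies with hot spots do not do.

```python
def power_density(watts: float, area_mm2: float) -> float:
    """Average power density in watts per square millimetre."""
    return watts / area_mm2


# Die areas quoted in the article.
ZEN4_CHIPLET_AREA_MM2 = 71.0
RAPTOR_LAKE_DIE_AREA_MM2 = 257.0

# HYPOTHETICAL power figures, for illustration only.
ryzen_chiplet_watts = 200.0   # assumed total across two CPU chiplets
intel_die_watts = 250.0       # assumed total for the monolithic die

ryzen_density = power_density(ryzen_chiplet_watts, 2 * ZEN4_CHIPLET_AREA_MM2)
intel_density = power_density(intel_die_watts, RAPTOR_LAKE_DIE_AREA_MM2)

print(f"Ryzen chiplets:  {ryzen_density:.2f} W/mm²")
print(f"Raptor Lake die: {intel_density:.2f} W/mm²")
```

With these placeholder inputs the chiplets end up with a clearly higher density even though their assumed total power is lower, which is the situation the paragraph above describes.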

Ryzen 7000 models and pricing (source: AMD, via: AnandTech)

Sources: AMD, AnandTech, Angstronomics (1, 2, 3), Retired Engineer

English translation and edit by Jozef Dudáš
