
Intel Alder Lake/Golden Cove CPU core unveiled (µarch analysis)

Fruits of a wider core: 19% better performance per MHz

Intel’s Alder Lake CPUs are poised to be the biggest hardware event this year. Intel has unveiled the core architecture of these CPUs and we have analysed the details and new improvements inside. There’s promise of huge performance and one of the biggest architectural leaps in x86 processors, for the first time with six parallel decoders and further IPC increases, showing Intel taking the same path as Apple’s highly effective cores.

It has long been known that Alder Lake is moving to the “big.LITTLE” concept, combining cores of higher performance with cores of lower (but more energy-frugal) performance. Intel has released information about both the so-called “big core” and the “little core”; the latter will boost multi-thread performance and also serve as a more energy-efficient core for running less demanding background tasks.

But first, let’s talk about the big core. Intel now officially calls it “Performance Core”, or P-Core (while it has been codenamed Golden Cove in the past), and its role is to achieve the highest possible single-threaded performance, a parameter that determines the performance of many common applications and also system responsiveness.

Golden Cove could finally satisfy those complaining that Intel allegedly hasn’t introduced anything truly new in a long time and that all its cores since 2011’s Sandy Bridge are still more or less variations of the same design. Golden Cove is an architecture that is admittedly still based on that DNA (but so are all architectures today); however, its basic blocks have been redesigned in all substantial parameters, and the result is a much larger and higher-performing core. It significantly widens and deepens the out-of-order engine to reach higher IPC, and the leap in performance should be very large. According to Intel, it even represents the largest upgrade of the Core microarchitecture in the last decade, which would put Golden Cove (and Alder Lake processors) in a similar position as the aforementioned Sandy Bridge architecture.

Golden Cove follows the concepts of the previous “big” 10nm Sunny Cove/Willow Cove core and is functionally similar, including HT support, i.e. simultaneous processing of two threads on one core. This allows a core to improve performance in multithreaded code by making better use of its computing resources.

The Golden Cove core should also achieve similar if not higher clocks than what we saw with the Rocket Lake or Tiger Lake chips, because it uses the more advanced Intel 7 manufacturing process (formerly known as 10nm Enhanced SuperFin).

Read more: Intel unveils plans of manufacturing nodes: 7nm, 4nm, 3nm, 20A and 18A technologies—with a tad of renumbering

Overall, Golden Cove increases and enhances most resources inside the core, and the architecture can somewhat be compared to the concept that Apple is pursuing, including an increased number of ALUs. However, the ALUs come into the picture more towards the end of an instruction’s path through the processor core, so we will instead start the description of the architecture in the so-called frontend, where the route of instructions through the core begins.

A core busting myths about the fundamental limits of the x86 architecture?

The Frontend is where the most fundamental, or perhaps the most symbolic, change took place: Golden Cove increases the core “width”, meaning how many instructions can be processed in parallel in one cycle, right here at the beginning. Since the Conroe/Core 2 architecture, Intel cores have had four decoders, so they could process up to four instructions per clock in the decoding stage. More than ten years ago, Sandy Bridge added the μOP cache that stores decoded instructions, which allows the decoders to be bypassed if the core finds the necessary instructions in this cache (this increases performance, but the second purpose is to reduce power consumption). Otherwise, however, the limit of four decoders has remained, until now.

Intel has expanded the number of decoders to six in Golden Cove, so the core is now “6-wide” and will be able to decode up to six instructions per cycle. Only Apple has wider decode today: the Firestorm core in the M1 and A14 chips is already 8-wide, while AMD’s Zen 3 is still 4-wide. Golden Cove’s ability to decode six instructions will probably have some limitations Intel has not revealed yet, where so-called complex instructions have reduced bandwidth because they can be processed only by some of the decoders (the “complex” decoders), while the rest (the “simple” decoders) can process only more common instructions. Complex instructions are typically those that produce more than one μOP in further processing.
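To illustrate the idea, here is a toy Python model of such an asymmetric decode stage. The 1-complex + 5-simple split is purely an assumed example for illustration; Intel has not disclosed Golden Cove’s actual decoder composition, and real hardware assigns instructions to decoder slots positionally.

```python
# Toy model of a decode stage with "simple" and "complex" decoders.
# ASSUMPTION: 5 simple + 1 complex decoder; real split is undisclosed.

def decoded_per_cycle(instr_stream, n_simple=5, n_complex=1):
    """Count instructions decoded in one cycle from the head of the stream.

    instr_stream: list of 'S' (simple, one µOP) or 'C' (complex, multi-µOP).
    Decoding stops when no suitable decoder slot remains, because
    instructions are consumed strictly in program order.
    """
    simple_left, complex_left = n_simple, n_complex
    decoded = 0
    for instr in instr_stream:
        if instr == 'S':
            # a complex decoder can also handle a simple instruction
            if simple_left > 0:
                simple_left -= 1
            elif complex_left > 0:
                complex_left -= 1
            else:
                break
        else:
            # a complex instruction needs a complex decoder
            if complex_left == 0:
                break
            complex_left -= 1
        decoded += 1
    return decoded
```

With a stream of simple instructions the model decodes six per cycle, but a second complex instruction in the same cycle stalls decode early, which is the kind of unrevealed limitation the text refers to.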

The price for widening the decoders is probably an extra pipeline stage, which likely also helped enable the widening (because decoding was divided into more pipelined steps). This can be seen in Intel’s disclosure that the branch misprediction penalty has increased from 16 to 17 cycles.
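The cost of that one extra cycle of misprediction penalty can be estimated with a quick back-of-the-envelope calculation. The branch frequency and predictor miss rate below are assumed, illustrative values, not figures from Intel:

```python
# Rough cost of the misprediction penalty growing from 16 to 17 cycles.
# ASSUMPTIONS: ~20% of instructions are branches, ~2% of them mispredict.

def mispredict_cycles_per_instr(penalty_cycles, branch_freq=0.20, miss_rate=0.02):
    """Average pipeline-flush cycles added per executed instruction."""
    return branch_freq * miss_rate * penalty_cycles

old = mispredict_cycles_per_instr(16)   # previous cores
new = mispredict_cycles_per_instr(17)   # Golden Cove
# the extra stage adds only ~0.004 cycles per instruction under these
# assumptions, tiny next to the throughput gained from two extra decoders
```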

The x86 architecture will not allow more than four decoders…

This is also interesting because, in the context of Apple’s ARM processors with very wide cores (but low clock rates), we have heard many opinions recently that a similar architecture is possible only with ARM and not with x86 instruction set processors, because x86 instructions are variable-length (and raise some other complications). The most confident of those opinions went as far as alleging that more than four decoders is close to impossible in x86.

This topic is probably so popular also because where x86 cores differ from ARM is exactly in the frontend, up to and including the decoding phase; after that the differences fade, as the decoded μOPs processed in the rest of the pipeline can be quite similar between ARM and x86. The “x86 limitation” in the decoding phase was therefore taken as a core argument for the notion that the x86 architecture has no future and that a move to ARM is needed. However, Intel might now prove this point wrong with the six-wide decode in the Golden Cove architecture. We still have to wait until we see the result and can be sure the x86 complications will not turn out to be relevant here after all; for example, we will have to see whether the 6-wide decode leads to disproportionate power consumption.

Along with strengthening the decoding stage, Intel has extended the previous stage of processing, the so-called Fetch stage, which reads code (i.e., the instruction stream) from the L1 instruction cache of the processor and passes it to the decoders. Here, too, Intel has had a significant limitation for a long time: Fetch could only read a maximum of 16 bytes per cycle, and this has basically been the case since the Pentium Pro from 1995. Some x86 instructions can be very long (up to 15 bytes), so this could sometimes be a limitation even for a 4-wide core (though, like the decoding limit, it can be bypassed by the μOP cache). For the first time, Fetch has been doubled and delivers 32 bytes of code per cycle in Golden Cove.

It should be noted that, based on software profiling, the performance of a core (its IPC) in applications typically does not reach even four instructions per cycle. Usual programs are often closer to an average of one instruction per cycle than to four or six. This is because of branching, waiting for data from memory, or because dependencies of one instruction on another make it impossible to execute multiple instructions at once. Therefore, only four decoders in previous CPUs were not necessarily as drastic a bottleneck as it might seem. But extending this stage will allow performance to increase in those cases/moments in code where the potential for greater instruction-level parallelism does exist. And to eventually raise IPC above a certain level, this expansion had to come at some point; Intel has apparently decided that the time is now.
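The effect of dependency chains can be sketched with a minimal dataflow model: a fully serial chain runs at an IPC of 1 no matter how wide the machine is, while independent instructions can fill the whole width. The uniform 1-cycle instruction latency is a simplifying assumption:

```python
# Minimal dataflow model: each instruction lists the indices of
# instructions it depends on; latency is assumed to be 1 cycle.

def min_cycles(deps, width):
    """Earliest-finish schedule given dependencies and an issue width."""
    finish = []                # cycle in which each instruction completes
    issued_in_cycle = {}
    for d in deps:
        earliest = max((finish[j] for j in d), default=0)
        cycle = earliest
        while issued_in_cycle.get(cycle, 0) >= width:
            cycle += 1         # this cycle's issue slots are full
        issued_in_cycle[cycle] = issued_in_cycle.get(cycle, 0) + 1
        finish.append(cycle + 1)
    return max(finish)

chain = [[]] + [[i] for i in range(5)]   # six instructions, each needs the previous
indep = [[] for _ in range(6)]           # six independent instructions
# a 6-wide core needs 6 cycles for the chain (IPC 1) but 1 cycle for the
# independent group (IPC 6)
```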

Schematic view of client variant of the Golden Cove core (Source: Intel)

μOP cache, L1 instruction cache

Intel has kept the L1 instruction cache capacity the same as in older cores (and also AMD Zen 2/3): it is only 32 KB, compared to the 192 KB capacity of the Apple Firestorm core. However, the L1 instruction TLB (Translation Look-Aside Buffer) has been enlarged to 256 instead of the previous 128 entries, and to 32 instead of 16 entries for large (2MB/4MB) memory pages. Prefetching into the L1 instruction cache is also improved.

Hand in hand with the decoder improvement, the μOP cache for already decoded instructions has also been enhanced. Previous Intel cores (and AMD Zen/Zen 2) could supply up to six μOPs per cycle from the μOP cache, which Golden Cove expands to eight μOPs per cycle (AMD Zen 3’s μOP cache has the same capability).

Like its predecessors, the architecture assumes that most of the time the execution path of instructions will not actually go through the decoders, because the necessary instructions (already decoded to μOPs) will be found in the μOP cache instead. When operating from the μOP cache, the decoders can be turned off (clock-gated) to save energy. According to Intel, the core should typically use the path through the μOP cache about 80% of the execution time, so this is the primary case and the full-decode path is the less likely scenario.
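Taking Intel’s ~80% figure at face value, a very rough estimate of the effective front-end bandwidth can be computed as a weighted average of the two paths. Note this is only an illustration: the figure Intel quotes is a time share, and the 6-wide rename stage still caps sustained delivery at six μOPs per cycle:

```python
# Very rough effective front-end bandwidth estimate for Golden Cove.
# ASSUMPTION: Intel's "~80% of execution time from the µOP cache" figure
# is treated as a simple weighting between the two delivery paths.

uop_cache_width = 8      # µOPs/cycle from the µOP cache
decode_width = 6         # instructions/cycle from the decoders
uop_cache_share = 0.80   # Intel's stated typical time share

effective = (uop_cache_share * uop_cache_width
             + (1 - uop_cache_share) * decode_width)
# ~7.6 µOPs/cycle peak delivery; the 6-wide rename stage downstream
# still limits sustained throughput to 6 µOPs/cycle
```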

At the same time, the capacity of the μOP cache has been increased, so the chance that the processor will find the instructions here and will not have to turn on the decoders (the hit rate) goes up. The larger capacity also allows significantly longer code loops to fit in the cache. While Skylake had a capacity of 1,500 μOPs and Sunny Cove/Willow Cove/Cypress Cove increased it to 2,250, the Golden Cove core increases the capacity to 4,000. AMD was a bit ahead in this regard: Zen 2 already has the same capacity, and Zen 3 keeps it too.

The μOP Queue following the μOP cache has also been slightly enlarged, with 72 instead of 70 entries for each of the two threads that the architecture can process (Golden Cove supports HT a.k.a. SMT, so one core processes two threads at once). What is new here is the ability to combine these separate per-thread queues into one with 144 entries when the processor processes only one thread instead of two.

The μOP Queue is followed by the Allocation and Rename stage, in which working registers are allocated for μOPs and the architectural registers are renamed (substituted for others from the physical register file) to eliminate conflicts. This pipeline stage can now also process six operations (μOPs) per cycle, while previous Intel architectures could do a maximum of five. Here you can see how virtually all processing stages have been widened for higher IPC in this core.

Performance should be further enhanced by Intel improving the processor’s ability to resolve conflicts at this stage and to eliminate more instructions that do not need to be sent to the execution units because they can already be resolved here. At this stage the CPU eliminates MOVs (in Alder Lake this should work again, whereas Ice/Tiger/Rocket Lake had an erratum and required a microcode update that turned this feature off) and other instructions that do not require an execution unit. Another case of eliminated instructions are zeroing idioms, such as XORing a register with itself in order to zero it (this operation does not depend on the previous value of the register), which processors have been capable of eliminating for a long time. Golden Cove can eliminate more instructions like this, but Intel does not say exactly how the capabilities have improved in this direction.

Alder Lake processor with eight Golden Cove cores and eight Gracemont energy-saving cores, schematic illustration (Source: Intel)

Significantly improved branch prediction

Before we look at the next part of the core, where out-of-order optimization of instructions takes place and then their execution, it is necessary to mention the branch predictor, which is in the frontend and operates early in the pipeline of instruction processing.

This part of the processor is continuously tuned by microarchitecture engineers, so it improves in each new generation of CPUs (unless it is a “refreshed” silicon taking over the core from the previous generation without changes). In Golden Cove, however, the prediction is said to be improved greatly. The accuracy (success rate) of the predictors themselves was improved, but at the same time they also have much more memory available, which should further improve prediction because the predictor has more data to work with. The L2 Branch Target Buffer (BTB) now has a capacity of 12,000 entries, compared with 5,000 in the previous generation of the architecture (for comparison, Zen 3 should have 6,500 entries).



We have highlighted the widening of the decoders as symbolic of how the Golden Cove architecture goes in the direction of a wider core, and the same can be said of the backend execution units, although there the trend may not be as noticeable. The IPC-increasing policy is, however, very noticeable in the size of the out-of-order buffers and queues. Thanks to their enlargement, the processor can hold many more operations in these structures, so it will not stall as much during various delays (for example, when data is missing and needs to be loaded from RAM, or on branch mispredicts). These buffers also serve to optimize performance by allowing the CPU to perform operations in an order different from the program code order.

The latter is the very nature of what is called an “out-of-order execution” architecture: the core reorders the instructions it is processing in order to fit as much work as possible into a single cycle, trying to utilize as many of its execution units at once as possible. This reordering allows the core to exploit instruction-level parallelism in code even when a data dependency between instructions would block a simpler “in-order” core, because a following instruction has to wait for the output of the preceding one or for data being fetched from memory. An out-of-order core looks ahead into the instruction stream for independent instructions that already have their inputs ready and executes those in advance. This greatly improves the ability to execute four or six operations per cycle in these wide microarchitectures and is the main source of IPC increases in modern processor cores. To achieve higher and higher IPC through out-of-order execution, the CPU core has to be given visibility into a larger and larger “window” of code over which it can look for independent operations and perform the reordering. The size of this window is determined by the depth of the so-called Re-Order Buffer (ROB), the queue in which the reordering takes place.
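A toy scheduler illustrates why window size matters: with a small window, a long dependency chain at the head of the window can hide independent work that sits further ahead in program order. This is a drastically simplified sketch (uniform 1-cycle latency, oldest-first window), not how a real ROB is implemented:

```python
# Toy model: the core may only pick work from the `window` oldest
# not-yet-executed instructions, issuing up to `width` per cycle.
# ASSUMPTION: every instruction has 1-cycle latency.

def cycles_with_window(deps, width, window):
    done = [False] * len(deps)
    cycles = 0
    while not all(done):
        cycles += 1
        snapshot = done[:]            # results become visible next cycle
        pending = [i for i, d in enumerate(done) if not d][:window]
        issued = 0
        for i in pending:
            if issued == width:
                break
            if all(snapshot[j] for j in deps[i]):
                done[i] = True
                issued += 1
    return cycles

deps = [[]] + [[i] for i in range(7)]     # an 8-long dependency chain...
deps += [[] for _ in range(8)]            # ...followed by 8 independent instructions
small = cycles_with_window(deps, width=2, window=4)    # 11 cycles
large = cycles_with_window(deps, width=2, window=16)   # 8 cycles
# the larger window "sees" the independent work past the chain and
# overlaps it, finishing sooner
```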

It is exactly the Re-Order Buffer that sees perhaps the second most significant increase in resources in the Golden Cove core, after the Fetch and Decode stages. And it is likely no coincidence that Re-Order Buffer depth is another parameter that is exceptional in Apple’s ultra-high-IPC cores; Intel is again going down the same path. While in Skylake the ROB was 224 entries (μOPs) deep and Ice/Tiger/Rocket Lake (the Sunny/Willow/Cypress Cove architectures) increased it significantly to 352 μOPs, the new Golden Cove core comes with another large bump: its ROB now has 512 entries.

It has to be said that Apple’s microarchitecture is still quite a bit ahead; it is believed to have a roughly 630-deep ROB. The relative jump in depth in Golden Cove is nevertheless very large, and most observers likely didn’t expect this much. Moreover, we can expect further deepening in future updated microarchitectures. This increased ROB depth can’t be had for free: it costs more transistors and also more power, and because of that it is likely quite difficult to implement in an energy-efficient way. Interestingly, AMD’s Zen 3 microarchitecture still relies on a ROB that is merely 256 entries deep (while achieving IPC on par with or better than Intel’s Tiger Lake with its 352-deep ROB, which shows that CPU core performance is never determined by one isolated parameter alone). But it is likely that AMD will increase the ROB “window” too in its future microarchitectures.

Increased performance potential in the backend

As with Fetch and Decode, the increase in ROB depth goes hand in hand with other increases, namely in the number of execution units in the backend of the core. It makes sense to increase these resources together, in balance, as the larger window is what allows the core to actually utilise more parallel execution units.

Intel keeps using its traditional scheme (which goes all the way back to its first out-of-order core, the P6 a.k.a. Pentium Pro), where execution units are grouped behind a set of execution ports and the scheduler in the core dispatches μOPs to these ports. While execution units behind different ports can be used simultaneously, units behind a single port cannot. The scheduler itself is a single unified one, simultaneously serving the ALU ports, the load-store (AGU) unit ports and also SIMD/FPU instructions (whose units are located on the same ports as the scalar ALUs). Competing cores from ARM vendors and also from AMD split the core into separate parts for scalar ALU instructions and AGU units, with a separate pipeline for FPU and SIMD operations that can have its own scheduler. Interestingly, this split design is what Intel actually uses in the “little core” microarchitecture of Alder Lake (Gracemont), while Golden Cove does not.

Golden Cove has increased the number of parallel ports to 12 from the 10 used by the previous architecture (in Ice/Tiger/Rocket Lake; the Skylake core used 8). This increases the number of operations that the CPU can simultaneously issue in a burst from its scheduler (but as we have seen, the previous pipeline stages are limited to 6 μOPs per cycle, so it is not possible to sustain 12 μOPs of backend throughput outside of bursts).

Five ALUs for the first time

The ALUs, or (integer) arithmetic-logic units, are located behind ports 0, 1, 5, 6 and behind the newly added port 10, which adds a fifth ALU to the four present in past cores. This also makes Golden Cove the widest processor core yet in the realm of the x86 architecture. For comparison, AMD’s Zen through Zen 3 cores have only four ALUs, but Apple has famously used six ALUs for some time now.

All five ALU ports also support the LEA (Load Effective Address) instruction, which is very common in x86/x86-64 code. Its original purpose is address calculation for memory accesses, but it can be exploited for arithmetic, and compilers routinely emit it for this purpose. Golden Cove can execute LEA on all five ports with 1-cycle latency, which means the result is ready in the next cycle.
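Arithmetically, an LEA computes base + index × scale + displacement in a single operation, which is why compilers use it for multiply-and-add patterns. A small sketch of the computation it performs:

```python
# What LEA computes: base + index*scale + displacement, in one operation.

def lea(base, index=0, scale=1, disp=0):
    """Model of the x86 effective-address calculation LEA performs."""
    assert scale in (1, 2, 4, 8)      # the only scale factors x86 encodes
    return base + index * scale + disp

x = 7
# a compiler can turn `x * 9 + 4` into one LEA: lea reg, [x + x*8 + 4],
# replacing a multiply and two adds with a single 1-cycle operation
assert lea(x, x, 8, 4) == x * 9 + 4
```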

While all five ALUs can do simple operations, more complex instructions have more limited throughput. Integer multiply is handled by only two ports, as are bit shifts. Branches (JMP) are also handled by only two ports, which means Golden Cove can process two branches per cycle (same as Zen 3). Integer division is provided by only a single port/unit.

Stronger AGU subsystem for better memory performance

Load-store units, or AGUs (address generation units), perform reads and writes to RAM, or more precisely to the L1 data cache, and they have also been strengthened in Golden Cove. Instead of four AGUs there are now five (ports 2, 3, 7, 8, with the fifth added on the second newly introduced port, port 11). Three units perform load operations and two are for stores, which means Golden Cove supports three reads and simultaneously two writes to the L1 cache. The maximum read and write operations per cycle are the same as for AMD Zen 3, but there is a crucial difference: Zen 3 has only three AGUs and can do only three memory operations in total: three reads, or two reads and one write, or one read and two writes. Besides the AGUs themselves, the Golden Cove core has two separate ports for store-data operations (ports 4 and 9), which is something carried over from previous cores.

It is, however, not just the number of read/write operations per cycle that we should look at; the bandwidth is also important. And Golden Cove provides up to double the bandwidth achieved by Zen 3. The regular client version of the Golden Cove core used in Alder Lake should be able to perform three 256-bit loads (256 bits corresponds to the width of an AVX/AVX2 register) from L1 cache per cycle, which gives a bandwidth of 768 bits per cycle, a 50% improvement over the previous core. The server variant of Golden Cove with AVX-512 support will alternatively be able to perform two 512-bit reads (512 bits matching the width of an AVX-512 register) per cycle, allowing the core to feed its execution units with up to 1 Kbit (128 bytes) of data per cycle.
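The bandwidth arithmetic above, spelled out:

```python
# L1 load bandwidth per cycle, in bits, for the configurations discussed.

client_bw = 3 * 256     # Alder Lake client: three 256-bit loads/cycle
server_bw = 2 * 512     # Sapphire Rapids server: two 512-bit loads/cycle
previous_bw = 2 * 256   # previous core generation: two 256-bit loads/cycle

assert client_bw == 768                  # bits/cycle (96 bytes)
assert server_bw == 1024                 # 1 Kbit/cycle (128 bytes)
assert client_bw / previous_bw == 1.5    # the quoted 50% improvement
```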

The L1 cache itself has not changed in associativity or capacity; it is still 12-way, 48 KB, as introduced in the previous Ice Lake/Sunny Cove architecture. The memory subsystem performance was, however, boosted by other means. The number of L1 cache fill buffers was increased from 12 to 16, L1 cache prefetching was improved, and the L1 data TLB was enlarged from 64 to 96 entries. All these tweaks should improve memory performance. The processor should also see better performance when it needs to fetch data from further levels of the memory subsystem on a TLB miss, thanks to the number of page walkers increasing from two to four (Intel is somewhat less aggressive than its competitor here: AMD moved from two to six page walkers in Zen 3).

Furthermore, the Store and Load Buffer queues for reads and writes to RAM (through the L1 cache) have been deepened, but Intel is not disclosing the exact values. Another improvement is in the performance with which the core handles memory disambiguation (determining the dependencies between memory operations, which allows their faster out-of-order execution). The net result of these optimizations is lower effective L1 load latency, according to Intel.

The L2 cache has been improved too. In particular, it can now handle more outstanding cache misses (requests for data from further levels of the memory hierarchy when they are not found in L2). While older cores could have up to 32 outstanding misses in flight, Golden Cove increases the limit to 48. This should allow one core to sustain a higher absolute memory bandwidth from RAM, which is, by the way, another parameter in which Apple cores are considered superior to AMD’s or Intel’s, possibly contributing to their high effectiveness (IPC). Maybe Golden Cove will get closer thanks to this. The L2 cache should also have more effective prefetching, which now supports feedback-based throttling so that it limits its bandwidth usage when the bandwidth is needed for other operations. There should be other improvements too; for example, Intel mentions a new full-line-write predictive bandwidth optimisation.
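The link between outstanding misses and sustainable bandwidth follows from Little’s law: throughput equals data in flight divided by memory latency. The latency below is an assumed, illustrative value, not an Intel figure:

```python
# Little's law applied to memory-level parallelism:
# sustained bandwidth = in-flight data / memory latency.
# ASSUMPTION: ~80 ns average DRAM round-trip latency (illustrative).

line_bytes = 64          # one cache line carried per outstanding miss
latency_ns = 80

def sustained_gbps(outstanding_misses):
    """Bytes per nanosecond, which is numerically GB/s."""
    return outstanding_misses * line_bytes / latency_ns

old_bw = sustained_gbps(32)   # previous cores: 25.6 GB/s per core
new_bw = sustained_gbps(48)   # Golden Cove:    38.4 GB/s per core
# more misses in flight directly raise the bandwidth one core can sustain
```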

The L2 cache capacity is unchanged from Tiger Lake/Willow Cove at 1.25 MB per individual core (which will, however, be a large increase compared with the desktop Rocket Lake processors that use a smaller 512 KB L2 cache; it seems, though, that no large impact from this was observed in Tiger Lake). The server version in Xeon “Sapphire Rapids” processors will add a further 768 KB, ending up with 2 MB in each core.



As already mentioned, due to not using a split-FPU design, the FPU and SIMD units (which handle not just floating-point but also integer SIMD instructions, despite traditionally being put into the same bag as FPU operations) are fed from the same scheduler as the integer part of the core; moreover, the FPU/SIMD units share their ports with the integer ALUs. The parallelism of the FPU/SIMD units is weaker, though: all these operations are handled by only three ports in total, ports 0, 1 and 5, the same as in previous cores.

This use of just three ports might sadly somewhat limit the ability to improve IPC in programs by interleaving SIMD and scalar integer instructions (a strategy that can be successful on ARM or AMD Zen processors). Fewer ports exposing the underlying units also leads to more potential scheduling conflicts. For example: while it can be very useful that the Intel cores can perform two shuffles per cycle, doing so leaves just one more port available for an additional SIMD operation in the same cycle. So while Golden Cove is in theory a wider core, it is possible that Zen 2 or Zen 3 could achieve higher IPC (instructions-per-cycle throughput) for certain instruction mixes.

All three ports contain vector ALUs for integer SIMD operations, while permutations (shuffles) are available on two ports (1, 5), as are shifts (0, 1), and there is just one floating-point divider. The core should be able to perform two or three 512-bit AVX-512 operations per cycle, depending on their type, when working with integer data types. When doing floating-point math, the maximum is two 512-bit FMAs or up to three 256-bit FMAs.

The FMA unit is present on all three ports, but as in previous cores (or at least the server variants thereof), the inner working is such that the first two ports only have 256-bit FMA units that can each do one 256-bit AVX/AVX2 operation per cycle individually, or couple together to perform a single 512-bit AVX-512 instruction per cycle. The third FMA unit, on port 5, is natively 512-bit wide, allowing it to perform one AVX-512 instruction per cycle (a second one for the whole core). We do not yet know whether this unit will be exclusive to the server variant of the core (Sapphire Rapids) or the HEDT derivatives, but that is a likely possibility. If so, the client core will not be capable of three 256-bit FMAs per cycle, only two.
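For reference, the peak FP32 throughput implied by these FMA configurations can be computed as follows (an FMA counts as two floating-point operations, a multiply plus an add):

```python
# Peak FP32 FLOP/cycle implied by an FMA configuration.

def fma_flops_per_cycle(n_fma_units, vector_bits, elem_bits=32):
    lanes = vector_bits // elem_bits
    return n_fma_units * lanes * 2           # FMA = multiply + add

server_peak = fma_flops_per_cycle(2, 512)    # two 512-bit FMAs: 64 FLOP/cycle
client_peak = fma_flops_per_cycle(3, 256)    # three 256-bit FMAs: 48 FLOP/cycle
# if the port 5 FMA turns out to be server-only, the client peak drops to
# fma_flops_per_cycle(2, 256) == 32 FLOP/cycle
```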

What is new in Golden Cove is that Intel has added an alternative FADD unit to ports 1 and 5, which does floating-point SIMD adds with a lower latency and particularly with lower energy consumption compared to the use of the heavyweight FMA units for this task. The processor should prefer these units when possible.

FP16 calculation support

AVX-512 instructions have gained a new ability in the Golden Cove core: performing operations on the 16-bit FP16 data type (where the total computation throughput should be double that of FP32). On one hand this further fragments the AVX-512 ecosystem with yet another extension, but on the other, FP16 operations should be highly useful for tasks like AI processing and perhaps even multimedia processing.

This, together with the added FADD pipelines, seems to be the biggest enhancement of the SIMD capability. Otherwise the theoretical computation capacity has seemingly not been expanded (unless Intel failed to disclose some additional changes), but the IPC of programs using SIMD should still go up thanks to the improvements in other parts of the core.

AVX-512 hard disabled in all Alder Lake chips?

The advancements in AVX-512 are in a way moot, though. Why? As Alder Lake has embraced the big.LITTLE concept, AVX-512 support has to be disabled, because it is in conflict with the “little” Gracemont cores, which do not support these instructions. Without some special trapping/exception handling, any program attempting to use an AVX-512 instruction would instantly crash if it happened to be running on a Gracemont core. Even if that were solved, AVX-512 code would be hamstrung in multi-threaded applications, because it would be limited to running on just the big cores, and in multi-threaded workloads the little cores actually represent a very significant part of the overall throughput of Alder Lake. This means that a program would in most cases lose more performance by not being able to utilise the little cores than it could gain from using AVX-512 on the remaining big cores.

For these reasons, Intel has decided to completely disable AVX-512 in the consumer version of the Golden Cove core used in Alder Lake processors, despite the support being present in the silicon. The cores will only support AVX and AVX2. Furthermore, even disabling all the little cores won’t allow you to turn AVX-512 back on; the support is said to actually be fused off. This is unfortunate for software developers, as there had previously been hints that for such uses, AVX-512 could be re-enabled by switching the little cores off. Intel seems to have decided not to allow this in the end, possibly to avoid confusion and extra complexity.

You will only be able to use AVX-512 in the server variants of Golden Cove employed in Sapphire Rapids and derived HEDT processors. Intel might in theory make special Alder Lake SKUs for the mainstream LGA 1700 socket with no little cores enabled, which could then support AVX-512. But we think this is unlikely, as it would look a bit awkward in Intel’s product segmentation scheme, which the company likely wants to avoid.

Schematic view of server variant of the Golden Cove core (Source: Intel)

Many times before, there have been complaints that Intel’s own policies are the biggest hurdle to AVX-512 adoption. This latest step is the biggest self-sabotage of the technology yet. Ironically, Intel is torpedoing a feature it has previously presented as its competitive advantage, just two years after AVX-512 finally made its way into mainstream client CPUs with the Ice/Tiger/Rocket Lake processors; now it is being taken away again. This will make it hard for software developers to employ these instructions, as the installed base of capable computers will remain small. Big.LITTLE turns out to be bad news in this regard. In yet another strange development, there are now some (so far unofficial) reports that AMD’s Zen 4 architecture might actually start to support AVX-512, just as Intel is stepping away from it. Competing AMD processors might then become the go-to product if you require AVX-512 support, once Intel’s exclusive selling point.

Some further instruction set extensions that prior big cores supported may have also been disabled in Alder Lake’s Golden Cove cores to make them instruction-set compatible with the little Gracemont cores. These additional sacrifices hopefully won’t be as problematic (though if you specifically used them for some optimization in your code, you no doubt won’t be amused).

What has, however, been retained from AVX-512 even in the consumer client version of Alder Lake are the VNNI instructions. These have been reduced to a possibly somewhat less performant version employing just the 256-bit YMM registers of AVX/AVX2 instead of the 512-bit ZMM registers. This allowed them to be implemented in the little core as well, and thus this VNNI256 variant will be available on Alder Lake.


Advanced Matrix Extensions (AMX)

The server variant of Golden Cove in Sapphire Rapids Xeons will offer one extra feature besides its AVX-512 support: the AMX instruction set extension, which will not be physically present in the client variant of the core (Alder Lake).

AMX is a set of instructions that performs 2D matrix multiply operations with a very high level of parallelism, achieving very high raw performance. This performance will however be applicable to a limited range of uses; these operations are chiefly useful for AI acceleration. You can think of the AMX unit as an accelerator similar to the tensor cores in Nvidia GPUs. Golden Cove contains one AMX unit, exposed on port 5.

AMX instruction set extension for AI acceleration performs matrix multiplication using the core’s new tile registers

The AMX instructions perform the matrix multiply on INT8 data (a BF16 variant is also defined). The calculations use new tile registers, eight of which (TMM0 to TMM7) are present in the core; each tile holds up to 1 KB of data (16 rows of 64 bytes), i.e. up to 1024 INT8 values. The unit achieves up to 2048 INT8 operations per cycle on one core. Unless we have misread Intel somewhere, this means roughly 2 TOPS of AI performance per core at a 1 GHz clock speed, 8 TOPS at 4 GHz, and so on.
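The following Python sketch models what an AMX tile multiply of the TDPBSSD kind does semantically (signed INT8 input tiles, INT32 accumulation into a destination tile) and reproduces the throughput arithmetic behind the TOPS estimate. The tile shapes in the example are small illustrative assumptions, not the actual Sapphire Rapids tile configuration.

```python
def tile_dp_int8(C, A, B):
    """C[m][n] += sum_k A[m][k] * B[k][n] -- int8 inputs, int32 accumulation,
    the semantics of an AMX TDPBSSD-style tile multiply (shapes illustrative)."""
    M, K, N = len(A), len(A[0]), len(B[0])
    for m in range(M):
        for n in range(N):
            C[m][n] += sum(A[m][k] * B[k][n] for k in range(K))
    return C

# Throughput arithmetic from the article: 2048 INT8 ops/cycle per core,
# so peak TOPS scales linearly with clock frequency.
ops_per_cycle = 2048
for ghz in (1.0, 4.0):
    print(f"{ghz} GHz -> {ops_per_cycle * ghz * 1e9 / 1e12:.0f} TOPS")
# 1.0 GHz -> 2 TOPS, 4.0 GHz -> 8 TOPS
```

As with GPU tensor cores, the high figure comes from counting each multiply and each accumulate as a separate operation.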


As for how big the performance jump from all these architectural changes really is, we will as always have to wait for independent reviews to fully gauge it. That said, Intel offers its own estimate based on internal testing, for what it’s worth. While official benchmarks can be deceptive due to cherry-picking of best-case scenarios, in this case the numbers look like they might be reliable. According to Intel’s measurements, Golden Cove has 19% better IPC, in other words 19% higher performance when running at the same frequency as the previous architecture.

Note that this IPC figure is an average calculated from many individual measurements. Intel bases it on benchmark results from subtests of SPEC CPU 2017, SYSmark 25, PCMark 10, WebXPRT13 and Geekbench 5.4.1. The results were taken on a Golden Cove core of an Alder Lake processor, which should mean AVX-512 was not used in these tests and that we are talking about the client version of the core with 1.25MB of L2 cache. The gains are calculated against a Cypress Cove core of a desktop Rocket Lake processor, which should have AVX-512 active but a smaller 512KB L2 cache. Intel tested both processors at a fixed 3.3 GHz clock frequency. What is not stated is whether the RAM was the same, as Alder Lake can use DDR5 memory while Rocket Lake is limited to DDR4. The results could also quite possibly be influenced by the missing AVX-512 support in Alder Lake: at least some of the subtests may benefit from AVX-512 when running on Rocket Lake, which would make the +19% iso-frequency gain achieved by Alder Lake without AVX-512 all the more impressive. It is therefore possible that an AVX-512-enabled core (or, conversely, the client core tested against a Rocket Lake with AVX-512 disabled) would have demonstrated an IPC gain above 20%, which would be more in line with the large resource increases in the core.
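Intel does not disclose exactly how the subtest results are aggregated, but composite uplifts of this kind are conventionally computed as a geometric mean of the per-subtest iso-frequency speedups (this is, for instance, how SPEC composes its scores). A small sketch under that assumption, with made-up per-subtest ratios:

```python
from math import prod

def composite_ipc_gain(speedups):
    """Geometric mean of per-subtest speedup ratios (1.19 means +19%)."""
    return prod(speedups) ** (1.0 / len(speedups))

# Hypothetical iso-frequency ratios (Golden Cove / Cypress Cove) chosen to
# mimic the spread in Intel's graph: a slight regression, mid-range gains,
# and one large outlier.
subtests = [0.97, 1.10, 1.19, 1.25, 1.60]
print(f"composite gain: +{(composite_ipc_gain(subtests) - 1) * 100:.0f}%")
```

The geometric mean keeps one extreme outlier from dominating the composite, which is why individual subtests can sit far above or below the headline +19%.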

Alder Lake processors

Even so, the 19% IPC increase is quite substantial, being a bit higher than the 18% boost Intel announced for its previous big microarchitecture update, the Ice Lake/Sunny Cove core. We should emphasize once again that the number is a composite average; in reality, the IPC of a CPU core varies from task to task depending on the code and the workload’s characteristics. As you can see in Intel’s graph, which shows the individual (but unannotated) sub-results, there is a distribution of different IPC gains. In a select few cases Golden Cove doesn’t gain much or even suffers a performance regression. This may have various causes, for example changes in prefetch behavior that happen not to suit the individual program, but more likely these cases show the effect of the AVX-512 removal, meaning the IPC drops occur in code that uses AVX-512 on Rocket Lake and benefits from it.

IPC improvements of Golden Cove core (Alder Lake CPU, 1.25MB L2 cache, without AVX-512) compared to Cypress Cove architecture (Rocket Lake CPU, 512 KB L2 cache, AVX-512 active) as observed by Intel

On the other hand, there are also results that greatly outpace the +19% average. The graph suggests there are programs where Golden Cove/Alder Lake gains up to 60% more performance at the same clock compared to a Rocket Lake core. These are likely programs that were strongly limited by some bottleneck Golden Cove alleviated or by some resource it increased, such as L1 cache bandwidth, the number of load operations possible per cycle, or the 16-byte fetch limit of past cores that has been raised to 32 bytes.
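The size of such outliers is consistent with simple Amdahl’s-law arithmetic: if a large fraction of a program’s cycles were spent waiting on one bottleneck and the new core relieves it, the iso-frequency gain can far exceed the average. The 75% fraction and the 2× factor below are purely illustrative assumptions, not measured figures.

```python
def amdahl_speedup(f, factor):
    """Amdahl's law: overall speedup when fraction f of execution time
    is accelerated by the given factor."""
    return 1.0 / ((1.0 - f) + f / factor)

# If, hypothetically, ~75% of a program's cycles were limited by the old
# 16-byte fetch window and Golden Cove's 32-byte fetch doubles that
# throughput, the iso-frequency gain would be:
print(f"{amdahl_speedup(0.75, 2.0):.2f}x")  # 1.60x -- the scale of the outliers
```

Conversely, a program barely touched by the widened resources (small f) lands near the bottom of the distribution.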

Gaming performance will be determined by more factors than just the CPU core’s IPC

The +19% figure also represents only application software performance. Our readers will often be more interested in gaming performance, but in that case we sadly can’t assume similar gains based on application IPC alone; performance when driving GPUs in high-FPS gaming can show quite different results. Gaming performance is a specific and complex case that depends not just on the raw single-threaded performance of a core, even though that is an important factor and Alder Lake shows strong potential there, but also on L3 cache capacity, cache and memory latencies, and other factors.

The ability to use high-frequency DDR5 memory might help, and the top Alder Lake CPUs are rumored to offer a promising 30MB of L3 cache, significantly more than the Core i9-11900K’s 16 MB or the Core i9-10900K’s 20 MB. So while we don’t yet have an indication of what gaming performance to expect and how much better it will be than the current generation of Intel Core processors, it seems safe to assume there will be good gains here, both compared to Rocket Lake and versus AMD’s Ryzen 5000 processors.

Intel does claim Alder Lake will be the fastest gaming CPU, and it is reasonable to hope or even expect it to achieve performance leadership in this area for Intel.

Performance in application software likely to be excellent

Games aside, single-threaded performance in general will be excellent with Alder Lake processors, that much seems more or less certain. It seems clear it will end up higher than what the fastest Zen 3 processors (AMD Ryzen 9 5900X, Ryzen 9 5950X) achieve, and the percentage gap could be considerable. Alder Lake also seems poised to significantly outperform Apple’s M1 processor in single-threaded performance, which would somewhat take the steam out of the narrative of the “end of x86 is nigh” prophets. That said, the Cupertino company might field its own new processor architecture soon, so we can’t rule out the possibility that Intel’s decisive win will be beaten back shortly after.

And this time, Alder Lake shouldn’t have just high single-threaded performance under its belt; it should also offer high multi-threaded performance, relying on the addition of the “little” Gracemont cores. Thanks to them, Alder Lake will offer configurations with up to 16 cores, although only 24 threads, as the Gracemont cores won’t support HT. In a few days we will analyse this little core’s microarchitecture as well, but we can already say that these cores are going to contribute significantly more multi-threaded performance than many observers were giving them credit for. The “little core” moniker isn’t really fitting at all, so performance expectations based on the assumption that Gracemont would serve a role similar to the ARM Cortex-A55 in mobile chips understandably fell short.

In any case, both Alder Lake processors and the Golden Cove microarchitecture look very promising. They seem set to prove mistaken the opinion that Intel’s position as a technology leader is merely a thing of the past and that the company lives on inertia and borrowed time while heading for an inevitable downfall. On the contrary, if Alder Lake confirms this impressive microarchitectural strength in practice, it will show that Intel’s processor portfolio remains a relevant force to be reckoned with.
Alder Lake seems to be on track to move the state of the art in PC processors forward significantly compared to where the technology stood a year before. And that’s exactly what we ask for.

Sources: Intel, AnandTech

Jan Olšan, editor for Cnews.cz