
Gracemont, the (not so) little Alder Lake core (µarch analysis)


Intel has revealed the Alder Lake CPU architecture, or actually two architectures this time. The CPUs are hybrid and besides the main "big" cores, there are "little" cores called Gracemont. These are not just for marketing or for low-power idle tasks like in mobile ARM SoCs, however. Gracemont should contribute significantly to the overall performance, and the architecture is actually surprisingly beefy. Our analysis will show you more.

The so-called "big.LITTLE" scheme has been subject to much criticism, or at least scepticism, ever since the first news suggesting Alder Lake would take this route surfaced. The prevalent opinion is that it will hold the processors back, at least in the desktop PC segment, if not everywhere else too. We will have to wait for independent reviews at launch (or perhaps even a bit after that) to get definitive answers to this question, but even the on-paper characteristics of the Gracemont core microarchitecture are a significant sign that things won't quite go the way detractors have generally expected. One thing particularly stands out: the whole debate might have been thrown off track from the start by the "little core" naming. Gracemont doesn't really fit the concept of a little core, and calling it one is misleading. Unsurprisingly, this itself might be a large contributor to our expectations being off.

Indeed, Intel doesn't call Gracemont a "little core" (that is actually 100% unofficial parlance) and instead uses the term "Efficient Core" (E-Core), meaning a core with high efficiency—which should refer to energy efficiency as well as silicon area efficiency. It might be beneficial to drop all our preconceptions and notions of big.LITTLE for now, and even ignore Gracemont's lineage, derived from Intel's past little cores bearing the Atom brand, for a moment.
Based on what Intel has presented about Alder Lake, its hybrid architecture using two types of cores is actually envisioned as a solution to a problem that doesn't stem from the mobile space, but is a more general problem of optimally sizing processor cores for various types of software tasks—an issue that is very relevant to desktop processors. This might mean that any mobile benefits Alder Lake brings, like longer battery life, could even be merely secondary benefits rather than the primary goals of this architectural decision.

A solution to the dilemma between best ST and best MT performance?

What is this problem we mentioned, actually? It's like this: there are various demands we put on a high-performance processor. One of the most important is performance while executing a single thread, a.k.a. single-thread (ST) performance. This ability is crucial for many desktop and notebook tasks, but it also contributes to performance in games and to the responsiveness of the operating system's GUI, apps, pages open in web browsers and so on. However, this is not the only performance characteristic. Besides it, there is the similarly important multi-threaded (MT) performance—the amount of performance the processor is able to provide to programs that can utilise all its cores and threads simultaneously. Multi-threaded tasks tend to be very heavy and often long-running, which means that any performance boost can lead to appreciable savings in your time, which single-thread performance, while improving your experience and comfort, might not actually deliver. The whole issue is that a processor core designed to maximally excel in the first, single-thread metric can very easily end up not being optimal for the second, multi-thread metric. When you design your cores to be best in ST, you might end up harming your MT ability, and vice versa: optimising for MT might result in cores weaker in ST performance.

On a side note, this dilemma has been explicitly acknowledged by ARM, which tries to solve it in its portfolio of CPU cores by providing two different designs, one for each task, into which the company has split its previously unified big core design. After Cortex-A77, the company started to provide the more efficient Cortex-A78 and now A710 microarchitectures that are more MT-optimal, and on the other hand the heavier Cortex-X1 and X2 cores that prioritise ST performance at the cost of MT capability.

For a core targeting maximum single-thread performance, it makes sense to make many compromises that trade relatively large jumps in transistor count, complexity and power usage for relatively smaller gains in overall performance. For these cores, it can be perfectly reasonable to accept a 30% power consumption and silicon area hit to gain an extra 10% of performance. This is because single-thread performance is highly coveted and also because a program running on just a single thread does not hit the whole-CPU TDP limit in most processors (it might actually do so in sub-15W chips and in ~5W phone SoCs, though).

Alder Lake processor with 8× Golden Cove and 8× efficient Gracemont cores (illustration) (Source: Intel)

However, when you want to maximise multi-thread performance, you will often see that these ST trade-offs hurt you. Multi-threaded performance is usually bound by the power consumption it is allowed to reach (TDP). The architects basically have to reach the best MT performance (meaning the best possible combined performance across all simultaneously running cores/threads) they can within the envelope of a limited wattage. But this is not compatible with spending 30% more power for 10% more per-core performance, because that would actually lower the performance you get per watt. For maximising MT performance, you need solutions that scale performance in balance with power and bring more extra performance than they cost in power. A good strategy might be running your cores at lower clocks, and thanks to that at lower voltage, than what you would run them at for ST tasks—because that increases efficiency (performance per watt)—and compensating by using more of these cores. However, to be able to increase the number of cores, you might be forced to cut down their silicon area footprint (number of transistors), because extra chip size costs money and you inevitably run into your silicon budget limits (and the related constraints your management will put on your design's manufacturing costs).

Due to these factors, the best core design for achieving MT performance is an architecture that doesn't actually shoot for maximum single-thread performance and is more relaxed about it: for example, a core that sacrifices the last 10% of possible single-thread performance, but in return reduces its power consumption and silicon area footprint by 30%. Given a certain count of them in the chip, such cores can achieve higher overall multi-thread performance within the envelope of a particular TDP budget (be it 35W, 65W or 125W).
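
To make this concrete, here is a minimal back-of-the-envelope model in Python. All the numbers (TDP, per-core power, the 10%/30% trade-off) are invented placeholders, not Intel figures; the point is only to show how the perf/watt flip plays out within a fixed power budget.

```python
# Back-of-the-envelope model of the ST vs. MT trade-off. All numbers are
# invented placeholders for illustration; they are not Intel's figures.

TDP = 65.0           # watts available for all cores together

lean_perf = 1.00     # relative ST performance of an efficiency-minded core
lean_power = 10.0    # watts per core under full load

fat_perf = lean_perf * 1.10    # +10 % ST performance...
fat_power = lean_power * 1.30  # ...bought with +30 % power

print(f"lean core perf/W: {lean_perf / lean_power:.3f}")
print(f"fat  core perf/W: {fat_perf / fat_power:.3f}")   # lower -> worse for MT

# How much total MT throughput fits into the same TDP?
lean_cores = int(TDP // lean_power)
fat_cores = int(TDP // fat_power)
print(f"lean: {lean_cores} cores -> MT throughput {lean_cores * lean_perf:.1f}")
print(f"fat : {fat_cores} cores -> MT throughput {fat_cores * fat_perf:.1f}")
```

In this toy example the "fat" core wins every single-thread race, yet the "lean" design packs more cores into the same wattage and ends up ahead in combined throughput.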

ARM's presentation slide from the Cortex-X1 and Cortex-A78 launch outlines different CPU design strategies (Source: ARM)

In the case of ARM processors, this latter strategy is what governs the design of the Cortex-A78 and the current A710, cores designed to be power- and area-optimal while still belonging to the category of big cores. The first strategy, which maximises ST performance at the cost of worse energy efficiency and a suboptimal core footprint in the chip, has produced the Cortex-X1 and X2 cores.

It seems that in the case of Alder Lake, Golden Cove can be likened to the Cortex-X line, and just as with the Cortex-A78 or A710, the Gracemont core is not really a little core, but at least a "medium", if not a "smaller big" core. Intel's choice to call the cores Performance Core and Efficient Core instead of using the big and little terminology reflects this, too.
The actual goal of the hybrid architecture, then, is to do away with the described dilemma and avoid having to pick between cores that are ST-friendly and cores that are more MT-friendly due to being smaller and more efficient.

With its hybrid combination of cores, Alder Lake and its successors have the advantage that they can mostly do both at once. The processor contains a certain number of big performance cores optimised for maximum single-thread performance, covering the need for it in single-thread tasks. And it is just this limited number of cores that has to bear the efficiency hit, because the rest of the cores use the smaller, efficient microarchitecture optimised for multi-thread performance (sometimes also called "throughput" or "scalability"). The addition of these cores improves the combined MT performance achievable within a given TDP limit, but also within a certain die size limit. ST performance doesn't need to be the very best on all of the cores, merely on the few that will be used for the relevant tasks and for games—as long as the operating system is clever enough to schedule such programs to the right cores. It might therefore be advantageous to use the little (or rather, efficient) architecture for most of the cores in the processor. It seems this might be exactly what Intel is planning: while Alder Lake uses an 8+8 combination, leaked rumours suggest that future generations will shift the balance towards a much bigger share of efficient cores, with CPUs combining 8 performance cores with 16 or later even 32 efficient cores.

The goal of the Gracemont core is to achieve as efficient an architecture as possible while still keeping performance high, with a small footprint that allows scaling to high core counts. At the same time, Gracemont keeps support for SIMD and AI-acceleration instruction set extensions. Another requirement was a wide range of operating frequencies and voltages to allow for low-power operation.

Theory is nice, but the trial by fire is yet to come

The reason we subjected you to this lengthy preface was to explain that there are sound theoretical arguments to be made in favour of the hybrid ("big.LITTLE") strategy, and they might hold true even for desktop high-performance processors. This is despite the prevalent scepticism readily seen everywhere, including in the comments under our articles, which show that a large part of hardware enthusiasts don't expect this idea to be successful in PC processors and have negative expectations.

To be clear: we don't yet know for sure whether Alder Lake actually manages to solve the problem we just described. Sometimes even sound and well-researched ideas fail to transform into a working solution. There are possible pitfalls; for example, Intel might fail to make its implementation of the efficient cores efficient enough to outweigh the scheduling problems an asymmetric processor could cause. The concept behind the hybrid decision does seem reasonable and viable, though.

With that said, let's finally get to the promised analysis of the Gracemont microarchitecture itself. We recommend, however, that you read the similar analysis of the big core, Golden Cove (Performance Core), first, in case you have not done so yet. The little Gracemont microarchitecture (Efficient Core) is best viewed in the context of its larger sibling.

Suggested reading: Intel Alder Lake/Golden Cove CPU core unveiled (µarch analysis)


The goal of Gracemont was to design a core that achieves as much performance as possible within as small a silicon footprint (and power consumption) as possible, which would allow good multi-thread performance scaling by adding more of these cores. The performance of the core should be around or even above the level of the older Skylake core—at least when looking at the same clock frequencies—but with Gracemont consuming much less power.

Besides an IPC that is higher than Skylake's, the core was also designed to run at low voltages, allowing it to reach a power consumption that is a fraction of what Skylake runs at. Intel even claims that Gracemont is the most power-efficient x86 core in the world to date (meaning that it has the best performance-per-watt ratio).

There is one crucial difference: Gracemont is simpler in that it always executes just one thread; there is no form of HT (SMT) capability. We don't know whether Intel is absolutely opposed to adding this feature to its Efficient Cores, for example due to the costs it would pose in complexity, power draw and extra transistors. It could be that the company doesn't actually rule out Efficient Cores with HT and just hasn't got around to adding it to this architectural lineage yet. HT is a feature that has natural synergy with wide, higher-performance CPU cores, and Gracemont is actually a surprisingly wide design.

Frontend with 2×3 decoder clusters

Gracemont's design does show that it has evolved from past members of the Atom family of cores, but the architecture has been substantially beefed up (widened and deepened). The core inherits the most striking feature of the previous Tremont architecture: its dual-cluster instruction decoder. Gracemont, like Tremont before it, uses two clusters of decoders, each comprising three units. The combination of these two clusters isn't as strong as the six-wide decoder (a single cluster) in Golden Cove, because the clusters can't always couple up to decode six consecutive instructions per cycle. What they can do is decode two instruction streams in parallel, with three decoded instructions per cycle in each stream. This solution is more energy- and area-efficient than the brute-force 6-wide decoding of Golden Cove.

The two clusters can be used simultaneously in some cases, for example when the processor knows there is a jump due to branching. It can then use the jump's target address as a starting point from which the second instruction stream can be fetched and decoded in parallel using the second cluster. Branches can be quite common in code (Intel mentions they can occur as often as every 6 instructions), which means that even this seemingly limited usability of the second decoder cluster can matter surprisingly often.
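
As a rough illustration of why this matters, the following Python sketch estimates average decode throughput under our own simplifying assumptions (not Intel's data): long branch-free stretches keep only one 3-wide cluster busy, while branchy code can use both clusters, capped here by the 5-wide allocation stage further down the pipeline.

```python
# Illustrative arithmetic only (our own simplification, not Intel data):
# long branch-free stretches keep just one 3-wide cluster busy, while branchy
# code can use both clusters, capped here by the 5-wide allocation stage.

STRAIGHTLINE_RATE = 3.0   # instructions/cycle, single cluster
BRANCHY_RATE = 5.0        # both clusters, limited by 5-wide allocation

def average_decode_rate(branchy_fraction):
    # time-weighted (harmonic) mean over the two kinds of code
    time_per_instr = (branchy_fraction / BRANCHY_RATE
                      + (1 - branchy_fraction) / STRAIGHTLINE_RATE)
    return 1.0 / time_per_instr

for frac in (0.0, 0.5, 0.9):
    print(f"{frac:.0%} of instructions in branchy code: "
          f"~{average_decode_rate(frac):.2f} decoded per cycle")
```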

While the decoding capacity is the same as in Tremont, the L1 instruction cache of the core has been doubled from 32KB to 64KB (which is 2× more than Golden Cove has, by the way). Similarly, the Branch Target Cache used by the branch predictor has been enlarged to a capacity of 5000 entries in Gracemont (we don't know what this value was in Tremont). This should enhance branch prediction accuracy. The predictor also works with a long history, which should again improve its success rate.

Predecode stores metadata in the L1i cache

The core uses an optimisation called On-Demand Instruction Length Decode. What it boils down to is that the processor does predecode work when it loads parts of program code into the L1 instruction cache: this predecode determines the lengths of the instructions in the code, which are then stored in the L1i cache as metadata. This gives the core an idea of where individual instructions begin, and the knowledge is reused when the processor reruns the same code (for example in loops), since part of the energy-intensive decoding work can be skipped.

Gracemont, like Tremont, has no μOP cache that would store fully decoded instructions for later reuse the way the big Golden Cove core does (Intel has used a μOP cache since Sandy Bridge, AMD since Zen 1, and even ARM with its fixed-length instruction set has had one since Cortex-A77). One of the motivations besides performance is that a μOP cache lowers power consumption. Using predecode and remembering the instruction boundaries seems to be an alternative to using a full-blown μOP cache, one that is simpler to implement and much cheaper in the transistors needed (and hence silicon area footprint), but on the other hand also less advanced.
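
The following toy Python sketch illustrates the general idea of caching instruction-length metadata alongside the code. It is purely conceptual: the byte stream and the length rule are invented for the demo, and this is not a model of real x86 decoding.

```python
# Toy illustration of caching instruction-length metadata next to the code.
# Purely conceptual: the byte stream and the length rule below are invented;
# this is not real x86 decoding.

code_bytes = bytes([0x90, 0x48, 0x89, 0xC3, 0x0F, 0x1F, 0x00, 0xC3])

def find_length(buf, offset):
    """Stands in for the expensive length determination done at predecode."""
    first = buf[offset]
    return 3 if first in (0x48, 0x0F) else 1   # made-up rule for the demo

length_metadata = {}   # stands in for boundary bits stored in the L1i cache

def next_instruction(offset):
    if offset not in length_metadata:              # first pass: predecode
        length_metadata[offset] = find_length(code_bytes, offset)
    return offset + length_metadata[offset]        # reruns reuse the metadata

pc = 0
while pc < len(code_bytes):        # the first walk pays for length decode;
    pc = next_instruction(pc)      # looping over the same code would not
print("cached instruction boundaries:", sorted(length_metadata))
```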

Beefing up the out-of-order engine and execution units

The biggest upgrades Intel made to the Golden Cove core were in the frontend, while the backend (the execution units themselves) was boosted comparatively less. The Gracemont core is almost the opposite. While its frontend is not that different from Tremont's, its execution backend was widened and strengthened significantly. There is one thing in common with how Intel evolved Golden Cove, though: improvements to the middle stages of the core, where the out-of-order optimisation and reordering of the instructions to be executed happens.

The Allocation stage in Gracemont can process up to 5 instructions (or rather μOPs) per cycle coming from the decoders and the μOP queue. This is 25% more than in Tremont, which was 4-wide at this stage, and Gracemont doesn't fall far behind Golden Cove, whose Allocation is 6-wide (6 μOPs per cycle). This stage allocates working registers to individual μOPs. Also handled here is register renaming, which hides conflicts that arise in code from reusing the same limited set of architectural registers. Internally renaming (substituting) the registers allows such nominally (but not actually) conflicting operations to be performed simultaneously, gaining performance. In this stage the processor can also eliminate some operations that don't have to be scheduled into the execution units at all (MOV elimination, handling of zeroing idioms).

Out-of-order window as big as Zen 3's

Following this, the pipeline continues with the Re-Order Buffer (ROB), a queue that the processor uses to reorder instructions and pick those to be executed next, ideally getting as much work done per cycle as possible and utilising as many of the available execution units as it can. An out-of-order CPU can execute operations that do not depend on each other in parallel, and if the instructions that are supposed to enter execution don't yet have their prerequisites ready (due to a data dependency, i.e. a dependency on the result of a preceding instruction), it can reach further into the future and execute other, independent instructions instead. This compacts the code so that it is fully executed in fewer cycles in total. However, for these out-of-order optimisations to be effective, one factor is important: the number of instructions (the out-of-order window) the CPU "sees" for this purpose—the more, the better. This window is equal to the ROB queue in Intel's processors, so for best performance we want a bigger ROB.
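
A tiny, idealised simulation can illustrate why the window size matters. This is our own toy model, not a description of Gracemont's actual scheduler: one long-latency operation (standing in for a cache miss) blocks retirement, and only the independent work that fits into the window behind it can proceed in the meantime.

```python
# Toy model of an out-of-order window (ROB). No ports, renaming or realistic
# latencies - just "an op may issue once its inputs have completed and it
# sits no more than 'window' entries past the oldest unfinished op".

def total_cycles(program, window, width=5):
    """program: list of (list_of_dependency_indices, latency) per op."""
    n = len(program)
    finish = [None] * n                      # completion cycle of each op
    cycle = 0
    while any(f is None for f in finish):
        cycle += 1
        # the oldest op that has not completed yet pins the window
        oldest = min(i for i in range(n)
                     if finish[i] is None or finish[i] >= cycle)
        issued = 0
        for i in range(oldest, min(oldest + window, n)):
            deps, lat = program[i]
            if finish[i] is None and issued < width and \
               all(finish[d] is not None and finish[d] < cycle for d in deps):
                finish[i] = cycle + lat - 1
                issued += 1
    return max(finish)

# One slow "miss" (latency 40) followed by lots of independent 1-cycle work.
program = [([], 40)] + [([], 1) for _ in range(200)]
for w in (32, 256):
    print(f"window of {w:3} entries: {total_cycles(program, w)} cycles")
```

With the small window, the core runs out of visible independent work while the slow operation is outstanding and idles; the large window keeps it busy the whole time.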

And the ROB (Re-Order Buffer) in Gracemont is relatively large, at least for a core that some would call little. Its depth is 256 entries. That might be just half of what the big Golden Cove core has, but it is also more than the big Skylake core had (224). And notably, this value is exactly equal to the ROB size of AMD's high-performance Zen 3 core. While Zen 3 is likely somewhat of an exception with an unusually small ROB, and AMD will presumably boost its depth significantly soon, this does show that Intel is relatively aggressive in its sizing of the Gracemont architecture, pushing outside the "little core" realm.


Execution units: 17 ports and high IPC potential

What was said about Gracemont having rather too many resources for a supposedly "little" core also applies to the execution units in the backend. Intel has widened this part enormously and added a significant number of execution ports, which are the interfaces through which the various execution units are exposed in Intel CPU cores. Amusingly, Gracemont now even contains more execution ports than the "big" Golden Cove core. This is easily explained, though: Golden Cove uses a unified scheduler for all operations (ALU, AGU and FPU/SIMD), with the ALU and FPU/SIMD units sharing the same five execution ports. Not so Gracemont: it uses a split design much like AMD's cores (or ARM designs). The ALUs have their own ports, the AGUs have their own ports, and this time the FPU and SIMD units operating on vector registers (YMM, XMM and legacy x87 registers) also received their own dedicated execution ports, so they no longer need to share them with the general-purpose ALUs. We should clarify, however, that the little Atom cores have been like this before; Gracemont merely inherited this trait from Tremont and simply expands the number of ports.

Where Tremont had just 10 ports for all its units, Gracemont adds extra ALU, AGU and FPU/SIMD units, and due to this the number of ports has jumped to a whopping 17 (Golden Cove has just 12 due to its port-sharing scheme).
The number of ALUs has been expanded from three in Tremont to four in Gracemont (the same number Zen 3 commands; Golden Cove has just one more). Out of these four units, two perform only simpler operations like adds, shifts and logic ops. The remaining two ALUs have the same capability but can also perform integer multiplies (these more complex ops can therefore be executed with a throughput of 2/cycle, while the simpler ops go up to 4/cycle). There are even two integer dividers. The ALUs sit behind ports 0, 1, 2 and 3.
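
A quick way to see what this port layout implies is a throughput bound on an instruction mix. The little Python helper below uses only the unit counts above and deliberately ignores dependencies, scheduling and divide latency, so it is an upper bound, not a prediction.

```python
# Simple throughput bound for an integer-op mix, using only the unit counts
# above. Illustrative: ignores dependencies, scheduling and divide latency.

ALUS = 4        # ports 0-3, all handle adds/shifts/logic
MUL_UNITS = 2   # two of the four ALUs also handle integer multiplies

def min_cycles(simple_ops, mul_ops):
    # every op needs one ALU slot; multiplies are further limited to 2 units
    return max((simple_ops + mul_ops) / ALUS, mul_ops / MUL_UNITS)

print("mostly simple ops :", min_cycles(simple_ops=80, mul_ops=20), "cycles")
print("multiply-heavy mix:", min_cycles(simple_ops=20, mul_ops=80), "cycles")
```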

Gracemont also has two separate ports (30, 31) occupied by two JMP units used for handling branches. Thanks to this, branching does not consume ALU throughput; Gracemont can also handle two branches per cycle, while Tremont was limited to one.

FPU (SIMD): AVX & AVX2, VNNI

The separate FPU/SIMD section is exposed behind five ports. Two of them (28 & 29) are dedicated to Store Data pipelines (for SIMD registers, as opposed to the Store Data pipes that work with general-purpose x86 registers). The actual execution units handling FPU and SIMD instructions are behind ports 20, 21 and 22. Gracemont has two symmetrical (i.e. with identical abilities and limitations) pipelines for executing floating-point SIMD instructions, including adds (FADD), multiplies (FMUL) and also AES cryptography acceleration. There is only one floating-point divider and only one SHA acceleration unit. The floating-point capability could be up to 2× stronger than in Tremont, if we put division throughput aside: Tremont also has two ports for FPU/SIMD operations, but there is just a single FADD and a single FMUL unit behind them.

Integer SIMD operations (useful in multimedia code, for example) can even be executed at a rate of three ops/cycle for some instructions, compared to two per cycle in Tremont. However, only simpler instructions like integer SIMD adds have this throughput; integer SIMD multiplies are executed by only a single pipeline, which gives a throughput of 1 op/cycle, unchanged from Tremont.

The biggest SIMD news is that Gracemont supports the AVX and AVX2 instructions, being the very first Atom-lineage microarchitecture that can execute these 256-bit instructions at all. However, Intel does not say whether these instructions are executed in one cycle at full width (which would mean the SIMD pipelines are fully 256-bit wide). If that were the case, Gracemont's SIMD capability would be very close to the client version of Golden Cove (which has AVX-512 support disabled).
For that reason, we expect that Gracemont more likely uses 128-bit SIMD units, where either two units gang together to perform a single 256-bit instruction at a rate of one per cycle (whereas 128-bit SSE* instructions could be done at double that rate), or 256-bit operations are split into two 128-bit μOPs and handled in two consecutive cycles. This is how Zen 1, for example, executes AVX and AVX2. In theory this means the computational throughput is half that of natively 256-bit units (and unchanged from running 128-bit SSE instructions), but in practice the hit is not always that big.
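
For a feel of what is at stake, here is the peak-throughput arithmetic for the two scenarios, counted in FP32 lanes per cycle. The pipeline widths and counts are the assumptions discussed above, not confirmed Intel specifications.

```python
# Peak-throughput arithmetic for the scenarios above. The unit widths and
# counts are the assumptions discussed in the text, not confirmed specs.

LANES_PER_128 = 4      # FP32 lanes in a 128-bit register half
PIPES = 2              # two symmetrical FP/SIMD pipelines

# (a) hypothetical natively 256-bit pipelines: a whole AVX op per pipe, per cycle
native_256_lanes = PIPES * 2 * LANES_PER_128
# (b) 128-bit pipelines: a 256-bit op is either split across both pipes or
#     double-pumped over two cycles - either way the same lanes/cycle result
split_128_lanes = PIPES * LANES_PER_128

print(f"natively 256-bit pipes : {native_256_lanes} FP32 lanes/cycle")
print(f"128-bit pipes (assumed): {split_128_lanes} FP32 lanes/cycle")
```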

Thanks to the AVX and AVX2 support in Gracemont, it is possible to enable these instructions on the big Golden Cove too, which is perhaps the most important aspect here. Intel's first hybrid chip, the Lakefield mobile SoC composed of one Ice Lake/Sunny Cove core and four Tremont little cores, had to keep AVX/AVX2 completely disabled because of Tremont's lack of support, costing performance and compatibility. Another instruction set mentioned as supported in Gracemont, and courtesy of that also supported by the whole Alder Lake package, is FMA3 (fused multiply-add).

Another important bit is that Gracemont retains support for the VNNI instruction set extension, even if it has been reduced from ops using 512-bit ZMM (AVX-512) registers to 256-bit YMM registers (the AVX/AVX2 form). This again has enabled Intel to keep this extension, which is useful for AI acceleration, in the big Golden Cove core, even if at the cost of limiting it to 256-bit width.

Gracemont also supports the Control-flow Enforcement Technology security extension, which premiered in Tiger Lake processors, as well as VT-rp (Virtualization Technology redirect protection). The core should also be further hardened against speculative/side-channel vulnerabilities and attacks, according to Intel.


The Load/Store part of the Gracemont core is another very strong story: the core has four AGUs (address-generation units) on ports 10, 11, 12 and 13, and their pipelines can perform two reads and, simultaneously, two writes to memory (or to be precise, towards the L1 data cache) per cycle. This Load/Store capability matches the big Ice Lake/Tiger Lake/Rocket Lake cores, so the Load/Store subsystem can be considered very powerful. The Tremont core had just two AGUs that could do only two memory operations per cycle in total: two reads or two writes (or a 1+1 mix). The width of the memory operations is 16 bytes (128 bits), which means the core can read or write the data of two 128-bit SSE* vectors or one 256-bit AVX(2) vector per cycle—32B per cycle.

Gracemont is worse than the very strong Golden Cove in the bandwidth achieved. Golden Cove can't do that many more operations (3 loads + 2 stores per cycle), but crucially it uses paths of double the width, 256 bits (32B). This gives it significantly better bandwidth, with reads theoretically possible at 96B/cycle and writes at 64B/cycle. This could make a huge difference in demanding SIMD computations working on lots of data. On the other hand, Zen 3, for example, provides nowhere near this bandwidth (it has mostly half of Golden Cove's width and bandwidth, and also just three AGUs instead of five), yet its overall good performance shows that this could be less of a handicap than one would assume, or at least that it does not become a severe bottleneck that often.
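
To translate the per-cycle figures into familiar units, a quick conversion follows; the clock speeds used are placeholder assumptions for illustration, not announced specifications.

```python
# Quick conversion of the per-cycle figures above into GB/s. The clock speeds
# are placeholder assumptions for illustration, not announced specifications.

def gb_per_s(bytes_per_cycle, clock_ghz):
    return bytes_per_cycle * clock_ghz      # 1 B/cycle at 1 GHz = 1 GB/s

GRACEMONT_GHZ = 3.8    # assumed, roughly matching leaked all-core clocks
GOLDENCOVE_GHZ = 5.0   # assumed

print("Gracemont   loads :", gb_per_s(2 * 16, GRACEMONT_GHZ), "GB/s")
print("Gracemont   stores:", gb_per_s(2 * 16, GRACEMONT_GHZ), "GB/s")
print("Golden Cove loads :", gb_per_s(3 * 32, GOLDENCOVE_GHZ), "GB/s")
print("Golden Cove stores:", gb_per_s(2 * 32, GOLDENCOVE_GHZ), "GB/s")
```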

Because Gracemont uses two dedicated Load pipelines and two dedicated Store pipelines, whereas Tremont used two universal Load/Store pipelines, program code that simultaneously reads and writes cached data can reach up to 2× the throughput of these memory operations on Gracemont, so a significant theoretical bottleneck has been alleviated. However, for code that largely just reads or just writes at a time and doesn't mix the two, there is no difference compared to Tremont.
Lastly, similarly to Golden Cove, Gracemont also uses two separate pipelines on their own dedicated ports (8, 9) for Store Data.

Cache: shared L2 (and L3)

The L1 data cache of Gracemont has a capacity of 32 KB, which is smaller than the 48KB L1D serving Golden Cove (or Sunny/Willow/Cypress Cove in Ice/Tiger/Rocket Lake processors). It is, however, the same size as the 32KB L1D of AMD's Zen 3 core. In Gracemont's case the 32KB cache might be lower-latency, though. Intel doesn't state the overall latency, but it mentions that pointer chasing has only a 3-cycle latency (it is possible that other, general-case operations have a 4-cycle latency).

The L2 cache in Gracemont has a configurable capacity: it can be either 2MB or 4MB. This is not per single core, however; instead, the L2 cache is shared by a cluster of four Gracemont cores that forms the basic building block of this architecture. Based on information that leaked some time ago, the implementation in Alder Lake only uses the lower capacity, meaning the quad-Gracemont clusters in these processors will each have just 2MB of L2 cache.

The bandwidth of this cache is 64 bytes per cycle (512 bits/cycle), which is a decent amount for four cores. The latency is 17 cycles according to Intel, which would be poor for a small dedicated L2 cache, but we suppose it is acceptable for this large cache shared by four cores. Gracemont's L2 cache supports up to 64 outstanding cache misses in parallel (these are requests for data that were not found in the L2 cache and thus are being requested from the L3 cache or RAM). The ability to have a large number of outstanding cache misses in flight improves memory performance: a bigger buffer of these operations decreases the chance that the core has to stall further data requests because earlier misses have not yet been served.

Intel says prefetching has been improved for all cache levels in Gracemont and that the prefetchers are able to detect many more access patterns and select proper prefetch strides for them. Gracemont also supports Intel Resource Director Technology, which allows software to configure bandwidth allocation and QoS for individual cores/threads.

Schematic view of the Gracemont core (Source: Intel)

While the L1 and L2 caches are part of the Gracemont microarchitecture, we won't directly cover the L3 cache here, as Intel has not talked about it in the architectural reveal. Based on what we know so far from unofficial sources, the L3 cache will be common (and shared) to both the Golden Cove and Gracemont cores in Alder Lake, and its parameters will therefore be the same for both. Intel's L3 cache is tied to the ring bus interconnect that will connect the big and little cores together. Or to be more precise, it will connect the big cores and the clusters of little cores, because it looks like it is not the individual Gracemont cores but the whole quad-core cluster that is a client of the ring bus. In line with that, the quad-core cluster of course also has just one shared block of L3 cache, which should be identical to the block that a single big Golden Cove core gets. The capacity of this block should be 3 MB in Alder Lake. The top SKU with 8 big and 8 little cores should therefore have a 30MB L3 cache comprised of 10× 3MB.


When presenting Gracemont, Intel compares its performance and efficiency with the 2015 Skylake microarchitecture, which is one of the most common cores in PCs (if not the most common), thanks to the exceptionally long period during which Intel kept selling it. Intel even claims that Gracemont has a higher IPC than this core, so it should deliver higher single-thread performance at equal clocks. We should point out that Gracemont is unlikely to reach the same high frequencies (up to 5.3 GHz in the fastest Skylake iterations).

Intel's comparison charts show Gracemont (Efficient Core) beating Skylake in the SPECrate2017_int_base benchmark while simultaneously keeping power consumption lower—and Gracemont's absolute performance curve ends higher than Skylake's. However, this might just be due to Intel limiting Skylake's clock in this comparison to artificially low values (our guess is it could be just 4.2 GHz, the top speed of the original Skylake, the Core i7-6700K).

If both cores were clocked to achieve equal performance, Gracemont would require as little as 40% or less of the power that Skylake would need. Looking at it the other way around, if Gracemont were allowed to consume as much power as Skylake, the new Efficient Core is said to be able to deliver up to 40% more performance.

Performance and power consumption of Gracemont (E-Core) architecture compared to Skylake, in single-thread tasks (Source: Intel)

Intel then shows a multi-thread performance comparison, in which they pit four Gracemont cores against two Skylake cores with HT (and thus also four threads). In this scenario, Gracemont achieves up to 80% better performance in SPECrate2017_int_base than Skylake while again consuming less power. Intel even claims that the four Gracemont cores should be able to provide the performance of a 2C/4T Skylake at less than 20% of the power consumed. Sadly, Intel again doesn’t specify the clock frequencies at which these performance comparisons are drawn.

Performance and power consumption of Gracemont (E-Core) architecture compared to Skylake, in multi-thread tasks (Source: Intel)

It's better to take these comparisons with a grain of salt, because when establishing the power efficiency of a core (which is mostly what all these charts are about in the end), it is incredibly important to know at which frequency (and related voltage) the core is running. We can be almost certain that if Skylake were clocked to 5.0 GHz, for example, Gracemont would never achieve 40% more single-thread performance, simply because its architecture likely has a much lower frequency ceiling, meaning it will stop scaling much sooner. Due to its efficient architecture, Gracemont probably can't even reach the power consumption that Skylake hits at the end of its frequency curve.

It is true that Intel has not yet revealed the depth (number of stages) of Gracemont's pipeline, which is one of the main factors limiting the frequency ceiling, but other factors (like the partially 3-cycle L1D latency) point to a lower frequency ceiling compared to Skylake (or Golden Cove). When comparing the power consumed by Skylake and by Gracemont, there is another important caveat to keep in mind, by the way: Skylake is a 14nm design, while Gracemont will have the benefit of the more advanced Intel 7 manufacturing node (previously known as 10nm Enhanced SuperFin).

Gracemont vs. Golden Cove?

What's perhaps more useful is the comparison where Intel pits Gracemont against the big core of the same generation, Golden Cove (the P-Core in the chart). While Intel might be downplaying its old Skylake to make the new Gracemont look better, the company has no incentive to do the same with Golden Cove, so we believe this comparison is less at risk of being biased.

Intel's charts suggest that the big performance core (Golden Cove) can achieve up to 50% higher single-thread performance than the best a Gracemont (E-Core) can reach. This of course comes at much higher power consumption, which is kind of the point of Efficient Cores after all. It's not clear what the power delta is; the P-Core might even need a multiple of the watts consumed by the E-Core. The E-Core does, however, also display a pronounced knee in its curve at top clocks: there is probably a relatively steep power cost to the last few percent of its maximum performance, and reaching 100% of the achievable clock might require far more power than running at 90% of it.

Performance and power consumption of Gracemont (E-Core) architecture compared to Golden Cove (Performance Core) (Source: Intel)
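
The steep knee at the end of the E-Core curve mentioned above is what the classic dynamic-power relation (P ≈ C·f·V²) predicts once extra voltage is needed for the last bit of clock. The frequency/voltage pair in the sketch below is invented purely to illustrate the effect.

```python
# Illustrative dynamic-power arithmetic (P ~ C * f * V^2): the voltage needed
# for the last bit of clock makes it disproportionately expensive. The
# frequency/voltage pairs below are invented purely to show the effect.

points = [
    (0.90, 1.00),   # 90 % of max clock at a comfortable voltage (made up)
    (1.00, 1.20),   # the final 10 % of clock needs extra voltage (made up)
]

base_f, base_v = points[0]
base_power = base_f * base_v ** 2
for f, v in points:
    rel_power = (f * v ** 2) / base_power
    print(f"clock at {f:.0%} of max: relative power {rel_power:.2f}x")
```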

This showing would actually be pretty good for Gracemont. If Golden Cove achieves 50% better single-thread performance, it means Gracemont achieves two thirds of Golden Cove's peak ST performance—which again could warrant putting it in the big core category, even if that means being comparable to big cores from a few generations ago. Such single-thread performance is still quite acceptable today, particularly when we keep in mind that the Efficient Core's point is multi-thread rather than single-thread performance.

For multi-thread performance, Intel shows a comparison between a processor with four Golden Cove cores and a hybrid configuration combining two performance Golden Cove cores and eight Gracemont efficient cores (2+8, reportedly the configuration Intel will sell in the 15W notebook CPU segment). Intel's chart shows the hybrid alternative achieving more than 50% higher multi-thread performance than the homogeneous Golden Cove config. This is an interesting data point, if true. It means that four Gracemont cores add the same or even better multi-thread performance as two big Golden Cove cores: the first four efficient cores fully make up for the loss of two performance cores, and the second added quad-core group adds the same amount of performance again as a bonus. If Intel's projection is accurate, the 2+8 Alder Lake processors should be as fast as a hypothetical Golden Cove hexacore, or better.
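
The arithmetic behind that conclusion can be checked in a few lines; the inputs are just the relative figures read off Intel's chart, treated as rough estimates rather than measurements.

```python
# Back-of-the-envelope check of Intel's multi-thread comparison, using the
# relative figures from the chart (rough estimates, not measurements).

golden_cove_mt = 1.0                      # MT contribution of one P-core := 1
four_p_cores = 4 * golden_cove_mt         # baseline: 4x Golden Cove

hybrid_total = 1.5 * four_p_cores         # 2P + 8E shown as >50 % faster
eight_e_cores = hybrid_total - 2 * golden_cove_mt

print(f"8 E-cores contribute ~{eight_e_cores:.1f} P-core equivalents of MT throughput")
print(f"so 4 E-cores are roughly worth {eight_e_cores / 2:.1f} P-cores")
```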

This comparison is, however, not made at the same power consumption. The 2+8 processor is shown drawing more power in return for its extra performance, which means that four Gracemont cores consume more energy when fully loaded than one Golden Cove. It has to be said, though, that the power consumption and efficiency of both architectures will greatly depend on the clocks and voltages used in the particular scenario.

The exact performance and IPC are something best left for when independent reviews are available after launch. According to leaked specs of 125W desktop Alder Lake processors, though, Gracemont cores seem to reach clock speeds as high as 3.6 to 3.9 GHz, which promises nice performance if their IPC beats Skylake as Intel claims. At least in tasks that don't profit tremendously from Hyper-Threading (the second thread of the big cores), that is.

The Gracemont core takes up much less space in an Alder Lake processor than one Golden Cove core, but again we don't know the exact values yet. Intel's schematic illustrations would make you think that four Gracemonts fit in the same area as one Golden Cove, but that is unlikely to be a fully accurate representation. Elsewhere in Intel's presentation it was said that a cluster of four Gracemont cores takes a similar area to a Skylake core, so perhaps this 4:1 ratio could be close after all (Skylake is, however, manufactured on the much less dense 14nm node).

In practice, the "math" that removing the little E-cores would let Intel add merely one big P-core for every four E-cores thus removed might be pretty accurate. It looks like the big.LITTLE combination should indeed enable higher multi-thread performance, in line with what we discussed in the first chapter. That's the efficiency and scalability Intel is touting. But to be absolutely certain this is true, we have to wait for Intel to reveal the actual core silicon area and power consumption data.
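
Combining the two rough ratios above (about four E-cores per P-core of die area, and about four E-cores per two P-cores of MT throughput) gives a crude estimate of throughput per unit of area; the inputs are the same unofficial estimates as above, not measured data.

```python
# Crude MT-throughput-per-area estimate built on the rough ratios discussed
# in the text (unofficial estimates, not measured silicon data).

p_core_area = 1.0
e_core_area = p_core_area / 4          # ~4 E-cores in one P-core's footprint
p_core_mt = 1.0
e_core_mt = (2 * p_core_mt) / 4        # 4 E-cores ~ 2 P-cores of MT throughput

print(f"P-core MT throughput per unit area: {p_core_mt / p_core_area:.1f}")
print(f"E-core MT throughput per unit area: {e_core_mt / e_core_area:.1f}")
```

Under these assumptions an E-core cluster delivers roughly twice the multi-thread throughput per square millimetre, which is exactly the kind of scalability argument laid out in the first chapter.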

The little core is worth more than a little attention

As with Golden Cove, the first performance reviews of the Gracemont cores will be very interesting reads, perhaps even more so than for the Performance Core. When Alder Lake launches some two months from now, it should be one of the most noteworthy and intricate CPU architecture novelties of recent years, regardless of how well Intel manages to materialise its potential and performance promises. It will bring more new phenomena to study, benchmark and perhaps get used to than any other recent processor.

We have been focusing on Alder Lake all this time, but it should perhaps be said that the type of efficiently big (rather than as big as possible) core that Gracemont represents should be useful for more than just padding the multi-thread benchmark scores of hybrid CPUs. This core should be a viable architecture even on its own, in homogeneous all-Gracemont processors. One area where it should do well is cheap mobile processors, or CPUs for fanless devices. An SoC continuing the tradition of Intel's low-power little-core processors (Apollo Lake, Gemini Lake, Jasper Lake…) would be a nice product. At this time, however, we have no information as to whether Intel is working on one.

Sources: Intel, AnandTech (1, 2)

Jan Olšan, editor for Cnews.cz