Gracemont, the (not so) little Alder Lake core (µarch analysis)

Execution units: 17 ports and high IPC potential

Intel has revealed the Alder Lake CPU architecture, or rather two architectures this time. The CPUs are hybrid: besides the main “big” cores, there are “little” cores called Gracemont. These are not just there for marketing or for low-power idle tasks as in mobile ARM SoCs, however. Gracemont should add significantly to overall performance, and the architecture is actually surprisingly beefy. Our analysis will show you more.

The point that Gracemont carries rather a lot of resources for a supposedly “little” core applies to the execution units in the backend too. Intel has widened this part enormously and added a significant number of execution ports, the interfaces through which the various execution units are exposed in Intel CPU cores. Amusingly, Gracemont now even has more execution ports than the “big” Golden Cove core. This is easily explained, however: Golden Cove uses a unified scheduler for all operations (ALU, AGU and FPU/SIMD), with the ALU and FPU/SIMD units sharing the same five execution ports. Gracemont does not; it uses a split design much like AMD’s cores (or ARM designs). The ALUs have their own ports and the AGUs have their own ports, but here the FPU and SIMD units that operate on vector registers (YMM, XMM and the legacy x87 registers) also get dedicated execution ports, so they do not have to share them with the general-purpose ALUs. We should clarify, however, that the little Atom cores already worked this way: Gracemont merely inherits this trait from Tremont and increases the number of ports.

Where Tremont had just 10 ports for all its units, Gracemont has extra ALUs, AGUs and FPU/SIMD units, and as a result the number of ports has jumped to a whopping 17 (Golden Cove has just 12 due to its port-sharing scheme).

The number of ALUs has been expanded from three in Tremont to four in Gracemont (the same number as Zen 3; Golden Cove has just one more). Two of these four units perform only simpler operations such as adds, shifts and logic ops. The remaining two ALUs have the same capabilities but can also perform integer multiplication (these more complex ops can therefore execute at a throughput of 2/cycle, while the simpler ops reach up to 4/cycle). There are even two integer dividers. The ALUs sit behind ports 0, 1, 2 and 3.
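
To make the ALU throughput figures more tangible, here is a minimal Python sketch that computes a lower bound on the number of cycles needed to issue a mix of independent integer ops. It assumes an idealized scheduler, no data dependencies and no other bottlenecks (frontend, retirement, latency); the function name and the model itself are our own illustration, not something Intel has published.

```python
import math

ALU_PORTS = 4      # ports 0-3: all four ALUs handle simple ops (add, shift, logic)
MUL_CAPABLE = 2    # two of the four ALUs can also do integer multiply

def min_issue_cycles(simple_ops: int, mul_ops: int) -> int:
    """Lower bound on cycles to issue independent integer ops (idealized model)."""
    # Multiplies can only go to the two multiply-capable ALUs.
    mul_bound = math.ceil(mul_ops / MUL_CAPABLE)
    # All ops together compete for the four ALU ports.
    total_bound = math.ceil((simple_ops + mul_ops) / ALU_PORTS)
    return max(mul_bound, total_bound)

if __name__ == "__main__":
    print(min_issue_cycles(8, 0))  # eight simple ops
    print(min_issue_cycles(0, 8))  # eight multiplies
    print(min_issue_cycles(4, 4))  # mixed workload
```

In this model, a multiply-heavy stream is limited by the two multiply-capable ALUs, while a mixed stream can still keep all four ports busy.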

Gracemont also has two separate ports (30, 31) occupied by two JMP units that handle branches. Thanks to this, branching does not take up ALU throughput, and Gracemont can handle two branches per cycle, while Tremont is limited to one.

FPU (SIMD): AVX & AVX2, VNNI

The separate FPU/SIMD part is exposed behind five ports. Two of them (28 and 29) are dedicated to Store Data pipelines (storing from SIMD registers, as opposed to the Store Data pipes that work with general-purpose x86 registers). The actual execution units handling FPU and SIMD instructions sit behind ports 20, 21 and 22. Gracemont has two symmetrical pipelines (that is, with identical abilities and limitations) for executing floating-point SIMD instructions, including addition (FADD) and multiplication (FMUL), as well as AES cryptography acceleration. There is only one floating-point divider and only one SHA acceleration unit. Setting division throughput aside, floating-point performance could be up to 2× that of Tremont, which has just two ports for FPU/SIMD operations with only a single FADD and a single FMUL unit behind them.
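
The port numbers mentioned so far can be tallied against the quoted total of 17. The short Python sketch below does just that; the group labels and the function name are ours. The ports left over are presumably the AGUs and the general-purpose Store Data pipes, which this chapter does not enumerate, so treat that part as an inference rather than a confirmed breakdown.

```python
# Port numbers as given in the text; the group labels are our own.
PORTS = {
    "ALU": [0, 1, 2, 3],
    "JMP (branch)": [30, 31],
    "FPU/SIMD execution": [20, 21, 22],
    "SIMD store data": [28, 29],
}
TOTAL_PORTS = 17  # total quoted for Gracemont

def remaining_ports(ports=PORTS, total=TOTAL_PORTS) -> int:
    """Ports not enumerated in the text (assumed: AGUs, integer store data)."""
    listed = sum(len(group) for group in ports.values())
    return total - listed

if __name__ == "__main__":
    for name, group in PORTS.items():
        print(f"{name}: {len(group)} ports {group}")
    print("not enumerated here:", remaining_ports())
```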

Integer SIMD operations (useful in multimedia code, for example) can even execute at a rate of three ops per cycle for some instructions, compared to two per cycle in Tremont. However, only simpler instructions such as integer SIMD add reach this throughput; integer SIMD multiply is handled by a single pipeline, which gives a throughput of 1 op/cycle, unchanged from Tremont.

The biggest SIMD news is that Gracemont supports the AVX and AVX2 instructions, making it the very first Atom-lineage microarchitecture able to execute these 256-bit instructions at all. However, Intel does not say whether these instructions are executed as a single full-width operation (which would mean the SIMD pipelines are fully 256 bits wide). If that were the case, Gracemont’s SIMD capability would be very close to that of the client version of Golden Cove (which has AVX-512 support disabled).

For that reason, we consider it more likely that Gracemont uses 128-bit SIMD units, with either two units ganging together to perform a single 256-bit instruction at a rate of one per cycle (whereas 128-bit SSE* instructions could run at double that rate), or 256-bit operations being split into two 128-bit μOPs handled in two consecutive cycles. This is how Zen 1, for example, executes AVX and AVX2. On paper, this means that compute throughput is half that of native 256-bit units (and no higher than when running 128-bit SSE instructions), but in practice the hit is not always that big.
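
The arithmetic behind this comparison can be sketched in a few lines of Python. The model below is a deliberate simplification of our own: it assumes two symmetrical pipelines, 32-bit elements, and that an instruction wider than the unit is split into equally sized μOPs. The actual unit width in Gracemont is not confirmed by Intel, so the 128-bit case is only our expectation.

```python
def peak_lanes_per_cycle(pipes: int, unit_bits: int, instr_bits: int,
                         elem_bits: int = 32) -> float:
    """Peak elements processed per cycle in an idealized model."""
    # An instruction wider than the unit is split into several micro-ops.
    uops_per_instr = max(1, instr_bits // unit_bits)
    instrs_per_cycle = pipes / uops_per_instr
    lanes_per_instr = instr_bits // elem_bits
    return instrs_per_cycle * lanes_per_instr

if __name__ == "__main__":
    # Two 128-bit pipes (our expectation for Gracemont, unconfirmed).
    print("128-bit units, SSE:", peak_lanes_per_cycle(2, 128, 128))
    print("128-bit units, AVX:", peak_lanes_per_cycle(2, 128, 256))
    # Two native 256-bit pipes, for comparison.
    print("256-bit units, AVX:", peak_lanes_per_cycle(2, 256, 256))
```

In this model, 128-bit units deliver the same peak number of elements per cycle for SSE and AVX code, and half of what native 256-bit units would deliver with AVX.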

Thanks to AVX and AVX2 support in Gracemont, these instructions can be enabled on the big Golden Cove cores too, which is perhaps the most important aspect here. Intel’s first hybrid chip, the Lakefield mobile SoC, which combined one Ice Lake/Sunny Cove core with four little Tremont cores, had to keep AVX/AVX2 completely disabled because Tremont lacked support, at a cost to performance and compatibility. Another instruction set listed as supported in Gracemont, and therefore by the whole Alder Lake package, is FMA3 (fused multiply-add).
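
The underlying rule is that a hybrid chip can only expose the instructions every one of its core types supports, so the usable set is the intersection of the per-core sets. A minimal Python sketch of that idea follows; the feature lists are simplified and purely illustrative (they are our assumption, not complete CPUID dumps), and the function name is hypothetical.

```python
def common_isa(*core_features: set) -> set:
    """ISA extensions usable on a hybrid chip: those supported by every core type."""
    return set.intersection(*core_features)

# Simplified, illustrative feature lists (assumptions, not full CPUID data).
SUNNY_COVE = {"SSE4.2", "AVX", "AVX2", "FMA3", "AVX-512"}
TREMONT = {"SSE4.2"}
GOLDEN_COVE = {"SSE4.2", "AVX", "AVX2", "FMA3", "AVX-512", "VNNI"}
GRACEMONT = {"SSE4.2", "AVX", "AVX2", "FMA3", "VNNI"}

if __name__ == "__main__":
    print("Lakefield:", sorted(common_isa(SUNNY_COVE, TREMONT)))
    print("Alder Lake:", sorted(common_isa(GOLDEN_COVE, GRACEMONT)))
```

With these inputs, the Lakefield pairing loses AVX/AVX2 entirely, while the Alder Lake pairing keeps AVX, AVX2, FMA3 and VNNI but drops AVX-512.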

Another important point is that Gracemont retains support for the VNNI instruction set extension, even though it has been cut down from operating on 512-bit ZMM (AVX-512) registers to 256-bit YMM registers (as used by AVX/AVX2 ops). This in turn has allowed Intel to keep this extension, which is useful for AI acceleration, in the big Golden Cove core, albeit at the cost of limiting it to 256-bit width.
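
To show what a VNNI operation actually computes, here is a plain-Python emulation of one representative instruction, VPDPBUSD, at 256-bit width: each of the eight 32-bit lanes accumulates the dot product of four unsigned bytes and four signed bytes. The choice of this particular instruction and the function name are ours; the sketch models only the arithmetic, not performance or encoding.

```python
def vpdpbusd_256(acc, a_u8, b_s8):
    """Emulate a 256-bit VPDPBUSD: 8 int32 lanes, each += dot(4 x u8, 4 x s8)."""
    assert len(acc) == 8 and len(a_u8) == 32 and len(b_s8) == 32
    assert all(0 <= x <= 255 for x in a_u8)      # unsigned bytes
    assert all(-128 <= x <= 127 for x in b_s8)   # signed bytes
    out = []
    for lane in range(8):
        dot = sum(a_u8[4 * lane + i] * b_s8[4 * lane + i] for i in range(4))
        total = (acc[lane] + dot) & 0xFFFFFFFF   # wrap to 32 bits, no saturation
        if total >= 0x80000000:                  # reinterpret as signed int32
            total -= 1 << 32
        out.append(total)
    return out

if __name__ == "__main__":
    acc = [10] * 8
    a = [1, 2, 3, 4] * 8     # unsigned activations
    b = [1, -1, 1, -1] * 8   # signed weights
    print(vpdpbusd_256(acc, a, b))
```

A 512-bit version would process 16 such lanes per instruction, which is the width that was given up for the sake of the hybrid design.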

Gracemont also supports the Control-flow Enforcement Technology security extension, which premiered in Tiger Lake processors, as well as VT-rp (Virtualization Technology redirect protection). According to Intel, the core should also be further hardened against speculative-execution and side-channel vulnerabilities and attacks.

The article continues in the next chapter.
