Skymont architecture analysed: does Intel's little core outgrow the big one?

Intel's new E-Core architecture looks very powerful, with a wider core than P-Core

Intel unveiled their next-gen Lunar Lake mobile processor at Computex 2024. It will power Copilot+ PCs with its NPU and is supposed to be very power efficient, but it’s extremely interesting mainly because of the new CPU architectures, which will power future Arrow Lake desktop CPUs. Ironically, the star of this generation might actually be the little efficient E-Core accompanying the big P-Cores. Its architecture seems to have taken a giant leap.

We’ve already discussed the big Lion Cove core, which, it should be said, is itself heavily reworked and seems to have undergone the biggest changes since Sandy Bridge or Nehalem. However, Intel has also significantly reworked (and beefed up) the E-Core, which carries the Skymont designation. This core is even beefier than the big core in some respects and should reach the IPC class of big cores.

It almost makes one wonder if Intel is planning to discontinue the “Cove” line of big cores in one of the upcoming generations and start building the main architecture on this “Monts” line, originally based on Atoms.

Three clusters of decoders, nine in total

Skymont visibly builds on the foundation of the Tremont and Gracemont cores, but massively strengthens it. This is most evident in the most striking feature of the architecture – multiple instruction decoder clusters. Tremont and Gracemont had two clusters of three decoders each, which took turns working; for example, when processing a branch, the second cluster would start decoding the instruction stream at the point the jump in the code was headed to. This solution is probably less complex and less power-hungry than a full-fledged parallel setup of six decoders in one cluster.

Skymont goes even further and has a total of nine decoders – specifically three clusters of three decoders each. It should work the same way as in the previous cores, except that the operations of three clusters are alternated and overlapped instead of two, so the decoding phase gets a 50% boost, even though the throughput of one isolated cluster technically does not increase. Each of these clusters can also decode microcoded instructions (Intel refers to this as the Nanocode feature), which improves performance in situations where code uses them heavily.

These decoders are fed from the L1 instruction cache by the fetch stage, which pulls in up to 96 bytes per cycle in total – a lot (probably three times 32 bytes from different locations of the L1i cache, which would then need three ports). The uOP queue, where operations from the decoders go, has been deepened from 64 to 96 entries. The branch predictor has also been improved compared to the previous generation.

Reorder Buffer larger than in Zen 4

The rename and allocate stages were enhanced as well: while Gracemont could handle six operations per cycle, Skymont can handle eight. The Reorder Buffer (RoB), within which the processor can reorder the execution sequence of instructions as part of Out of Order execution, has grown in size by 60%, which is one of the features (but not the only one) indicating the very ambitious performance goals of the core. While Gracemont used a RoB of 256 entries, Skymont has a RoB with 416 entries.

By the way, Skymont has a deeper RoB than the Zen 4 architecture (320 entries), as AMD tends to use surprisingly small RoBs, but it’s also deeper than what the Sunny Cove core from Ice Lake, Tiger Lake, and Rocket Lake processors employs (352 entries). The current big Golden Cove core uses a RoB 512 entries deep, and the upcoming Lion Cove will have 576 entries.

In addition to the RoB, the queues at each reservation station have also been increased. And the core also has more registers in its physical register file, so it has more capacity to rename architectural registers (to reduce conflicts) when processing code.

Little core that’s not exactly little

The separate ALU part and FPU (SIMD), which is a new feature of Lion Cove, was already present in the cores of this line, so there is no change here. In total, the backend with the execution units has a whopping 26 ports, and there are eight ALUs on those ports. Yes, more than in the big core, though they are split into complex ALUs and simple ALUs that can only execute a subset of operations. Four units should be able to do shift operations, division is supported by two ALUs and multiplication by two ALUs.

Furthermore, the core has three ports with JMP units, so it can process up to three branches per cycle, and two ports with Store Data units. The load/store part is very interesting, as it has seven AGU ports – again more than Lion Cove with six. Interestingly, though, only three AGUs handle load (read) operations, while the core has four AGUs for write (store) operations.

So the core can do three reads per cycle (their width stays at 128 bits) but four writes per cycle – although this refers to generating the address for an operation; the write itself still has to go through a Store Data pipeline. Usually, the balance is skewed the other way, in favor of load capacity (Gracemont could do two operations of each, a 1:1 ratio). But Intel seems to have reasons why adding more store pipelines is worthwhile – maybe they’re simpler, and the cost of a fourth store unit is lower than that of a fourth load unit, giving a better return on the die-size investment. Intel also deepened the read and write queues.

SIMD unit still 128-bit, but with doubled performance

Skymont still only has a 128-bit SIMD unit width, and is thus still somewhat behind the big cores in raw compute throughput it can provide. That is, compared to large x86 cores with 256-bit units (Intel), or now even 512-bit units (AMD’s Zen 5). However, ARM cores including those from Apple also have to make do with humble 128-bit units, including the largest cores, so Skymont is not at a big disadvantage against those.

The previous Gracemont core had three SIMD pipelines, but that was only true for integer SIMD vectors, as it could only execute floating-point ops on two of the three pipelines. Skymont has four pipelines and significantly boosted throughput for most operations; it can typically process twice as many instructions per clock, which means twice the GFLOPS/TOPS that can theoretically be squeezed out of the core. This improved performance can be exploited by multimedia applications such as video encoding and processing, as well as any AI applications performing inference on the CPU (performance in VNNI operations is also doubled).

The SIMD unit can perform four common integer operations per cycle (+33% vs. Gracemont) and two integer SIMD multiplications per cycle (2× vs. Gracemont). It can do four shuffle operations per cycle, which will make programmers optimizing code in assembler happy.

Floating-point addition and other common operations are possible at four per cycle (2×), and the Skymont core can also do four FMAs per cycle (2×, though this is not fully confirmed), which is pretty powerful. The core can do two division ops per cycle (2×). The ability to accelerate AES (two operations/cycle) and SHA (one operation/cycle) is unchanged.
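As a rough sanity check of what these per-cycle figures mean, the theoretical FP32 peak of such a core can be estimated from them. The clock speed below is a hypothetical placeholder, since Skymont's final clocks are not known:

```python
# Rough theoretical FP32 peak for a Skymont-class core.
# Assumptions (hypothetical): 4 FMA pipes, 128-bit SIMD, 3.8 GHz clock.
SIMD_WIDTH_BITS = 128
FP32_BITS = 32
LANES = SIMD_WIDTH_BITS // FP32_BITS   # 4 FP32 lanes per pipe
FMA_PIPES = 4                          # per the architecture disclosure
FLOPS_PER_FMA = 2                      # a fused multiply-add counts as 2 FLOPs
CLOCK_GHZ = 3.8                        # assumed value, final clocks unknown

peak_gflops = LANES * FMA_PIPES * FLOPS_PER_FMA * CLOCK_GHZ
print(f"Theoretical FP32 peak: {peak_gflops:.1f} GFLOPS per core")
```

Doubling the FMA pipes from two to four is exactly what doubles this figure versus a Gracemont-style core at the same clock.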

Intel states that the FMUL, FADD and FMA instructions also have shorter latency (fewer cycles of delay before the result is available). This does not translate into a theoretical throughput increase, but it raises the performance actually achieved in practice, because operations can depend on each other, and shorter latency then speeds up the execution of the running code. The core also has native hardware support for floating-point rounding.
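The effect of lower latency can be illustrated with a back-of-the-envelope model (the cycle counts here are illustrative assumptions, not Intel's published figures): a chain of mutually dependent FMAs is bound by latency, while a stream of independent FMAs is bound by throughput, so only the former benefits from a latency cut.

```python
# Illustrative model: cycles needed to execute N FMA operations.
# Latency/throughput numbers are hypothetical, chosen for illustration only.
def cycles_dependent(n_ops: int, latency_cycles: int) -> int:
    """Each op waits for the previous result: execution is latency-bound."""
    return n_ops * latency_cycles

def cycles_independent(n_ops: int, ops_per_cycle: int) -> float:
    """Ops with no mutual dependencies: execution is throughput-bound."""
    return n_ops / ops_per_cycle

N = 1000
# A dependent chain speeds up directly with a latency cut (e.g. 5 -> 4 cycles)...
chain_old = cycles_dependent(N, 5)
chain_new = cycles_dependent(N, 4)
# ...while independent code only cares about the number of pipes (4 FMAs/cycle).
stream = cycles_independent(N, 4)
print(chain_old, chain_new, stream)
```

This is why shorter latency helps real code even when the peak-throughput numbers stay the same.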

The final processing stages are also enhanced: the retire stage of the whole core’s pipeline can now process up to 16 instructions per cycle instead of 8. This probably allows the core to evict executed operations and their data from the core’s buffers and queues as quickly as possible after they finish, freeing up resources for further operations sooner.

Cache memory capacities are apparently not increasing. Gracemont cores used a 4 MB L2 cache shared by four cores in one cluster, but some processors (including all Alder Lake dies, some of which are also used in 13th and 14th generation Core processors) have reduced the capacity to 2 MB. For Skymont, presumably the capacity should always be 4 MB.

But the data throughput of this cache is doubled: 128 bytes per cycle can be transferred from it to the L1 cache and back, compared to 64 bytes in Gracemont, which will come in handy when the boosted SIMD unit has to be kept fed with data. The L2 TLB has been enlarged from 3072 entries to 4192.
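For scale, 128 bytes per cycle translates into substantial bandwidth at typical E-Core clocks. The clock speed used below is an assumed example value, not a confirmed Skymont figure:

```python
# L2 <-> L1 bandwidth implied by a bytes-per-cycle figure.
# Clock speed is an assumed example value.
def l2_bandwidth_gbs(bytes_per_cycle: int, clock_ghz: float) -> float:
    # GHz is 1e9 cycles/s, so bytes/cycle * GHz gives GB/s directly.
    return bytes_per_cycle * clock_ghz

print(l2_bandwidth_gbs(64, 3.0))   # Gracemont-style: 192.0 GB/s
print(l2_bandwidth_gbs(128, 3.0))  # Skymont-style:   384.0 GB/s
```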

Intel Skymont Core schematic (Author: Intel, via: Tom’s Hardware)

Intel’s most efficient high-performance core

That’s how the company characterizes the Skymont core, and there’s something to it, because this core can hardly be called little given its internal width of eight ALUs and nine decoders, and its performance should also be pretty high. According to Intel, Skymont even achieves IPC comparable to the previous generation’s big core (Raptor Cove in Raptor Lake processors), so it’s probably largely the achievable clock speed that separates Skymont from the big core and from having a shot at eventually replacing it. We don’t know yet whether the clocks will be the same or higher than Gracemont’s (which reached a maximum of 4.0–4.3 GHz, even though that was outside its efficient band), or whether they will decrease, which wouldn’t be entirely surprising given the width of the core.

According to Intel, Skymont achieves up to 38% better single-threaded performance per clock (IPC) compared to the previous E-Core architecture in integer tasks. In floating-point code, which employs the significantly more powerful FPU/SIMD unit, IPC is claimed to be as much as 68% higher. These increases vary quite a bit depending on the application, and as you can see in the Intel chart, the percentages quoted are an average.

However, this comparison is not drawn against the Gracemont architecture, but against the related but slightly different Crestmont core used in Meteor Lake processors. Note, though, that this is the LP E-Core variant of Crestmont, so the comparison is against lower-clocked cores sitting in the processor’s low-power island, which crucially do not have access to the L3 cache – and according to testing, that reduces their performance a lot. These cores have only 2 MB of L2 cache, while Skymont benefits from 4 MB in this test (and also from Lunar Lake’s SLC). Thus, part of the very attractive IPC increase may be due to Crestmont’s performance being throttled a bit by this LP implementation in the Meteor Lake processor.

According to Intel, Skymont achieves up to 2× the performance of the Crestmont LP E-Core (but that’s at higher clock speed and power consumption). At the same power consumption as the Crestmont LP E-Core, it can be up to 70% faster. And if you lower the voltage and scale Skymont’s performance down to Crestmont LP levels, the power consumption can be just one-third. However, this is how voltage and clock speed curves generally behave, and it seems to happen a lot with new architectures (not to mention architectures built on a better process node), so this doesn’t mean the core brings some absolutely stand-out efficiency gain.

Skymont has a better IPC than today’s Intel big core

However, the core should be powerful even when not compared to the disadvantaged version of Crestmont. In fact, Intel even provides a comparison against the big P-Cores in a Raptor Lake processor. This comparison uses different conditions: it shows how Skymont cores perform when integrated with the L3 cache and the processor’s ring bus, as is the case in desktop processors (and apparently will be in Arrow Lake processors).

In this comparison, Skymont has on average 2% higher IPC than the big core of Raptor Lake processors in both integer and floating-point code. Again, this will vary application by application, with Skymont being worse in some code and better in other code. But it’s an interesting result. Gracemont was similarly able to catch up to the IPC of the Skylake architecture, a big core two generations older (six years in calendar terms). Skymont is only one generation behind in IPC.

However, as mentioned before, Skymont is almost certainly going to achieve much lower clock speeds than the big cores in Raptor Lake and Arrow Lake processors. How much lower remains to be seen. According to Intel’s charts, the maximum performance that can be squeezed out of Skymont will be lower than Raptor Lake’s P-Core (because of the clock speed), but it will peak at significantly lower power consumption. And over its full range of clock speeds, the Skymont core should always deliver a given level of performance at lower power consumption than the Raptor Lake P-Core would. The difference can be up to 40% of power consumption, or Skymont can be up to 20% faster when both cores run at a given power limit.

But of course, in real processors, Skymont will be paired with Lion Cove P-Cores, which should do better than Raptor Lake’s P-Cores. They will have higher IPC and thus higher maximum performance than Raptor Lake, and, thanks to the 3nm manufacturing node, also better power efficiency. But Skymont should be a good complement to them in providing multi-threaded performance (or battery life in mobile processors), so as a companion core it may have the lion’s share in making Arrow Lake and Lunar Lake processors a success.

Skymont promises interesting results for new big.LITTLE processors

Of course, the evaluation of the core has to wait for independent tests showing real-world performance and power characteristics. We should be able to know more after those are available in the third quarter for the Lunar Lake processor and in the fourth quarter for Arrow Lake. But from the currently available information, it looks like Intel has done a very good job with Skymont.

Sources: Intel, AnandTech, Tom’s Hardware

English translation and edit by Jozef Dudáš

