AVX10 seeks to replace AVX-512, will work on big.LITTLE CPUs

Big change for SIMD instructions on Intel processors, but there may be downsides

Yesterday we reported that x86 processors – at least those from Intel – will get Advanced Performance Extensions (APX), a major change in the programming of these CPUs. But Intel is also planning big changes to SIMD instructions with AVX10 technology, which will unify vector extensions whose subsets are now a mess, and should also address the problem of big.LITTLE processors. But this may in turn kill the true 512-bit SIMD capability of AVX-512.

AVX10 instructions were announced together with APX, but they are not part of it, instead being a separate extension that is an evolution or modification of today’s SIMD AVX-512 instructions. So what are these instructions all about?

AVX10

First of all, AVX10 combines all the AVX-512 subsets introduced so far, of which there are many, so software developers will be able to treat them as one guaranteed group (however, some seem not to have been included – for example, the VP2Intersect instructions, present only in Tiger Lake processors and quite possibly dead from now on). AVX10 in its first version, AVX10.1, will largely be a renaming of AVX-512, but with one significant change – it allows processors to work with only 256-bit wide (or even just 128-bit wide) operations instead of 512-bit instructions.

This is already the case with AVX-512 to a degree: processors already support versions of the normally 512-bit AVX-512 instructions that are only 256 bits or 128 bits wide. At that point they are at the level of AVX(2) or SSEx in raw performance, but they retain the new features brought by AVX-512 – most importantly mask registers, Gather/Scatter and various other things, including 32 architectural vector registers, whereas AVX(2) provides 16 YMMx registers and SSEx only 8 XMMx registers. So the effect will be similar to APX, which adds additional registers to the general-purpose instructions executed in integer ALUs.
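To make the mask-register feature concrete, here is a minimal sketch in plain Python (not real intrinsics) of the semantics of an AVX-512-style masked add: each bit of the mask decides whether a lane receives the computed result or keeps the destination's old value (so-called merge-masking), which is what lets compilers vectorize loops with conditions in them.

```python
# Illustrative sketch, plain Python: semantics of a merge-masked vector add.
# Each mask bit selects whether a lane gets the new result or stays unchanged.
def masked_add(dst, a, b, mask):
    """Per-lane: result[i] = a[i] + b[i] if mask bit i is set, else dst[i]."""
    return [a[i] + b[i] if (mask >> i) & 1 else dst[i]
            for i in range(len(dst))]

# 8 lanes, as in a 256-bit vector of 32-bit elements (one YMMx register).
dst  = [0] * 8
a    = [1, 2, 3, 4, 5, 6, 7, 8]
b    = [10] * 8
mask = 0b00001111  # only the low four lanes are active

print(masked_add(dst, a, b, mask))  # [11, 12, 13, 14, 0, 0, 0, 0]
```

The same predication works at any vector width, which is why the feature carries over unchanged to the 256-bit-only AVX10 implementations discussed below.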

These improvements make SIMD instructions more flexible, making it easier to optimize different algorithms – both for the developer and for autovectorization performed by a compiler. Thus, using such half- and quarter-size versions of AVX-512 is advantageous over AVX(2) and SSEx – they are easier to code with and sometimes better performance can be achieved, although the raw compute throughput provided by the number of units and their width is unchanged. According to Intel, it should be possible to convert code from SSEx or AVX(2) to AVX10 simply through recompilation (although this may not optimally exploit the performance potential).

However, these smaller versions of the AVX-512 instructions have had one major weakness so far: processors currently can never support just the narrower operations alone. The way Intel designed their CPUs, the narrower capability is always coupled with the presence of the full 512-bit AVX-512 registers (ZMMx) in the processor, which also implies support for the full basic AVX-512. Accordingly, Intel does not support the half-size 256-bit AVX-512 instructions on their E-Core architecture, which, lacking ZMMx registers, can only support AVX2. And as we know, this then led to the need to disable AVX-512 even on the big P-Cores.

AVX10 will change this – it will support implementations in CPUs that only provide 256-bit registers (these CPUs will, however, have twice the number of architectural registers compared to past CPUs with 256-bit AVX registers, i.e. YMMx). Even if the processor has registers and SIMD units just this wide, it will still be able to use those half-size versions of instructions and will be able to declare AVX10 support. Purely 128-bit support should even be possible, but that may not be used in any CPU design anytime soon.

(Almost) AVX-512 available to big.LITTLE processors

Intel has, in other words, isolated the 256-bit version of AVX-512 instructions, which will be usable even where 512-bit full-power instructions are not. More specifically, in Intel’s future big.LITTLE processors. Indeed, the 256-bit version of AVX10 will be supported on efficient “little” (though in reality, it’s more fitting to call them medium) E-Cores. And because of this, it will be possible to enable AVX10 support on large P-Core cores as well.

However, P-Core architectures will also get just the 256-bit version. So there will still be a degradation of possible performance for them versus the theoretical scenario where they would keep supporting full-fat AVX-512. But in practice it is an improvement: the 512-bit width could not be used in hybrid CPUs at all, whereas the 256-bit AVX10 will be usable, and will therefore improve performance in multi-threaded applications.

512 bits now only on Xeons

That said, AVX10 will also continue to support a 512-bit wide version that will be fully equivalent to today’s AVX-512 in capability and performance. However, this support is stated to be optional, unlike support for 128-bit and 256-bit instructions. Intel has said that these instructions will only be supported on processors composed entirely of big cores (P-Cores). Presumably this means that they will only be enabled on Xeon processors for servers and workstations.

AVX10 instruction roadmap and feature comparison with AVX2 and AVX-512 (source: Intel)

This is expected to continue in future iterations of the AVX10 instructions. The first version, AVX10.1, is apparently to be exposed to users for the first time on Xeon Granite Rapids processors, which will have that optional 512-bit support. This version is a kind of introductory one, basically just a renaming of AVX-512. Only the next version, AVX10.2, will actually be featured on processors that have only the narrower versions of these instructions, and it is only this AVX10.2+ version that will appear in big.LITTLE processors for PCs. We don’t yet know whether it will first appear in the Arrow Lake architecture, the Lunar Lake architecture, or exactly which generation. It seems that AVX10.1 will not yet replicate 100% of the functionality offered by 512-bit instructions on 256-bit instructions (the missing pieces should be minor things forced by the way the instructions are currently encoded – an example is the “embedded rounding” functionality, currently unavailable on 256-bit ops due to encoding conflicts). These missing bits should be fully handled in AVX10.2. Also, AVX10.2 is already supposed to again start introducing new operations that were not yet present in AVX-512.

Innovation, or playing whack-a-mole with entirely self-inflicted problems?

AVX10 doesn’t seem to be as big a deal as APX. Rather than entirely new functionality, it’s more about changing the instruction encoding and relaxing some previously existing restrictions, more arbitrary than fundamental. And it can be said that instead of innovating and coming up with new ideas, Intel is basically just wrangling problems they have themselves previously created by introducing their concept of hybrid (big.LITTLE) processors mixing P-Cores and E-Cores. Also, while APX seems, at least at first glance, to be an obviously good and beneficial idea, there are reasons for doubt in the case of AVX10.

Indeed, one could immediately argue that there was a better alternative approach Intel could have taken: simply investing the effort to enable the 512-bit ZMMx registers needed for AVX-512 on the E-Core architecture – just like AMD did with their Zen 4 and Zen 4c cores. This would have solved the problem elegantly and immediately, without forcing software developers, compilers and everything else involved to do extra work. The SIMD units of the E-Cores could still be physically narrower and process 512-bit operations in multiple cycles, so the cost would still not be big (even though more silicon would be needed compared to the AVX10 256-bit approach). But if this were done, E-Cores would not only run the same instructions and the same code as P-Cores, it would also be the same code that already works on many existing processors sold from 2017 onwards. Also, P-Cores would not have to suffer performance degradation and could continue to have full 512-bit implementations. Overall performance in multi-threaded applications would be better than with AVX10.
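The "narrower units, multiple cycles" approach mentioned above – often called double-pumping – can be sketched in plain Python: a 512-bit operation is architecturally a single instruction, but the hardware issues it as two passes over 256-bit halves, roughly as Zen 4 does.

```python
# Illustrative sketch of double-pumping: a 512-bit vector add executed on
# hardware whose physical SIMD units are only 256 bits wide, by splitting
# the operation into two half-width passes over consecutive lanes.
def add_256(a, b):
    """One pass of a 256-bit unit: 8 lanes of 32-bit elements."""
    return [x + y for x, y in zip(a, b)]

def add_512_double_pumped(a, b):
    """A full 512-bit add (16 lanes) issued as two 256-bit halves."""
    lo = add_256(a[:8], b[:8])   # first cycle: lower half of the ZMM register
    hi = add_256(a[8:], b[8:])   # second cycle: upper half
    return lo + hi

a = list(range(16))
b = [100] * 16
print(add_512_double_pumped(a, b))  # [100, 101, ..., 115]
```

Software sees full 512-bit registers and identical results either way; only the throughput per cycle differs, which is why this route would have cost little while keeping one codepath.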

An image of the Intel Alder Lake chip. Eight large P-Core cores are in the middle, and two quad-core E-Core clusters can be seen to the right of them. A horizontal ring bus runs in the center of the chip, together with L3 cache blocks. Each cluster with 4× E-Core is coupled to the same L3 cache block as one individual P-Core (Source: Intel)

Perhaps most importantly, developers would not have to write separate code for 512-bit versions of instructions and separate code for 256-bit versions, and then integrate both versions into the application. Doing this is already extra work on top of the code written for AVX2-only processors (and probably an SSEx codepath for even older processors, or not-so-old low-end Intel processors). AVX10 has created even more fragmentation with its approach: the 256-bit AVX10 operations will have to be yet another extra codepath in programs, again. Those developers who invested time in optimizing software with AVX-512 instructions will no doubt be bursting with joy at these prospects (or buying AMD processors).

Unfortunately, there’s also the possibility that Intel will outright phase out 512-bit instructions, and could even remove them from the P-Core Xeon CPUs someday. In fact, Intel states that today’s AVX-512 instructions are now “legacy”, to be supported “for the foreseeable future”, but not necessarily forever. So Intel may one day push developers to rewrite or recompile their code for AVX10, and remove support for the original AVX-512 from future CPUs. We do hope this doesn’t happen and that the company still sees benefit in 512-bit wide vector ops (they clearly help Xeon processors today, though it may not be to Intel’s liking that they also help current AMD processors), but the scenario is far from implausible.

Such abandoning of wider vector instructions can already be seen in the case of ARM. The company came out with the SVE SIMD extension years ago (2016) and later SVE2 (which greatly expanded the integer and DSP operations), which aimed to abstract the vector register width away, allowing vectors up to 2048 bits wide that can however be executed by far narrower units (meaning the execution is split over many cycles). It’s just that when the first Neoverse and Cortex cores supporting SVE came along many years later, ARM reverted to narrow, just 128-bit SIMD units, so the compute throughput is merely somewhere on the level of the old NEON instructions SVE(2) was supposed to replace. The Neoverse V1 architecture was the only one to have 256-bit units (but it only supports the original SVE, as SVE2 is not included), and then the next-generation V2 reduced them to 128-bit width as well. We don’t know for sure that this isn’t just a temporary sacrifice; perhaps ARM will one day try to be more ambitious again and bring 256-bit and 512-bit execution units to boost compute potential to the levels provided by AVX(2) and AVX-512. For now, though, it looks like the company is giving up on higher SIMD widths.
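SVE's vector-length-agnostic idea can be sketched in plain Python: the same loop body works whether the hardware vector holds 4 lanes or 64, because each iteration advances by whatever width the implementation happens to provide. (The function name and structure here are illustrative, not real SVE code.)

```python
# Illustrative sketch of a vector-length-agnostic loop in the SVE style:
# the loop does not hard-code the vector width, so the identical "binary"
# runs on 128-bit and 2048-bit implementations alike.
def vla_sum(data, lanes_per_vector):
    """Sum an array by consuming one hardware vector's worth per iteration."""
    total = 0
    i = 0
    while i < len(data):
        chunk = data[i:i + lanes_per_vector]  # partial final chunk is fine;
        total += sum(chunk)                   # real SVE handles it via predicates
        i += lanes_per_vector
    return total

data = list(range(10))
# Same result regardless of the implementation's vector width:
print(vla_sum(data, 4), vla_sum(data, 16))  # 45 45
```

The catch the article describes is that the abstraction only pays off in throughput if implementations actually ship wide units – with 128-bit hardware, the loop is correct but no faster than NEON.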

Hopefully this will not be the overall direction CPU development takes, because we would lose one of the avenues for further increasing per-thread CPU performance – one that is very effective for example in multimedia applications, but potentially also in games. It is true there are voices that believe wider vectors have limited benefits, for example Linus Torvalds, who considers 256-bit width very likely to be optimal. However, this may also be influenced by the fact that his core work is outside the software that benefits most from SIMD – operating system kernels have limited use for it, often using SIMD instructions outside their most natural applications, for example just to improve memcpy/memset performance. In the end, it’s hard to predict where this will all go.

Simpler labelling and versioning

To be less negative, what is quite beneficial about AVX10, even if it doesn’t directly affect how the silicon works, is the change in how future extensions will be handled. A large part of the critical attitude towards AVX-512 was due to the fact that these instruction set extensions are fragmented into a large number of subsets, each with its own individual flag that code must detect. And, unfortunately, sometimes different processors brought in new extensions that didn’t overlap easily, and sometimes CPUs missed some of the previous subsets. The selection of AVX-512 instruction sets a particular processor supported made for some pretty complex graphs (though admittedly, the worst mess was mainly due to the existence of Xeon Phi).

This is how the number of different AVX-512 instruction sets has been gradually increasing. AVX10 will switch to numerical versioning and the chaos will be much reduced (source: Intel)

AVX10 will probably also add new extensions in the future, but Intel says there will be more order in those future iterations. New additions will be identified by version numbers (e.g. AVX10.2, AVX10.3…). And going forward, a higher version number is always supposed to include all subsets that were in previous lower versions. So it should not happen that a newer CPU adds some instructions but removes others, and it will be easier for programmers to detect what instructions the CPU can do and choose the correct version of SIMD functions in the code based on that.
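Because every AVX10 version is a strict superset of the previous one, runtime dispatch collapses to a single integer comparison. The sketch below shows this model in plain Python; the kernel names and the idea of a version-keyed table are illustrative assumptions, not an Intel API.

```python
# Illustrative sketch: with strictly cumulative versioning, picking a codepath
# is just "newest kernel whose required version the CPU meets" – no need to
# test dozens of independent feature flags as with AVX-512 subsets.
KERNELS = {
    1: "avx10_1_kernel",   # baseline AVX10.1 codepath (hypothetical name)
    2: "avx10_2_kernel",   # adds AVX10.2-only operations (hypothetical name)
}

def select_kernel(cpu_version, kernels=KERNELS):
    """Return the newest kernel the reported AVX10 version can run."""
    best = None
    for required_version, name in sorted(kernels.items()):
        if cpu_version >= required_version:
            best = name
    return best

print(select_kernel(1))  # avx10_1_kernel
print(select_kernel(2))  # avx10_2_kernel
print(select_kernel(0))  # None – fall back to an AVX2/SSE codepath
```

Contrast this with AVX-512, where a CPU could support AVX-512BW but not AVX-512VBMI (or vice versa), so no single ordering of codepaths was safe.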

Support from AMD: No information yet

As mentioned, AVX10 is coming first in Granite Rapids processors, with an as-yet-unspecified generation of Core processors for laptops and PCs to follow later. We don’t yet know when, if ever, AMD is going to adopt AVX10. The competing vendor tends to implement the vast majority of new instructions invented by Intel (which makes sense given Intel’s dominant market position). But this isn’t always true – for example, TSX and SGX never appeared in AMD processors, although that might also have something to do with the recurring problems with those extensions.

In this case, we could see AMD not having much of an issue with AVX10 support – even if it were in parallel with supporting the existing AVX-512 – so perhaps AMD could automatically join Intel. But since Intel, as the author, has a head start on implementation in their CPU cores, it’s not unlikely for AVX10 to appear in Ryzen and Epyc CPUs only years after Intel launches it.

Sources: Intel (1, 2, 3), AnandTech

English translation and edit by Jozef Dudáš

