Intel APX: x86 ISA upgrade to catch up with newer architectures

x86 gets the benefits of newer instruction sets without losing backwards compatibility

Recently, Intel announced the so-called x86-S architecture – a proposal to cut legacy baggage from the x86 processor platform. Now another modernization effort for this traditional instruction set used in personal computers has been announced. Future CPUs will get Intel APX (Advanced Performance Extensions), which breaks through some of the historical limits of this 40-year-old ISA and brings it somewhat closer to newer architectures.

APX: More general purpose registers

The APX or Advanced Performance Extensions apply to virtually all programs, which makes them different from extensions that add SIMD or other specialized instructions (Intel is planning new features in those areas as well, but we’ll cover that separately). APX is essentially a revamp of how all CPU instructions work, but it’s added as an extension in a way that can coexist alongside legacy applications written for x86 and x86-64. APX is enabled per instruction: placing the newly introduced REX2 prefix before an instruction makes it execute in APX mode.

Code running with the APX extension gets access to twice as many general purpose registers (GPRs), 32 versus only 16 available in the x86 64-bit mode. These are the Extended GPRs designated R16 to R31.

Sixteen registers was fewer than, say, ARM or Power provide (with APX, x86 pulls equal with them), but even those 16 were an upgrade over the original 32-bit x86 ISA, which only provided eight general purpose registers. The availability of more working registers reduces the need to load data from RAM into registers and store it back when running out of registers to use. Thus, increasing the number of registers should improve processor performance per clock. According to Intel, 10% of load operations and 20% of store operations are typically eliminated when a program is compiled in APX mode.

APX: 3-operand format

The second change brought by APX is that all common integer instructions can behave as three-operand instructions. The original x86 instruction set worked in such a way that operations had only two operands, i.e., two registers that the instruction could specify as inputs. It was not possible to specify a third register (operand) for the result to be stored in, so the result instead always overwrote one of the inputs. This was likely originally a means to shrink binary size (important for fitting code into limited memory and, later, the instruction cache), among other reasons, but it forces the programmer or compiler to insert extra instructions copying values between registers (MOV instructions) whenever an overwritten input needs to be preserved for reuse somewhere else.

In APX mode with three-operand instructions, these unnecessary MOVs copying data between registers disappear from the code. According to Intel, at least in some of the cases the company uses as reference, this should compensate for the increase in code size caused by the extra prefixes. Prefixes increase the average instruction length, but in return there are about 10% fewer instructions in the resulting binary. This three-operand mode is activated by the EVEX prefix (which was already introduced with the AVX-512 instructions).

Historic Pentium 4 processor. Chip with Willamette core for the 423 socket (source: ExtraHardware)

While the added register resources and the three-operand format have the described efficiency impacts, it should be noted that the performance gains may not always be large. In fact, processors have learned to work with and around these limitations over the decades. The two-operand instruction problem is circumvented by the MOV Elimination technique, which allows the added MOV operations to be skipped (more precisely, they are simply performed at the physical register file level in the register renaming phase, where it’s just a matter of reassigning pointers). In this way, several MOVs can be executed per cycle without occupying the ALUs of the processor’s execution backend. These operations can therefore often be “free” for the programmer, as long as they do not overwhelm the CPU’s throughput in the preceding pipeline stages (fetch, decode…). MOV Elimination and three-operand operations are largely competing optimizations, both benefiting performance by eliminating the same inefficiency.

Similarly, the excessive data movement between the CPU and RAM (or cache), or at least its performance impact due to the limited number of registers, is in turn mitigated by techniques such as store-to-load forwarding and register renaming, where the CPU internally operates with a much larger number of usable registers than the instruction set defines. These optimizations are applied automatically by the CPU, and thus already deliver some of the performance gains that APX would extract on its own.

So the added benefit of the new mode may often not be that big, although there will probably still be some to be had (judging by the fact that Intel is bothering with this extension). On the other hand, it’s absolutely true that APX does make up for some deficits of the old x86 instruction set that were (despite the mitigations just mentioned) rightly considered weaknesses and criticised, and that is certainly a positive. Plus, thanks to the (admittedly complicated) implementation via prefixes, all this is achieved without having to break compatibility and start from scratch with a new instruction set.

Historic x86 processors from AMD and Intel (photo: Cnews)

Load-store pair and conditional instructions

In addition to this, APX adds even more new features. The addition of PUSH2/POP2 instructions is probably inspired by the ARM instruction set. These store to memory or load from memory the contents of two registers at once (using adjacent memory locations) as a single instruction. This is one of the ideas used by the ARMv8 architecture, and while it doesn’t really fit the orthodox RISC concept, these operations have proven very useful because code can achieve the same amount of work with fewer instructions. So now it will be possible to use something similar on x86 processors.

APX also adds conditional versions of some instructions. Such instructions have been present since the Pentium Pro in the form of CMOV/SETcc, but now the options will be broadened by conditional versions of load, store and compare/test operations. In addition, the writing of flag bits by some common instructions can be suppressed. These optional behaviors will again be selected using a prefix added before an instruction (e.g. a load or compare) – in this case, again the EVEX prefix.

Intel couldn’t help taking a dig at critics of the x86 instruction set in their APX presentation. x86 is often bashed, for example, for making parallel instruction decoding complex if not impossible: instead of the constant instruction length (typically 32 bits) used by RISC architectures, it has variable-length instructions, a legacy of the classic CISC processor era of the 1970s and 1980s. However, it is this very feature (bug?) that allows Intel to easily add different prefixes before instructions, as APX does, and thus change their operation – something that on a RISC architecture (or rather, one with fixed 32-bit instructions) would more likely require defining altogether new instructions. It has to be said that the x86 instruction set has already been using prefixes generously, and they bring their own complexity and messiness, so it’s questionable to what extent this is something to brag about.

Compatibility

However, the advantage of the method Intel is using is that it should be possible to freely mix code using only the traditional form of instructions with the new mode on a processor supporting APX. A programmer can take advantage of APX purely through automatic recompilation of existing code, but in many algorithms it will probably be possible to get further speedups by rewriting (manually optimizing) code directly for the new situation.

For example, math libraries or the FFmpeg backend and similar computationally intensive components could be compiled and hand-optimised for APX, yet interact normally with software compiled the old-fashioned way – or even with old software that has no idea APX exists. It should be possible to mix individual instructions working without prefixes in traditional mode with instructions executed in APX mode, because APX prefixes seem to modify single instructions rather than switching the CPU’s working mode globally.

But for such code to run on older CPUs, it will have to detect support for the extension via CPUID, and the binary will have to include an alternate codepath for non-APX CPUs.

Intel Meteor Lake-U processor presented at Vision 2022 event (source: PC Watch)

When?

It is not yet clear in which processors APX will first appear. Intel hasn’t yet indicated which future generation will implement it first, but the first might be Granite Rapids in server Xeons and possibly the Meteor Lake and/or Arrow Lake architectures in Core client processors. But it could also come in later architectures. Intel will probably reveal this when it starts to unveil those processors in more detail. We might learn something during the Intel Innovation event in September (September 19–20).

However, you can already read the instruction specifications in this document.

Another question is when, and whether, support for this extension will also be added by competing x86 processor makers – essentially AMD for the most part, but there’s also China’s Zhaoxin. Since Intel created the specification and had probably already been working on an implementation before its release, the company will naturally have a head start. Therefore, implementation in AMD processors may come quite late compared to Intel’s.

Source: Intel

English translation and edit by Jozef Dudáš

