Cortex-A715: new efficiency-first ARM core (architecture analysis)

Cortex-A715: efficient middle core to the Cortex-X3 big

ARM has launched the Cortex-X3 architecture, which will be the new “big” high-performance CPU core for smartphones and laptops in the company’s lineup. It will, however, be complemented by a new Cortex-A715. This line of cores (Cortex-A76 to A78, then A710), which used to be ARM’s big cores, seems to be turning into cores tuned for the best performance within a small footprint and power draw, not unlike Intel’s E-Cores.

The ARM Cortex-A715 should be the core that was announced two years ago under the codename Makalu (and at that time it was promised to provide 30 % higher performance compared to the Cortex-A78 – you will see in a moment if it lives up to the hype).

Second generation ARMv9 cores (source: ARM)

The Cortex-A715 core uses the ARMv9 instruction set and is purely 64-bit, completely removing support for 32-bit code. This allowed it, like the Cortex-X3, to undergo various optimizations and changes. The concept of this core is based on the fact that ARM has split the lineup of large cores into two branches. The Cortex-X cores strive for the highest single-core performance possible and are willing to sacrifice some power efficiency (higher performance at the cost of significantly higher power consumption) and to put greater investment in transistor count (die area) to achieve this.

However, the A-line Cortex cores aim to have the best possible performance per power consumption and performance per area ratios. Like Intel’s E-Cores, they should be ideal for achieving high multi-threaded performance within a certain power limit (at least in the context of smartphones). It is no coincidence that ARM’s Neoverse N server architectures, designed for multi-core cloud processors, are derived from the Cortex-A line.

Changes in the core

The new features in Cortex-A715 are partly similar to the changes in Cortex-X3, but the middle core has most of its modifications limited to the frontend (while Cortex-X3 also increased the number of ALUs in the integer part from 4 to 6). The two co-developed lineups seem to be getting more differentiated now.

Cortex-A715 core schematic (source: ARM, via: ComputerBase)

Improved branch predictors

ARM focuses a lot on branch prediction and prefetch because they help power efficiency (while also increasing the IPC of the core), and both are improved again in the Cortex-A715. In general, prediction accuracy (success rate) is supposed to be better. The Direction Predictor has double the capacity and the branch history is supposed to be more precise.

The Cortex-A710 used two levels of branch prediction – a fast L0 prediction with a latency of 0 cycles and a more accurate L2 prediction with a latency of two cycles, which might eventually override (correct) the fast initial prediction. The Cortex-A715 instead has a three-level prediction, adding an intermediate level with a one-cycle latency and accuracy somewhere between the L0 and L2 predictors.
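To see why an intermediate predictor level can help, here is a toy model of an overriding predictor hierarchy. It is only a sketch under stated assumptions: the names `LevelPrediction` and `resolve` are made up for illustration, and the model assumes that a re-steer wastes as many fetch cycles as the latency of the level that corrected the direction. ARM has not published the mechanism at this level of detail.

```python
from dataclasses import dataclass


@dataclass
class LevelPrediction:
    latency: int  # cycles after fetch when this level's answer is available
    taken: bool   # predicted branch direction


def resolve(levels):
    """Return (final_direction, wasted_fetch_cycles) for one branch.

    Levels are consulted in order of latency. A slower level overrides the
    direction currently in effect only when it disagrees with it.
    Assumption: the cost of an override equals the overriding level's latency.
    """
    ordered = sorted(levels, key=lambda p: p.latency)
    current = ordered[0].taken
    wasted = 0
    for level in ordered[1:]:
        if level.taken != current:
            current = level.taken
            wasted = level.latency
    return current, wasted


# Two levels (Cortex-A710 style): the fast guess is corrected after 2 cycles.
two_level = resolve([LevelPrediction(0, True), LevelPrediction(2, False)])

# Three levels (Cortex-A715 style): the 1-cycle level catches the same error.
three_level = resolve([
    LevelPrediction(0, True),
    LevelPrediction(1, False),
    LevelPrediction(2, False),
])

print(two_level, three_level)  # (False, 2) (False, 1)
```

In this simplified picture, the middle level pays off whenever it is accurate enough to fix a wrong fast guess one cycle before the slowest predictor would.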

Cortex-A715 core, frontend changes (source: ARM, via: ComputerBase)

The performance of the predictors has also been increased. The core can predict two branches in one cycle and now it can also do so for conditional branches. This will help in code where branching occurs frequently. The performance for tag fetching in the L1 instruction cache has also been increased for such cases, which should help when multiple separate instruction sequences are fetched from the cache at the same time due to branching.

More decoders instead of uOP cache

A big change is in the instruction decoders. The fact that the core is purely 64-bit led to a significant area reduction of the instruction decoders. Reportedly, they are just a quarter of the size compared to the Cortex-A710, and they are also supposed to draw less power. We have already seen with Cortex-X3 that ARM has added an extra decoder due to this, and could even afford to shrink the uOP cache for already decoded instructions, the purpose of which is normally to save power by skipping the instruction decoding phase (and saving space as the core then needs fewer decoders).

Cortex-A715 adds a fifth instruction decoder but removes the uOP cache (source: ARM, via: ComputerBase)

In the Cortex-A715, ARM has apparently decided that the power consumption of the downsized decoders is low enough that the uOP cache is not needed any more, so this core removes it entirely and all instructions being processed go through full decoding. It thus goes back to how things were before the Cortex-A77, in which ARM introduced this improvement for the first time (on the other hand, the big Cortex-X3 core keeps using a uOP cache). Instead of the uOP cache, ARM has added another instruction decoder to the core, which means that the core now has five instead of four, and can therefore decode and send five instructions per cycle for further processing. In the previous core, the uOP cache performed the fusion of microOPs; this capability now seems to be performed before the fetch, at the L1 instruction cache level.

Eliminating the uOP cache will degrade performance a bit, but the larger decoding capacity should make up for it (the Cortex-A710 was able to send five instructions from the uOP cache, but only four directly from the decoders). In addition, all decoders are now complex and can handle all instructions. Previously, some of them used to be simpler and some of the more complex instructions could only be processed by selected complex decoders (this is also used by Intel, for simplicity).
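The trade-off between the two frontends can be put into rough numbers. The following is a back-of-the-envelope sketch, not a model of the real pipeline: the function name `sustained_supply` and the hit fractions are hypothetical, only the widths (five from the uOP cache, four from the decoders on the Cortex-A710, five decoders on the Cortex-A715) come from the text, and effects such as pipeline depth or fetch stalls are ignored.

```python
def sustained_supply(uop_cache_hit_fraction, uop_cache_width=5, decode_width=4):
    """Average instructions per cycle a Cortex-A710-style frontend can supply.

    A fraction of instructions comes from the uOP cache at `uop_cache_width`
    per cycle, the rest from the decoders at `decode_width` per cycle.
    The rates are combined per instruction (a harmonic mean), because the
    slower path takes more cycles for the same amount of work.
    """
    if not 0.0 <= uop_cache_hit_fraction <= 1.0:
        raise ValueError("hit fraction must be between 0 and 1")
    cycles_per_instruction = (
        uop_cache_hit_fraction / uop_cache_width
        + (1.0 - uop_cache_hit_fraction) / decode_width
    )
    return 1.0 / cycles_per_instruction


A715_SUPPLY = 5.0  # five full decoders, no uOP cache

for hit in (0.5, 0.8, 1.0):  # hypothetical hit fractions
    print(f"hit {hit:.0%}: A710 {sustained_supply(hit):.2f} vs A715 {A715_SUPPLY:.2f}")
```

Under these assumptions, the five-wide decode path matches the older frontend only in the ideal case of a 100 % uOP cache hit rate and is ahead otherwise, which is one way to read the claim that the extra decoder makes up for the missing cache.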

Improvements in the load-store part

On the other hand, the numbers of ALU and FPU execution units themselves have not changed. ARM has mostly kept the “backend” of this core the same and focused on making better use of its execution potential, which is what the previously described changes in the frontend are for. This probably also means keeping 128-bit vectors for SVE2 instructions (but this will also be related to the fact that in big.LITTLE configurations, the core continues to be paired with the old Cortex-A510 that uses this vector width).

Cortex-A715 core, changes in memory subsystem (source: ARM, via: ComputerBase)

However, the changes are in the load-store part – the goal is the same again, to make better use of existing execution resources by not waiting for data from memory. The L2 TLB has been increased by 50 %, but in addition, a single entry can store twice as many virtual addresses, so in some circumstances the effective capacity is tripled.
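The “tripled” figure is simply the two factors multiplied together. A small sketch of the arithmetic follows; the function name and the `coalescible_fraction` parameter are illustrative assumptions used to express “in some circumstances”, not values given by ARM.

```python
def effective_tlb_capacity(entry_growth=1.5, coverage_per_entry=2.0,
                           coalescible_fraction=1.0):
    """Relative L2 TLB reach versus the previous core (1.0 = unchanged).

    entry_growth:          50 % more entries, so 1.5
    coverage_per_entry:    an entry can hold twice as many virtual addresses
    coalescible_fraction:  hypothetical share of entries that can actually
                           make use of the doubled coverage
    """
    if not 0.0 <= coalescible_fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    average_coverage = 1.0 + (coverage_per_entry - 1.0) * coalescible_fraction
    return entry_growth * average_coverage


print(effective_tlb_capacity(coalescible_fraction=1.0))  # 3.0  (best case)
print(effective_tlb_capacity(coalescible_fraction=0.5))  # 2.25
print(effective_tlb_capacity(coalescible_fraction=0.0))  # 1.5  (worst case)
```

So the gain would range from 1.5× to 3×, depending on how often neighbouring pages can share an entry.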

Furthermore, ARM increased the number of banks that make up the data caches (probably both L1 and L2 cache?). This allows for a more efficient approach, where simultaneous reads and writes can be better combined, and also, fewer conflicts occur.
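Why more banks mean fewer conflicts can be illustrated with a toy example. This is a sketch only: ARM has not stated the bank counts or the interleaving granularity, so the values 4, 8 and the 8-byte `bank_width` below are purely hypothetical, as is the function name.

```python
from collections import Counter


def count_bank_conflicts(addresses, num_banks, bank_width=8):
    """Count accesses that must wait because their bank is already in use.

    Each address maps to a bank by simple interleaving. Within one cycle,
    a bank serves one access; every further access to that bank is a conflict.
    """
    banks = Counter((addr // bank_width) % num_banks for addr in addresses)
    return sum(n - 1 for n in banks.values())


same_cycle_accesses = [0, 32, 64, 8]  # byte addresses issued together

print(count_bank_conflicts(same_cycle_accesses, num_banks=4))  # 2
print(count_bank_conflicts(same_cycle_accesses, num_banks=8))  # 1
```

With twice as many banks, the same access pattern spreads out more and one of the two stalls disappears in this example.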

Also, the Load Replay Queue has been deepened so that multiple requests for data from the L2 cache can be processed simultaneously. The core should also have generally better prefetching, which improves CPU performance and limits the number of memory accesses needed.

Cortex-A715 schematic and performance curve (source: ARM, via: ComputerBase)

Slight improvement for performance, but a decent bump in efficiency

The result of these changes is a core that is not particularly groundbreaking in performance. According to ARM, it should achieve about 5 % better performance than the Cortex-A710 at the same power consumption using the same manufacturing process. However, if you merely equalize the performance to the level of the previous core, the Cortex-A715 is supposed to run as fast as the Cortex-A710 with 20 % less power consumption. But ARM also states that the Cortex-A715 is expected to achieve roughly similar absolute performance to the Cortex-X1, the “big” core from two years ago.
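Both of ARM’s figures can be expressed as performance per watt, which makes them easier to compare. The helper below is just the ratio of the two numbers quoted above; its name is made up for illustration.

```python
def perf_per_watt_gain(perf_ratio, power_ratio):
    """Relative performance per watt versus the Cortex-A710 (1.0 = equal)."""
    return perf_ratio / power_ratio


iso_power = perf_per_watt_gain(perf_ratio=1.05, power_ratio=1.00)  # +5 % perf
iso_perf = perf_per_watt_gain(perf_ratio=1.00, power_ratio=0.80)   # -20 % power

print(f"{iso_power:.2f}x, {iso_perf:.2f}x")  # 1.05x, 1.25x
```

In other words, 20 % less power at the same performance corresponds to 25 % better efficiency, which is a much larger step than the 5 % gained at the same power.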

Cortex-A715 core performance (source: ARM, via: ComputerBase)

However, it seems that ARM hasn’t quite met the goal advertised in 2020 that the A715 (Makalu) would have 30 % better single-threaded performance than the Cortex-A78. Compared to this two-year-old core, the Cortex-A715 only has about 15 % higher IPC (performance per 1 MHz) and the clock speed increase probably won’t make up the rest. Thus, the improvements of the new mid-range core/E-Core are primarily reflected in lower power consumption.
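How much clock speed would be needed to close the gap can be estimated quickly, assuming that performance scales as IPC times frequency (a simplification that ignores memory effects):

```python
TARGET_PERF_RATIO = 1.30  # goal announced for Makalu versus Cortex-A78
IPC_RATIO = 1.15          # approximate IPC gain of Cortex-A715 over Cortex-A78

# performance = IPC x clock, so the missing part has to come from the clock
required_clock_ratio = TARGET_PERF_RATIO / IPC_RATIO

print(f"required clock increase: {(required_clock_ratio - 1) * 100:.1f} %")
# required clock increase: 13.0 %
```

The Cortex-A715 would therefore need to run at roughly 13 % higher clocks than the Cortex-A78 to reach the promised 30 %.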

The curves show how the performance and power consumption of each ARM core scale (source: ARM, via: ComputerBase)

Roadmap and an improved version of the small Cortex-A510 core

Lately, ARM has been introducing a new large (and in recent years, large and medium) core architecture every year. This will continue in the coming years. The company has already revealed the codenames of the next two generations – the next Cortex-A7xx core will be “Hunter” and the one after that “Chaberton”. Nothing more has been disclosed about these yet, but at least we know that new or updated architectures will come in 2023 and 2024, as well.

ARM also displayed a similar roadmap for the line of so-called “little” power-efficient cores (such as the Cortex-A53, A55 and A510). There was no new little core introduced this year; as we mentioned, the Cortex-X3 and A715 will continue to be paired with the existing Cortex-A510 core. Its successor is said to be released only in 2023 and is codenamed “Hayes” (the A510 core is “Klein”). Thus, the next generation of this line will arrive at the same time as the Hunter core (and probably also the Cortex-X4).

ARM has refreshed the Cortex-A510 core (source: ARM, via: ComputerBase)

Still, ARM has something in lieu of a new small core. The Cortex-A510 core has undergone a minor incremental revision or refresh. This shouldn’t change its footprint, but it reportedly reduces its power consumption (and thus improves power efficiency) by about 5 %. ARM also claims a 5 % increase in maximum achievable clock speeds. Support for integration into up to 12-core SoCs has also been added. New ARM processors with Cortex-A715 and possibly Cortex-X3 will probably use this updated, slightly more efficient version of the Cortex-A510 cores instead of the original ones.

Sources: WikiChip, ARM

English translation and edit by Jozef Dudáš, original text by Jan Olšan, editor for Cnews.cz

