ARM unveils record-breaking Cortex-X4 core with eight ALUs

Cortex-A720: The new medium core against Intel's E-Core

ARM has lately been releasing a new generation of processor cores every year, and this year is no different (apart from the Computex timing). ARM has unveiled a complete line of new architectures: a new Cortex-X4 “prime” big core for maximum single-threaded performance, a new medium Cortex-A720 core whose role is to provide multi-threaded performance (like Intel’s E-Core), and finally a new low-power Cortex-A520 little core.

The Cortex-A715 also gets a replacement, named A720. This class of cores previously served as big cores, but was optimized less for single-threaded performance than for efficiency and chip area. Once the job of boosting single-threaded (1T) performance shifted to Cortex-X, these cores of the Cortex-A line became specialized for multi-threaded performance, as already mentioned. So you can think of them as a counterpart to Intel’s E-Core (currently the Gracemont architecture). In cheaper mobile SoCs, however, the Cortex-X4 may be missing and the Cortex-A720 will be the big core.

With the Cortex-A720, ARM hasn’t made changes to the previous Cortex-A715 design as big as the ones you could see in the previous chapter; it’s more of an evolutionary improvement. The core focuses mainly on improving efficiency (in terms of power draw, but also in terms of chip area and cost). It also adds support for the ARMv9.2 instruction set.

The Cortex-A720 apparently shortened its pipeline the same way as the Cortex-X4, as the branch misprediction penalty was reduced from 12 to 11 cycles. Branch prediction has also been improved again, though probably not to raise performance but to make the processing of branches more efficient in terms of power draw (presumably without any negative impact on performance). Like the Cortex-X4, the A720 core has no micro-op cache, but in this case it had already been removed in the previous A715 core.
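To get a feel for what a one-cycle shorter misprediction penalty is worth, here is a minimal back-of-the-envelope sketch. The 12- and 11-cycle penalties come from the text above; the branch fraction and misprediction rate are hypothetical values chosen only for illustration, not ARM figures, and the simple model ignores effects such as overlapping stalls.

```python
def mispredict_cost_per_instruction(branch_fraction, mispredict_rate, penalty_cycles):
    """Average cycles per instruction lost to branch mispredictions.

    Simple first-order model: every mispredicted branch costs the full penalty.
    """
    return branch_fraction * mispredict_rate * penalty_cycles


# Hypothetical workload characteristics (assumptions, not ARM data).
BRANCH_FRACTION = 0.20   # one in five instructions is a branch
MISPREDICT_RATE = 0.02   # 2% of branches are mispredicted

# Penalties quoted in the article: 12 cycles (A715) vs. 11 cycles (A720).
old_cost = mispredict_cost_per_instruction(BRANCH_FRACTION, MISPREDICT_RATE, 12)
new_cost = mispredict_cost_per_instruction(BRANCH_FRACTION, MISPREDICT_RATE, 11)

print(f"A715-like penalty: {old_cost:.4f} cycles lost per instruction")
print(f"A720-like penalty: {new_cost:.4f} cycles lost per instruction")
print(f"Saved:             {old_cost - new_cost:.4f} cycles per instruction")
```

Under these assumed numbers the saving is small in absolute terms, which fits the picture of an evolutionary rather than a revolutionary update.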

The Cortex-A720 has also switched to a pipelined floating-point divider like the X4 core, so this core will also see an improvement in FDIV instruction performance. Pipelining has also been added for floating-point square-root calculation (FSQRT). The result is a speed increase in these operations (probably in both latency and throughput, that is, how many operations the unit can handle in a given number of cycles). At the same time, according to ARM, there is no significant increase in the area of the divider.
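The difference between latency and throughput in a pipelined unit can be illustrated with a small sketch. It compares how long a batch of independent divides takes on a unit that must finish one operation before starting the next versus one that can accept a new operation every few cycles. The latency and initiation-interval values below are hypothetical; ARM's actual FDIV/FSQRT figures are not given in the text.

```python
def cycles_unpipelined(n_ops, latency):
    """Total cycles when each operation must finish before the next one starts."""
    return n_ops * latency


def cycles_pipelined(n_ops, latency, initiation_interval):
    """Total cycles when a new operation can start every `initiation_interval` cycles.

    The first result appears after `latency` cycles; each further independent
    operation adds only one initiation interval.
    """
    if n_ops == 0:
        return 0
    return latency + (n_ops - 1) * initiation_interval


# Hypothetical numbers for illustration only.
N_OPS = 8        # independent divides in flight
LATENCY = 10     # cycles until one result is ready
INTERVAL = 2     # cycles between accepting new operations

print("unpipelined:", cycles_unpipelined(N_OPS, LATENCY), "cycles")
print("pipelined:  ", cycles_pipelined(N_OPS, LATENCY, INTERVAL), "cycles")
```

Note that pipelining only helps when the operations are independent; a chain of divides where each depends on the previous result is still bound by latency.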

The FPU also gets faster forwarding of values from floating-point and SIMD (Neon, SVE/SVE2) registers to general-purpose integer registers. Thus, it takes less time before the results of these instructions are available for further processing outside the FPU. The data forwarding network to the AGUs performing memory (and cache) writes has also been improved, as has the operation of the load/store queues.

The Cortex-A720 also has a faster L2 cache, with a latency of only 9 cycles compared to 10 for the previous Cortex-A715. Its data bandwidth has also been improved for at least some types of operations: ARM claims that memset operations in the L2 cache run up to 2x faster.
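As a rough illustration of how a one-cycle faster L2 feeds into overall memory performance, the sketch below uses the textbook average memory access time (AMAT) formula. Only the 10- and 9-cycle L2 latencies come from the text; the L1 hit time, the miss rates and the memory latency are hypothetical placeholders, not measurements of either core.

```python
def amat(l1_hit_cycles, l1_miss_rate, l2_latency_cycles, l2_miss_rate, memory_cycles):
    """Average memory access time for a two-level cache hierarchy.

    AMAT = L1 hit time + L1 miss rate * (L2 latency + L2 miss rate * memory latency)
    """
    return l1_hit_cycles + l1_miss_rate * (
        l2_latency_cycles + l2_miss_rate * memory_cycles
    )


# Hypothetical parameters (assumptions for illustration only).
L1_HIT = 4          # cycles
L1_MISS_RATE = 0.05
L2_MISS_RATE = 0.20  # fraction of L2 accesses that miss
MEMORY = 200        # cycles to reach the next level

amat_a715 = amat(L1_HIT, L1_MISS_RATE, 10, L2_MISS_RATE, MEMORY)  # 10-cycle L2
amat_a720 = amat(L1_HIT, L1_MISS_RATE, 9, L2_MISS_RATE, MEMORY)   # 9-cycle L2

print(f"AMAT with 10-cycle L2: {amat_a715:.2f} cycles")
print(f"AMAT with  9-cycle L2: {amat_a720:.2f} cycles")
```

With these assumed rates, the gain per access is modest, and it grows with the share of accesses that miss in L1 but hit in L2.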

This core also has improved prefetchers. Together with branch predictors, prefetching is an area that is continuously improved in virtually every new generation and has a direct impact on IPC. The core should now have spatial prefetching into the L2 cache, which was previously only present in Cortex-X cores.

Slightly better performance, but mainly efficiency

According to ARM, the Cortex-A720 core should be 1–13% faster than the Cortex-A715, and this will again vary from task to task (on average, the improvement will probably only be around 5%). This is stated for cores made on the same manufacturing process. Efficiency is supposed to go up a bit more than raw performance, reportedly by some 6% on average on the same process (but in practice this too will vary from application to application). You can see this variance in the chart for the SPEC benchmark tasks.

Performance and efficiency improvements for Cortex-A720

ARM offers several configuration options for this core. In addition to the more powerful option, there is also supposed to be a version that takes up roughly the same die area as the Cortex-A78 (a 2020 design) when implemented. This configuration has lower performance but is still supposed to be 10% faster than the older A78 core. It is intended for SoCs in inexpensive phones that still use old cores like the Cortex-A76 and the A78. This stripped-down version of the Cortex-A720 could prompt their manufacturers to finally switch to the newer architecture with the ARMv9 instruction set.

The article continues on the next page.

