ARM unveils record-breaking Cortex-X4 core with eight ALUs

New little core: Cortex-A520

ARM has been releasing a new generation of processor cores every year lately, and this year is no different (except for the Computex timing). ARM has unveiled a complete line of new architectures: a new Cortex-X4 “prime” big core for maximum single-threaded performance, a new medium Cortex-A720 core whose role is to provide multi-threaded performance (like Intel’s E-Core), and finally a new low-power Cortex-A520 little core.

In the last generation, ARM did not introduce a new little core; the bigger cores were still paired with the previous-generation Cortex-A510. Now ARM is fixing that with a new little Cortex-A520 core, which promises fairly large improvements – but once again in efficiency rather than speed. These cores will remain at a relatively low performance level. The Cortex-A520 is still an in-order architecture, so ARM is not following Apple, which designs its little cores as more powerful out-of-order cores with non-trivial performance.

The Cortex-A520 is not a totally new design, though. It still builds on the Cortex-A510 architecture and retains the characteristic shared FPU. It supports ARMv9.2 instructions and is a pure 64-bit core, so it offers no legacy compatibility with 32-bit applications. Like the previously described cores, the Cortex-A520 supports the new QARMA3 algorithm for the Pointer Authentication security feature.

While the Cortex-X4 core became larger and the Cortex-A720 focused on efficiency optimizations while staying roughly the same width, the Cortex-A520 has actually been slimmed down in the name of efficiency. The Cortex-A510 had raised the number of ALUs to three, compared to the two used in the older Cortex-A55, but the Cortex-A520 reverses this and returns to only two ALUs. This simplifies the issue logic and the data communication between the units, which saves power. Performance will probably drop too, but by less than the power draw does. ARM has offset some of the performance drop with other modifications that cost less than the third ALU.

In addition to the two simple ALUs, the core has one unit dedicated to integer division and multiplication and one unit for branching. Then the core has one load/store and one pure load unit, so it supports either two reads or one read and one write to memory per cycle.
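The memory-pipe constraint described above can be written down as a small feasibility check. This is only an illustrative sketch: the function name `can_issue_memory_ops` and the simplified two-pipe model (one load/store pipe, one load-only pipe) are assumptions made for this example and do not represent ARM's actual issue logic.

```python
def can_issue_memory_ops(loads: int, stores: int) -> bool:
    """Return True if the given mix of memory operations fits in one cycle.

    Simplified model (an assumption, not ARM's documented issue logic):
    - one pipe can handle either a load or a store,
    - one pipe can handle loads only.
    """
    if loads < 0 or stores < 0:
        raise ValueError("operation counts must be non-negative")
    # Only the load/store pipe accepts stores, so at most one store per cycle.
    if stores > 1:
        return False
    # There are two memory pipes in total, so at most two operations per cycle.
    return loads + stores <= 2


# Combinations named in the text: two reads, or one read and one write.
print(can_issue_memory_ops(loads=2, stores=0))  # True
print(can_issue_memory_ops(loads=1, stores=1))  # True
# Two writes do not fit, because only one pipe can store.
print(can_issue_memory_ops(loads=0, stores=2))  # False
```

Under this model, the asymmetry between the two pipes is what rules out two writes in one cycle while still allowing two reads.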

But ARM doesn’t say much about the other changes. Branch prediction and prefetch have been improved again; these are components that improve the efficiency of any core and allow better utilisation of its execution resources. (AMD also focuses a lot on these areas, and they are probably the reason why Zen 4 needs fewer ALUs and other units to achieve about the same IPC as Intel cores with significantly more powerful execution resources and longer out-of-order queues.)

The core has retained the same L1 cache capacities (a 32 KB or 64 KB option for both data and instructions). The shared FPU and SIMD unit, reminiscent of the FlexFPU solution in AMD’s Bulldozer, Piledriver, Steamroller and Excavator architectures, also seems to have been retained in more or less the same form as in the Cortex-A510. While the default approach is for the FPU to be shared, there is also an option to replicate it and give each core its own dedicated unit (this was already possible with the A510 core, too).

In general, ARM states that it has focused on changes that increase core performance significantly but cost little power (with the third ALU apparently having the opposite effect). Also, the memory subsystem (load/store units, queues, and caches) is said to have been tuned for greater efficiency.

Performance and power draw curve for the Cortex-A520

According to ARM, the Cortex-A520 is up to 22% more power efficient than the A510 when produced on the same manufacturing process, meaning that it can achieve a given performance while drawing 22% less power than the Cortex-A510. Alternatively, it can achieve 8% higher performance at the same power limit.
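To put the two figures on a common footing, they can be converted into a change in performance per watt. The snippet below is a minimal sketch of that arithmetic; the helper name `perf_per_watt_gain` is hypothetical, and the only inputs are the two ARM claims quoted above, each treated as a separate operating point on the power curve.

```python
def perf_per_watt_gain(perf_ratio: float, power_ratio: float) -> float:
    """Relative change in performance per watt (0.10 means +10%).

    perf_ratio and power_ratio are new/old ratios, e.g. 1.08 for +8%
    performance or 0.78 for 22% less power.
    """
    if perf_ratio <= 0 or power_ratio <= 0:
        raise ValueError("ratios must be positive")
    return perf_ratio / power_ratio - 1.0


# Same performance at 22% less power (ARM's iso-performance claim).
iso_perf = perf_per_watt_gain(perf_ratio=1.0, power_ratio=1.0 - 0.22)
# 8% more performance at the same power (ARM's iso-power claim).
iso_power = perf_per_watt_gain(perf_ratio=1.08, power_ratio=1.0)

print(f"iso-performance point: {iso_perf:.1%} more performance per watt")  # 28.2%
print(f"iso-power point: {iso_power:.1%} more performance per watt")       # 8.0%
```

The two results differ because they describe different points on the curve: drawing 22% less power for the same work corresponds to roughly 28% more performance per watt, whereas at a fixed power budget the gain is the 8% performance increase itself.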

This may not scale to the highest possible clock speeds, though. If you wanted to squeeze out maximum performance by overclocking, the Cortex-A520 might stop scaling at a lower point. That is not a scenario in which these cores will be used, however. They will run at low clock speeds and will be entrusted with background tasks, or with running the operating system while the device sits idle, waits for user input, or is in power-saving standby mode.

Sources: AnandTech, WikiChip (1, 2, 3)

English translation and edit by Jozef Dudáš


