Next generation custom server chips for Amazon AWS cloud
Last month, Microsoft unveiled their first custom processors being developed for datacenter and Azure services. Also Amazon, which was the first of these US hyperscalers to go the custom hardware route, is now launching new CPUs for its servers. And with it Trainium2, already the second generation of an in-house developed AI accelerator. Amazon also revealed that it has already produced over two million of its CPUs.
Graviton4: 96-core ARM Neoverse V2
Amazon released its first ARM processor for servers in 2018, back then it was more of a proof-of-concept SoC. The fundamental change was the use of ARM’s Neoverse N1 licensed architecture (Graviton2 in 2019), after which Graviton3 switched to the more powerful Neoverse V1 cores last year. Now a new generation of custom processors for Amazon AWS has been announced – the Graviton4.
Graviton4 has 96 cores (50% more than the previous generation) and provides a 12-channel DDR5-5600 memory controller. It is therefore equal in channel count to AMD’s Epyc 9004 Genoa, but it officially supports faster memory, so it achieves higher bandwidth; however, the Epyc 9004 provides 2x as many threads thanks to SMT. The processor is again chiplet-based, but unlike AMD’s design it uses one large silicon die with CPU cores (probably 4nm) and several smaller IO chiplets around it, while AMD’s uses one large IO chiplet as the central unifying element and the cores are in small CPU chiplets connected to it.
Unfortunately, Amazon doesn’t share many more details. We don’t know what the core clock speed is, for example, and the architecture used hasn’t been named either. However, The Next Platform states that it should be Neoverse V2 cores with 2MB of L2 cache and 192MB of L3 cache overall. Neoverse V2 is a modification of the mobile Cortex-X3 core. It’s also the same foundation on which Nvidia’s processor is built.
This core is already an architecture implementing the ARMv9 instruction set and SVE2 instruction support, but compared to the previous 256-bit V1 generation, the physical width of the SIMD units has been reduced to only 128 bits (i.e., like NEON/SSEx instructions), so it will likely not break records in raw SIMD compute performance (e.g., during scientific computing), especially if comparing Graviton4 against high-performance AVX-512 using x86 processors.
According to Amazon, Graviton4 has 30% better performance than Graviton3 in web applications, up to 40% better performance in database applications and 45% better performance in Java applications.
The processor will be available in EC2 R8g cloud instances. These, according to Amazon, will reach a maximum of 3x more threads than the current-gen R7g with Graviton3, which could mean that Graviton4 can now be run in a 2S configuration. These instances are only available for testing for now, commercial availability should come in the next few months.
Trainium2 against Nvidia
In addition to the processors, Amazon has also started working on its own AI accelerator design, a replacement for Nvidia’s GPUs designed to accelerate various AI applications (or train them). It is said to deliver up to 4x the performance of the previous first generation Trainium and up to twice the power efficiency.
Judging by the package shown, it should be an accelerator consisting of two symmetric computing chiplets, where each has an HBM memory in two packages. In total, the accelerator has 96GB of memory, so presumably 24GB packages are used.
Unfortunately, Amazon hasn’t revealed any other technical details here either. The original Trainium supported FP32, TF32, BF16, FP16, INT8 and FP8 neural network computations, with a claimed 190 TFLOPS performance in FP16/BF16 calculations. The quadruple for Trainium2 would mean 760 TFLOPS, but it is likely that the 4× is only meant as a ballpark figure.
In AWS server installations, Trainium2 is to be used in clusters of 16 accelerators and will be liquid cooled. Scaling is reportedly possible to even larger clusters of up to 100,000 accelerators, which could reportedly achieve performance of up to 65 EXAFLOPS (of course, only in AI operations, this is not a number comparable to the FP64 TFLOPS figures listed for supercomputers or FP32 general compute GPU performance figures).
Sources: The Next Platform, Serve The Home, Tom’s Hardware
English translation and edit by Jozef Dudáš
⠀⠀