Zen 4 and VNNI
Ryzen 7000 with the Zen 4 architecture is the first AMD processor family to support 512-bit AVX-512 vector instructions. We’ve already discussed their benefits, big and small, here. But the Zen 4 cores support another instruction set extension that used to be Intel’s pride and joy, with the roles now somewhat reversed: VNNI. It seems to bring huge performance improvements in a number of apps, despite the limited 256-bit width of Zen 4’s SIMD units.
You may have heard of VNNI (Vector Neural Network Instructions) before under the name DL Boost. That designation covered the 512-bit VNNI instructions, sometimes referred to as AVX512_VNNI, on the one hand, and support for operations on the BFloat16 data type (AVX512_BF16) on the other. The second extension debuted in the Cooper Lake server Xeons, while the first one (VNNI) was one of Intel’s highlights for the 10nm Ice Lake and Tiger Lake processors (10th and 11th generation Core for laptops).
Intel promised that VNNI instructions would dramatically increase the performance of these processors in neural network operations, the “AI” applications for which these instructions are explicitly designed. They operate on 16-bit and 8-bit integer values, which are useful for inference, i.e. for applying an already trained network. The company then partnered with Topaz Labs to have them use VNNI (via the OpenVINO framework) to optimize their applications (Gigapixel AI, Denoise AI, Video Enhance AI…).
Intel then showcased Topaz Labs apps in its official benchmarks, where the 10th/11th generation quad-core mobile processors achieved higher performance than they normally would. At the time, the advantage over competing processors without VNNI was significant.
Previously an advantage for Intel, now for the competition
With the arrival of Zen 4, however, the tables have turned on this one. Ironically, Intel removed support for the AVX512_VNNI instructions from Alder Lake processors, because those instructions use 512-bit ZMM registers and form one of the subsets of AVX-512 (albeit a very specific one). Conversely, AMD has jumped in with the Zen 4 core, which introduces these instructions, so the advantage is now on AMD’s side.
In Topaz Labs apps, we did observe performance well above the Ryzen 7000’s average in other programs in our reviews. The Ryzen 9 7900X was 90–126 % faster than the Ryzen 9 5900X, but even the Alder Lake processors took a similar beating – against those, the Ryzen 9 7900X is 75–95 % faster in these tests, which isn’t really in line with results common in other benchmarks and apps. And the 7900X isn’t even the most powerful model AMD has in the Zen 4 lineup; we’ll see if the Ryzen 9 7950X manages to scale even higher. Even the hexacore Ryzen 5 7600X already shows really high performance, though.
Zen 4 benchmarks: Topaz Labs AI applications
Such an extraordinary performance increase from Zen 4 looks suspicious at first, but you may remember from the AVX-512 article that Phoronix found a number of tests using the OpenVINO framework (and hence probably VNNI instructions) where Zen 4 achieved similar increases of up to 2×. So the explanation is obvious: although the VNNI acceleration in Topaz Labs apps was originally designed for Intel processors, it is automatically enabled on Ryzen 7000 as well.
We asked Topaz Labs directly about this and received confirmation that these programs do indeed use VNNI on Zen 4. And despite AMD implementing AVX-512 on 256-bit units, these instructions clearly have enough performance to make it worthwhile. So these scores are not some weird anomaly, but a legitimate result – the speed boost only looks anomalous because it comes from accelerating a specific operation, not from general code performance.
According to Topaz Labs, their applications should also use the form of VNNI called AVX2_VNNI (or VNNI/256), which was created for Alder Lake processors. Since Intel disabled AVX-512 on those processors, the VNNI instructions using 512-bit registers had to be disabled as well. The small Gracemont cores don’t have them and only support AVX2 (apparently with 128-bit units). Because VNNI is so useful, however, Intel created the aforementioned AVX2_VNNI version, which works with just 256-bit registers, for its hybrid processors. AVX2_VNNI should have only half the compute throughput per instruction (but so should Zen 4, given its double-pumped 256-bit operation), and it will also probably be slower on the E-Cores than on the Golden Cove P-Cores.
And as the Core i9-12900K results show, the lower performance of AVX2_VNNI compared to the Zen 4 implementation is a very real thing. We originally wondered whether Topaz Labs’ AI applications ignore the AVX2_VNNI instructions on Alder Lake (or had not yet been modified to make use of them), but the company says this 256-bit version is actually used, so Alder Lake does benefit from it in these tests. (Unless its detection and usage is implemented in a later version than the one our methodology uses, perhaps?) On the other hand, the performance of other Intel processors that should have the original full-performance 512-bit version of VNNI (Rocket Lake, e.g. the Core i9-11900K) is relatively low too. Those don’t see a similarly brutal performance increase over their predecessor (Core i9-10900K) that Zen 4 does.
Who knows, perhaps Intel now regrets having invested in accelerating apps like the Topaz Labs software via VNNI and OpenVINO, now that it sees how – at least for the moment – this benefits the competition more than Intel itself…
Sources: Topaz Labs, Intel
English translation and edit by Jozef Dudáš