Why is Zen 4 so fast in Topaz Labs AI apps? In fact it’s Intel’s doing

Zen 4 and VNNI

Ryzen 7000 processors with the Zen 4 architecture are the first AMD CPUs to support the 512-bit AVX-512 vector instructions. We’ve already discussed their (larger or smaller) benefits in a separate article. But Zen 4 cores also support another instruction set extension that used to be Intel’s pride and joy, and here the roles have now somewhat reversed: VNNI. It appears to bring huge performance gains in a number of applications, despite the limited 256-bit width of Zen 4’s SIMD units.

You may have heard of VNNI (Vector Neural Network Instructions) before under Intel’s branding DL Boost. That label covered two things: the 512-bit VNNI instructions, sometimes also referred to as AVX512_VNNI, and support for operations on the BFloat16 data type (AVX512_BF16). The latter extension debuted in the Cooper Lake server Xeons, while VNNI was one of Intel’s headline features for the 10nm Ice Lake and Tiger Lake processors (10th and 11th generation Core for laptops).

Intel promised that VNNI would dramatically increase these processors’ performance in neural network workloads, the “AI” applications these instructions were explicitly designed for. They operate on 8-bit and 16-bit integers, a reduced precision that is well suited to inference, i.e. running an already trained network. Intel then partnered with Topaz Labs to have them use VNNI (via the OpenVINO framework) to optimize their applications (Gigapixel AI, Denoise AI, Video Enhance AI…).
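To make “8-bit integer” work more concrete, here is a small Python sketch modelling the arithmetic of VPDPBUSD, the core VNNI dot-product instruction: in each 32-bit lane it multiplies four unsigned 8-bit values by four signed 8-bit values and adds the four products to a 32-bit accumulator. This is only a scalar illustration of the instruction’s semantics, not how Topaz Labs or OpenVINO actually invoke it; the function name `vpdpbusd` is our own.

```python
def vpdpbusd(acc, a_u8, b_s8):
    """Scalar model of VPDPBUSD (the VNNI u8 x s8 dot-product step).

    acc  -- signed 32-bit accumulators, one per 32-bit lane
    a_u8 -- unsigned 8-bit inputs (e.g. activations), four per lane
    b_s8 -- signed 8-bit inputs (e.g. weights), four per lane
    Returns the new accumulators as a list of ints.
    """
    if not (len(a_u8) == len(b_s8) == 4 * len(acc)):
        raise ValueError("need exactly four byte pairs per 32-bit lane")
    result = []
    for lane, total in enumerate(acc):
        for i in range(4 * lane, 4 * lane + 4):
            if not (0 <= a_u8[i] <= 255 and -128 <= b_s8[i] <= 127):
                raise ValueError("inputs must be u8 and s8 values")
            total += a_u8[i] * b_s8[i]  # in hardware each product fits in 16 bits
        # The non-saturating VPDPBUSD wraps the sum to a signed 32-bit value
        # (the VPDPBUSDS variant would saturate instead).
        result.append((total + 2**31) % 2**32 - 2**31)
    return result


# A 512-bit ZMM register holds 16 such lanes, i.e. 64 byte pairs per instruction.
print(vpdpbusd([0], [1, 2, 3, 4], [5, -6, 7, -8]))  # prints [-18]
```

Without VNNI, the same step typically takes a sequence of three instructions (VPMADDUBSW, VPMADDWD and VPADDD), which is where much of the speed-up in int8 inference comes from.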

Intel then featured Topaz Labs apps in its official benchmarks, where they gave the 10th and 11th generation quad-core mobile processors higher performance than they would normally achieve. At the time, the advantage over competing processors without VNNI was significant.

Upscaling with AI from Topaz Labs (source: Intel)

Previously an advantage for Intel, now for the competition

With the arrival of Zen 4, however, the tables have turned. Ironically, Intel removed support for the AVX512_VNNI instructions from its Alder Lake processors, because they use the 512-bit ZMM registers and are one of the subsets of AVX-512 (albeit a very specific one). AMD, by contrast, has introduced these instructions with the Zen 4 core, so the advantage is now on its side.

In Topaz Labs apps, we did observe performance well above the Ryzen 7000’s average across the other programs in our reviews. The Ryzen 9 7900X was 90–126 % faster than the Ryzen 9 5900X, but even the Alder Lake processors took a similar beating: against those, the Ryzen 9 7900X is 75–95 % faster in these tests, which is out of line with the results common in other benchmarks and apps. And the 7900X isn’t even the most powerful model in AMD’s Zen 4 lineup; we’ll see whether the Ryzen 9 7950X manages to scale even higher. Even the six-core Ryzen 5 7600X already shows very high performance.

Zen 4 benchmarks: Topaz Labs AI applications



Such an extraordinary performance increase from Zen 4 looks suspicious at first, but you may remember from the AVX-512 article that Phoronix found a number of tests using the OpenVINO framework (and hence probably VNNI instructions) where Zen 4 achieved a similar increase of up to 2×. So the explanation is fairly clear: although the VNNI acceleration in Topaz Labs apps was originally designed for Intel processors, it is automatically enabled on Ryzen 7000 too.
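Why does code written with Intel in mind switch on by itself on Ryzen 7000? Software usually selects its optimized code path by checking CPU feature flags, not the vendor name. The Python sketch below is a hypothetical dispatcher illustrating that idea, assuming the flag names Linux reports in /proc/cpuinfo (`avx512_vnni`, `avx_vnni`, `avx2`); the function names and the order of preference are our own assumptions and not OpenVINO’s or Topaz Labs’ actual logic.

```python
def pick_int8_kernel(cpu_flags):
    """Hypothetical dispatcher: choose an int8 code path from CPU feature flags.

    cpu_flags -- flag names as they appear in the "flags" line of Linux
    /proc/cpuinfo. Note that the CPU vendor is never checked.
    """
    flags = set(cpu_flags)
    if "avx512_vnni" in flags:
        return "avx512_vnni"  # 512-bit VNNI, e.g. Tiger Lake, Rocket Lake, Zen 4
    if "avx_vnni" in flags:
        return "avx_vnni"     # 256-bit VNNI added for Alder Lake
    if "avx2" in flags:
        return "avx2"         # no VNNI: each dot product needs several instructions
    return "scalar"


def read_linux_cpu_flags(path="/proc/cpuinfo"):
    """Return the flag list of the first CPU listed in /proc/cpuinfo (Linux x86)."""
    with open(path) as f:
        for line in f:
            if line.startswith("flags"):
                return line.split(":", 1)[1].split()
    return []
```

On a Linux machine with a Ryzen 7000, `pick_int8_kernel(read_linux_cpu_flags())` would be expected to return `"avx512_vnni"`, exactly as on an Intel CPU that advertises the same flag.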

Read more: AVX-512 on Ryzen 7000: how useful is it and is AMD’s implementation better than Intel’s?

We asked Topaz Labs directly about this and received confirmation that these programs do indeed use VNNI on Zen 4. And although AMD implemented AVX-512 using 256-bit units, these instructions clearly perform well enough to be worthwhile. So these scores are not some weird anomaly but a legitimate result. The speed boost only looks anomalous because it comes from accelerating one specific operation, not from higher general code performance.

According to Topaz Labs, their applications should also use the variant of VNNI called AVX2_VNNI or VNNI/256 (Intel’s official name is AVX-VNNI), which was created for the Alder Lake processors. Since Intel disabled AVX-512 on these processors, the VNNI instructions that use the same 512-bit registers had to be disabled as well. The small Gracemont cores don’t support AVX-512 at all, only AVX2 (apparently with 128-bit units). Because VNNI is so useful, though, Intel created the aforementioned AVX2_VNNI version for the hybrid processors, which works with just 256-bit registers. AVX2_VNNI should only have half the compute throughput of the 512-bit version (but so should Zen 4, since it executes each 512-bit instruction as two 256-bit halves), and it will probably also be slower on the E-cores than on the Golden Cove P-cores.
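The “half the throughput” argument can be sketched with simple arithmetic. The Python model below counts the int8 multiply-accumulates (MACs) one VNNI dot-product instruction performs for a given register width, and how many of them fit into one pass through the datapath when the register is wider than the hardware (as on Zen 4). It is a deliberately simplified assumption: it ignores the number of execution ports, clock speeds, latency and memory bandwidth, and the function names are hypothetical.

```python
def int8_macs_per_instruction(register_bits):
    """u8 x s8 multiply-accumulates done by one VNNI dot-product instruction."""
    lanes = register_bits // 32  # one 32-bit accumulator per lane
    return lanes * 4             # four byte pairs per lane


def int8_macs_per_pass(register_bits, datapath_bits):
    """Simplified model: a register wider than the datapath is processed in
    several passes (e.g. Zen 4 runs a 512-bit op as two 256-bit halves)."""
    passes = -(-register_bits // datapath_bits)  # ceiling division
    return int8_macs_per_instruction(register_bits) // passes


configs = [
    ("AVX512_VNNI on native 512-bit units", 512, 512),
    ("AVX512_VNNI on Zen 4 (2x 256-bit)", 512, 256),
    ("AVX2_VNNI on 256-bit units", 256, 256),
]
for name, reg_bits, path_bits in configs:
    print(f"{name}: {int8_macs_per_pass(reg_bits, path_bits)} int8 MACs per pass")
```

In this model both Zen 4 and AVX2_VNNI land at 32 MACs per pass versus 64 for native 512-bit units, which is the halving the text describes; the measured gaps between real processors also depend on everything the model leaves out.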

Intel slide advertising the high performance of Topaz Labs AI applications, enabled by the AVX512_VNNI instructions of Ice Lake and Tiger Lake processors (source: Intel)

And as the Core i9-12900K results show, AVX2_VNNI’s lower performance compared with the Zen 4 implementation is very real. We originally wondered whether Topaz Labs’ AI applications might be ignoring the AVX2_VNNI instructions in Alder Lake (or had not yet been updated to use them), but the company says that this 256-bit version is in fact used, so Alder Lake does benefit from it in these tests. (Unless support for it was only added in a later version than the one our methodology uses?) However, the performance of the Intel processors that should have the original, full-performance 512-bit version of VNNI (Rocket Lake, for example the Core i9-11900K) is relatively low too. They don’t see anything like the brutal performance increase over their predecessor (Core i9-10900K) that Zen 4 does.

Who knows, perhaps Intel now regrets investing in accelerating apps like Topaz Labs software via VNNI and OpenVINO, now that it sees how, at least for the moment, this benefits the competition more than itself…

Sources: Topaz Labs, Intel

English translation and edit by Jozef Dudáš
