Why is Zen 4 so fast in Topaz Labs AI apps? In fact it’s Intel’s doing

Zen 4 and VNNI

Ryzen 7000 with the Zen 4 architecture is AMD’s first processor family to support 512-bit AVX-512 vector instructions. We have already discussed their benefits (larger or smaller) in a separate article. But the Zen 4 cores support another instruction set extension that used to be Intel’s pride and joy, and here the roles have now reversed somewhat: VNNI. It seems to bring huge performance improvements in a number of apps, despite the limited 256-bit width of Zen 4’s SIMD units.

You may have heard of VNNI (Vector Neural Network Instructions) before under the name DL Boost. This designation covered two things: the 512-bit VNNI instructions, sometimes also referred to as AVX512_VNNI, and support for operations on the BFloat16 data type (AVX512_BF16). The second extension first appeared in the Cooper Lake server Xeons, while the first one (VNNI) was one of Intel’s highlights for the 10nm Ice Lake and Tiger Lake processors (10th and 11th generation Core for laptops).

Intel promised that VNNI instructions would dramatically increase the performance of these processors in neural network operations, the “AI” applications for which these instructions are explicitly designed. They work with 16-bit and 8-bit integer values, a reduced precision that is useful for inference, i.e. for applying an already trained network. The company then partnered with Topaz Labs so that the developer would use VNNI (via the OpenVINO framework) to optimize its applications (Gigapixel AI, Denoise AI, Video Enhance AI…).
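To make the 8-bit integer arithmetic more concrete, here is a minimal scalar sketch of what the core VNNI dot-product instruction (VPDPBUSD) computes in each 32-bit lane: four unsigned-byte × signed-byte products are summed and added to a 32-bit accumulator, wrapping around on overflow. This is only a Python model for illustration, not the code used by Topaz Labs or OpenVINO; the function names are made up for this example.

```python
from typing import List


def to_int32(value: int) -> int:
    """Wrap a Python integer to a signed 32-bit value (two's complement)."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def vpdpbusd_lane(acc: int, a_u8: List[int], b_s8: List[int]) -> int:
    """Model one 32-bit lane: acc + sum of four (unsigned byte * signed byte) products."""
    if len(a_u8) != 4 or len(b_s8) != 4:
        raise ValueError("each 32-bit lane holds exactly four bytes")
    if not all(0 <= a <= 255 for a in a_u8):
        raise ValueError("first operand must contain unsigned 8-bit values")
    if not all(-128 <= b <= 127 for b in b_s8):
        raise ValueError("second operand must contain signed 8-bit values")
    # The non-saturating form of the instruction wraps around on overflow.
    return to_int32(acc + sum(a * b for a, b in zip(a_u8, b_s8)))


def vpdpbusd(acc: List[int], a_u8: List[int], b_s8: List[int]) -> List[int]:
    """Apply the lane model across a register: 4 bytes of each input per accumulator."""
    if len(a_u8) != 4 * len(acc) or len(b_s8) != 4 * len(acc):
        raise ValueError("inputs must hold four bytes per accumulator lane")
    return [
        vpdpbusd_lane(acc[i], a_u8[4 * i:4 * i + 4], b_s8[4 * i:4 * i + 4])
        for i in range(len(acc))
    ]


if __name__ == "__main__":
    # One lane: 10 + (1*5 + 2*-6 + 3*7 + 4*-8)
    print(vpdpbusd_lane(10, [1, 2, 3, 4], [5, -6, 7, -8]))
```

Without VNNI, the same multiply, widen and accumulate sequence takes several separate vector instructions, which is why fusing it into one matters for inference workloads.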

Intel then showed Topaz Labs apps in its official benchmarks, where they gave the 10th/11th generation quad-core mobile processors higher performance than they would normally achieve. At the time, the advantage over competing processors without VNNI was significant.

Upscaling with AI from Topaz Labs (source: Intel)

Previously an advantage for Intel, now for the competition

With the arrival of Zen 4, however, the tables have turned. Ironically, Intel removed support for AVX512_VNNI instructions from Alder Lake processors because they use 512-bit ZMM registers and are one of the subsets of AVX-512 (albeit a very specific one). Conversely, AMD has jumped in with the Zen 4 core, which introduces these instructions, so the advantage is now on AMD’s side.

In Topaz Labs apps, we did observe performance well above what the Ryzen 7000 averages in the other programs in our reviews. The Ryzen 9 7900X was 90–126 % faster than the Ryzen 9 5900X, and the Alder Lake processors took a similar beating: against those, the Ryzen 9 7900X is 75–95 % faster in these tests, which is far from the results common in other benchmarks and apps. And the 7900X isn’t even the most powerful model in AMD’s Zen 4 lineup. We’ll see whether the Ryzen 9 7950X manages to scale even higher. Even the hexa-core Ryzen 5 7600X already shows really high performance, though.

Zen 4 Benchmarks: AI applications Topaz Labs

Such an extraordinary performance increase for Zen 4 looks suspicious at first, but you may remember from the AVX-512 article that Phoronix found a number of tests using the OpenVINO framework (and hence probably VNNI instructions) where Zen 4 achieved a similar increase of up to 2×. So the explanation is obvious: although the VNNI acceleration in Topaz Labs apps was originally designed for Intel processors, it is also automatically enabled on Ryzen 7000 processors.

Read more: AVX-512 on Ryzen 7000: how useful is it and is AMD’s implementation better than Intel’s?

We asked Topaz Labs directly about this and received confirmation that these programs do indeed use VNNI on Zen 4. And even though AMD implemented AVX-512 using 256-bit units, these instructions clearly deliver enough performance to make them worthwhile. So these scores are not some weird anomaly and do show a legitimate result. The speed boost is so unusually large because it comes from accelerating one specific operation and not from general code performance.

According to Topaz Labs, their applications should also use the form of VNNI that is called AVX2_VNNI or VNNI/256 and was created for Alder Lake processors. Since Intel disabled AVX-512 on these processors, the VNNI instructions using the same 512-bit registers had to be disabled as well: the small Gracemont cores don’t support AVX-512 and only offer AVX2 (apparently with 128-bit units). Because of the usefulness of VNNI, however, Intel made the aforementioned AVX2_VNNI version, which works with just 256-bit registers, for the hybrid processors. On paper, AVX2_VNNI should have just half the compute throughput of the 512-bit version (but so should Zen 4, given its double-pumped 256-bit operation), and it will probably also be slower on the E-cores than on the Golden Cove P-cores.
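The “half the throughput” reasoning can be written out as simple arithmetic. The sketch below is a back-of-the-envelope model only: it assumes that an instruction wider than the execution unit is split into several passes, and it deliberately ignores the number of SIMD units per core, clock speeds, memory bandwidth and everything else that decides real performance. The function names are invented for this illustration.

```python
def int8_macs_per_instruction(register_bits: int) -> int:
    """One 8-bit multiply-accumulate per byte of the vector register."""
    return register_bits // 8


def peak_macs_per_cycle_per_unit(register_bits: int, unit_bits: int) -> float:
    """Peak int8 MACs per cycle for ONE SIMD unit under the pass-splitting assumption."""
    passes = max(1, -(-register_bits // unit_bits))  # ceiling division
    return int8_macs_per_instruction(register_bits) / passes


if __name__ == "__main__":
    cases = {
        "AVX512_VNNI on a native 512-bit unit": (512, 512),
        "AVX512_VNNI on a 256-bit unit (double-pumped)": (512, 256),
        "AVX2_VNNI on a 256-bit unit": (256, 256),
        "AVX2_VNNI on a 128-bit unit": (256, 128),
    }
    for label, (reg, unit) in cases.items():
        print(f"{label}: {peak_macs_per_cycle_per_unit(reg, unit):.0f} MACs/cycle/unit")
```

Under these assumptions, the double-pumped 512-bit case and the 256-bit AVX2_VNNI case come out equal per unit, so the paper numbers alone do not explain why Zen 4 ends up so far ahead of Alder Lake in the measurements.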

Intel slide advertising the high performance of Topaz Labs AI applications, enabled by the AVX512_VNNI instructions of Ice Lake and Tiger Lake processors (source: Intel)

And as the Core i9-12900K results show, the lower performance of AVX2_VNNI compared with the Zen 4 implementation is a very real thing. We originally wondered whether Topaz Labs’ AI applications might ignore the AVX2_VNNI instructions in Alder Lake (or had not yet been modified to make use of them), but the company says that this 256-bit version is actually used and thus Alder Lake does benefit from it in these tests. (Unless, perhaps, its detection and usage was only implemented in a later version than the one our methodology uses?) On the other hand, the performance of other Intel processors that should have the original full-performance 512-bit version of VNNI (Rocket Lake, for example the Core i9-11900K) is relatively low too. Those don’t see the kind of brutal performance increase over their predecessor (Core i9-10900K) that Zen 4 does.
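For readers who want to check which VNNI variant their own processor reports, here is a small sketch. It assumes a Linux system, where the kernel lists CPU features in /proc/cpuinfo under the flag names avx512_vnni and avx_vnni (the latter being the 256-bit AVX2_VNNI form). The selection order in pick_vnni_path is purely our illustration of feature-based dispatch; how Topaz Labs or OpenVINO actually choose a code path is not something we know.

```python
def read_linux_cpu_flags(path: str = "/proc/cpuinfo") -> str:
    """Return the space-separated feature flags of the first CPU, or '' if unavailable."""
    try:
        with open(path) as cpuinfo:
            for line in cpuinfo:
                if line.startswith("flags"):
                    return line.split(":", 1)[1].strip()
    except OSError:
        pass  # not Linux, or the file cannot be read
    return ""


def pick_vnni_path(cpu_flags: str) -> str:
    """Pick a (hypothetical) code path from feature flags, not from the CPU vendor."""
    flags = set(cpu_flags.split())
    if "avx512_vnni" in flags:
        return "avx512_vnni"  # e.g. Ice Lake, Tiger Lake, Rocket Lake, Zen 4
    if "avx_vnni" in flags:
        return "avx2_vnni"    # e.g. Alder Lake
    return "no_vnni"


if __name__ == "__main__":
    print(pick_vnni_path(read_linux_cpu_flags()))
```

Because the decision depends only on the reported feature bits, a check of this kind would enable the 512-bit path on any processor that advertises it, regardless of who made the chip.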

Who knows, perhaps Intel now regrets having invested in accelerating apps like the Topaz Labs software via VNNI and OpenVINO, now that it sees how, at least for the moment, it benefits the competition more than Intel itself…

Sources: Topaz Labs, Intel

English translation and edit by Jozef Dudáš



