Intel AVX-512 tested in x265: how to enable it and does it help?

AVX-512 for x265: how to use it, the effects on performance and the price you pay in power consumption

The 11th generation Intel Core processors (Rocket Lake) are the first mainstream desktop parts to support the AVX-512 instruction set, previously available only in Xeons or on the X299 platform. One area where AVX-512 promises better performance is multimedia, and we looked at one use case: encoding HEVC video with x265. It’s little known, but x265 does not actually use AVX-512 by default. We’ll show you how to turn it on and what the effects are.

We already provide x265 encoding benchmarks for Rocket Lake processors as part of our CPU testing, in which we test in HandBrake with default settings, as many other websites do. These tests suffer from one problem when it comes to AVX-512-capable processors, however: ever since optimizations using AVX-512 instructions were added to x265 (back in 2018), the program has kept them disabled by default. So when you simply launch HandBrake or the command-line x265 executable, they won’t use the built-in AVX-512 support, and Rocket Lake processors won’t benefit from the instructions either. The program only uses AVX2 (plus some other instruction sets such as BMI2, AVX and of course the various versions of SSE).

You can check that this is true in the log, which reports a “using cpu capabilities” line listing the instruction extensions the program actually employs during encoding. On Rocket Lake, it should look like this by default:

x265 [info]: using cpu capabilities: MMX2 SSE2Fast LZCNT SSSE3 SSE4.2 AVX FMA3 BMI2 AVX2

Why doesn’t the x265 encoder use AVX-512 by default? The reason is that these optimizations cover only some of the encoder’s operations, so the boost in encoding FPS is relatively small; don’t expect anything close to the 100% speedup that a 2× wider SIMD vector could in theory achieve in isolated operations. Crucially, AVX-512 doesn’t just process twice the data per instruction, it also increases the power consumption of the CPU core. Xeon processors therefore have to reduce their clock speed when using these instructions, which cuts into the performance gain that is realised.

When x265 first received the assembly optimizations using AVX-512, it turned out that on Xeon processors the performance gain was smaller than the loss caused by the accompanying clock speed reduction. In other words, the net result was a slowdown instead of a speedup. That is why the decision was made to leave this code disabled by default, so x265 won’t use it until you force it on (you can read about it here). And this setting seems to have stayed the same since then. The developers have recommended turning AVX-512 on when, for example, you are encoding 4K content with very slow settings. However, if your processor is overclocked to a fixed clock, or AVX-512 does not reduce its clock speed for some other reason, you should generally see a performance gain. This should hopefully also be the case for Rocket Lake processors, which should keep their clocks high on Z590 motherboards even with AVX-512 engaged.

How to turn AVX-512 on (in x265 and Handbrake)?

Using AVX-512 in x265 can be forced with the command-line parameter --asm avx512. Use this if you are running x265.exe directly. However, if you are using a GUI or frontend, you need to find out how to pass the parameter through to x265.
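For illustration, a direct x265 run with AVX-512 forced on could look something like the following line (the preset and the file names here are just placeholders, use whatever you normally encode with):

x265.exe --preset slow --asm avx512 input.y4m -o output.hevc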

In HandBrake, this is done in the Video encoder settings: select x265, and at the bottom, the “Advanced Options” field holds extra command-line parameters that are passed to x265. There should already be a few there. All you have to do is add a colon (without spaces) at the end, which separates the individual parameters, and append asm=avx512 after it (see the following image).

Turning on AVX-512 for x265 in Handbrake
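Purely as an illustration (the parameters already present depend on your preset; the strong-intra-smoothing entry below is just a made-up placeholder), the field could change like this:

before: strong-intra-smoothing=0
after: strong-intra-smoothing=0:asm=avx512

If the field happens to be empty, simply enter asm=avx512 on its own.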

After that, a look at the log should confirm that x265 now also uses AVX-512. The line listing the instruction extensions utilised should now read:

x265 [info]: using cpu capabilities: MMX2 SSE2Fast LZCNT SSSE3 SSE4.2 AVX FMA3 BMI2 AVX2 AVX512

When you run HandBrake once with this option added and once without it, you’ll see how much Rocket Lake can benefit from AVX-512 in HEVC encoding with x265.

As you can see, on the Rocket Lake models we tested this on, AVX-512 increased performance in our encoding test by 7.5%. The Core i5-11400F gained a little more, +9%. According to the documentation, the benefit could perhaps be slightly better with the very slowest presets. In the end, these 512-bit vector operations have a relatively limited effect on x265 encoding; there is no linear performance scaling compared to 256-bit vectors (AVX2).

Why is that so? AVX-512 is used in various SIMD functions that perform analysis for intra and inter prediction, but this analysis is not always performed on blocks of data large enough for the wider vector to be used efficiently. In addition, the 2× speed boost often does not apply to a whole function, but only to some of its constituent steps. Modern video compression does not consist of calculations with a simple structure that would scale indefinitely with the number of threads or the SIMD width. It is generally said that x265 spends about half its time in these SIMD functions; the rest is C++ code (such as entropy coding) with no hand-written assembly, and AVX-512 has no way to speed up that portion. So although individual AVX-512 instructions provide twice the computing throughput per cycle (a 100% increase), in the end you are left with a gain of around 10% due to all these limits on the applicable scope of the acceleration.
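A rough Amdahl’s law estimate illustrates the limitation (the numbers are only illustrative, derived from our measured result rather than from profiling): if an effective fraction p of total encoding time is doubled in speed, the overall speedup is 1 / ((1 − p) + p/2). Plugging in the measured ~7.5% gain:

1 / ((1 − p) + p/2) = 1.075  →  p ≈ 0.14

In other words, only around 14% of the total runtime effectively enjoys the full 2× acceleration, even though roughly half the time is spent in SIMD functions.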

However, even these modest numbers are nothing to sneeze at in this area of software. As you can see in the interactive chart, while the octa-core Rocket Lake Core i7 and i9 generally can’t match the previous-generation deca-core Core i9-10900K (Comet Lake) in multithreaded applications, AVX-512 changes that, and the eight Rocket Lake cores can now match the ten cores of Comet Lake. They are still nowhere near competing with the Ryzen 9 5900X, though; its higher core count is a bigger advantage than Rocket Lake’s AVX-512.

The extra performance, however, comes with a catch: power consumption increases even more. Unfortunately, our measurements showed that the power increase with AVX-512 turned on is disproportionately large: power draw rose by 28% (Core i7-11700KF) or even 29% (i9-11900K). And it has to be said that power consumption when encoding in x265 was not exactly low to begin with. From about 215–225 W during regular encoding, enabling AVX-512 ramps the power draw up to roughly 270–290 W. The Core i5-11400F had significantly lower overall power draw, but the increase was also large, from 123 W to 159 W, which is 30%. These are power measurements on the 12V cable feeding the processor, so they include not only the CPU’s power consumption but also losses in the VRM that handles voltage regulation for the CPU. This loss is likely not a significant factor in our tests, however, as we use motherboards with very overprovisioned VRMs, which should ensure high efficiency (i.e. low heat loss) of the 12V-to-Vcore conversion.

In any case, the power dissipated in the VRM is still energy consumed and waste heat that needs to be cooled, so it is not to be ignored. It is also not exactly a praiseworthy result for Intel that Rocket Lake actually reached double the power consumption of the Ryzen 9 5900X and 5950X in this test (while offering lower performance). This confirms the efficiency problems of the 14nm process, which gaming tests tend to downplay a lot.

In any case, the result is that turning on AVX-512 on Rocket Lake processors substantially degrades their energy efficiency in x265, by up to 20% (if storage, motherboard and RAM power consumption were included, it would look somewhat less bad). So the same encoding is done a little faster, but also consumes more electricity. Of course, this may also mean more cooling noise, if the fans weren’t already running at 100% without AVX-512.

However, remember that this decrease in energy efficiency is not something that happens universally. If an application gained, say, 50% in performance (which can happen with some numerical-computation codebases, as in HPC), then energy efficiency would improve even with such an increase in power consumption.
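A quick back-of-the-envelope calculation with our measured figures shows the difference (energy per unit of encoded work is roughly power divided by speed):

Core i7-11700KF: 1.28 / 1.075 ≈ 1.19, i.e. about 19% more energy for the same encode
hypothetical 50% speedup at the same +28% power: 1.28 / 1.50 ≈ 0.85, i.e. about 15% less energy

So with a large enough performance gain, the same power increase would actually improve efficiency.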

Rocket Lake temperatures during encoding are not exactly low either. In our test, turning on AVX-512 raised the CPU temperature by 16–17 °C, and the i7 and i9 processors reached 93–95 °C. It is possible that in some cases, if you do not have good case airflow, the CPU could overheat and throttle.


What is quite remarkable are the clock speeds. While it is quite possible that this is only due to the motherboard manufacturer setting the BIOS aggressively, we observed that Rocket Lake processors, despite the high temperatures, did not reduce their clock speed at all when AVX-512 was used in x265. Programmers will be delighted, because they no longer have to check whether AVX-512 increases performance enough for the gain not to be lost, or even outweighed, by the inflicted clock speed reduction, which is probably the single most criticized aspect of Intel’s AVX-512 implementations. On Rocket Lake this did not happen, at least in our test. On other motherboards, such as those in OEM computers, it might turn out differently; if the PL1, PL2 and Tau limits were respected, the conditions would probably change a lot. The Core i7-11700KF kept 4.6 GHz on all cores both with and without AVX-512, and the Core i9-11900K also ran at its maximum all-core value of 4.7 GHz regardless of AVX-512. The same applies to the Core i5-11400F: it also ran at its all-core boost maximum of 4.2 GHz, no matter what.

In short, the power consumption and heat load did not affect the multipliers, and the additional 7.5% of performance that we measured is therefore a pure performance improvement at identical clock speed.

However, this is also the reason for those high power consumption levels: the original architectural intention of AVX-512 is to reduce the multiplier/clock (and therefore the voltage) precisely for the sake of energy efficiency, with the downside that applications that only see a small performance increase from AVX-512 suffer.

Power consumption × performance dilemma

So if you’re going to encode with x265 on a Rocket Lake processor, there is a bit of a dilemma over whether to prefer slightly higher speed at the cost of this deterioration in power consumption and efficiency. For many of you, this may be acceptable. After all, something similar happens when overclocking: you usually also gain a few percent of performance, but power draw shoots up much more than the performance does.

So if the cooler is making too much noise during encoding, or you prefer encoding to run more slowly but consume less energy, you can of course leave AVX-512 turned off.

Turning AVX-512 on/off in x264

Everything written here probably also applies to x264, where our tests of Rocket Lake processors also recorded quite high power consumption, likely also aggravated by the energy inefficiency of the AVX-512 implementation on the 14nm node, combined with the lack of clock speed reduction. For x264, however, we didn’t examine how much performance AVX-512 adds. It could even be a little more than with x265, but the speedup will probably still not exceed some 10%, 15% at most.

The difference compared to x265 is that in x264, the AVX-512 optimizations are used automatically. That means the performance and power consumption effects are already included in the default settings. If you want to tone down your x264 encoding a bit in terms of power consumption and don’t mind that it will be a bit slower, you can try the reverse procedure, i.e. turn AVX-512 off manually.

The parameter is the same, --asm, but this time you use --asm AVX2. This limits the SIMD extensions used to AVX2 and below, which excludes AVX-512 (the parameter more or less specifies the highest extension the program may use, and AVX512 sits above AVX2 in the list). If you use HandBrake, it’s similar: do the same thing as in the instructions for enabling AVX-512, but in this case add asm=AVX2.
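Again just for illustration (the file names are placeholders, and we assume x264 accepts the capability name the same way as described above), a direct x264 run limited to AVX2 could look like this:

x264.exe --preset slow --asm AVX2 -o output.264 input.y4m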


One more note: during our testing, a peculiar thing kept happening with HandBrake. After adding the asm=avx512 parameter, we didn’t see the increase in performance until we restarted the application (after a restart, performance and power consumption always went up), and the same applied in reverse when removing it. It doesn’t make much sense, because x265 should accept the parameter right away without restarting HandBrake, but we wanted to warn you about this glitch in case the anomaly happens to you too. It could be a bug in HandBrake (we ran into it when switching between the default x265 profile and a modified version with asm=avx512 added).

Will Sapphire Rapids-X/Skylake-X get better boosts from AVX-512?

As you probably know if you’re interested in AVX-512, Rocket Lake (or Tiger Lake in laptops) doesn’t have a full-speed AVX-512 implementation, unlike server processors and CPUs for the X299 platform. We haven’t measured the AVX-512 speedup on the X299 platform, but we don’t think the performance improvement would be much higher there.

The slower version of AVX-512 in mainstream processors does not actually differ that much from the full-speed version in Intel’s server CPUs. Server CPUs have an extra 512-bit FMA unit, so they can execute FMA instructions at twice the throughput of AVX2, while the client version only uses the existing 256-bit FMA units (which regular Skylake processors already have) and delivers half the server’s performance in these operations.

But this difference probably only matters for floating-point FMA operations, which are important, for example, in scientific computing. Video encoders and other multimedia software such as x265 typically work with integer operations, and in those instructions the limited client version of AVX-512 has the same high performance as the server version; 3 or 4 operations such as 512-bit integer additions can be executed per cycle. Therefore, even this client version of AVX-512 gets the improved performance in x265, and conversely you can’t expect much more from the server version, because it has no extra resources for integer operations (apart from a larger L2 cache).

However, this does not mean that a future generation of Intel’s architecture could not increase the performance advantage of AVX-512 in general, of course. So it is possible that the next-gen server core (in Sapphire Rapids processors) will actually see bigger speedups in x265 than Rocket Lake does. We’ll see.

Translated, original text by:
Jan Olšan, editor for Cnews.cz

