Mobile Zen 5 is here: Ryzen AI 300 “Strix Point” SoC detailed

Ryzen AI 300 released

The Ryzen AI 300 mobile CPUs with the Zen 5 architecture officially launched on Sunday. There’s a lot of news to go along with it: a third model has been added to form the top of the range, and we have learned various other architectural details of the laptop version of Zen 5 (and Zen 5c), including information about the implementation of AVX-512, which, as leaked before, will have lower performance than the fully 512-bit desktop Ryzen 9000.

Zen 5 and Zen 5c: Hybrid concept without the usual drawbacks

Ryzen AI 300 uses the new Zen 5 core (which AMD claims delivers 16% better performance per 1 MHz on average). Ryzen AI 300 is also internally referred to as the Strix Point APU and is an implementation of Zen 5 on a monolithic chip (while the desktop Zen 5 for the AM5 socket is chiplet-based). The manufacturing node is the same 4nm class as the previous generation Phoenix (and its refresh Hawk Point), more specifically TSMC’s N4P node. The die area is quite large at 232.5 mm².

The Strix Point chip contains 12 cores (giving 24 threads thanks to SMT), which represents the first increase in core count since the eight-core Ryzen 4000 “Renoir” from 2020. However, a hybrid combination is used: four of the cores are full-fat big Zen 5 cores, while the other eight are Zen 5c.

We have dedicated a separate article to the Zen 5 CPU core, where we analyse the changes and improvements AMD has made to this architecture. The company has in the meantime provided further, more detailed information, so we have updated and expanded the article with all the new findings:

Read more: Zen 5: AMD’s biggest innovation since first Zen [expanded deep dive]

256-bit units for AVX-512

The main distinguishing feature of Zen 5 in the Ryzen AI 300 (Strix Point) processor is that Strix Point’s FPU (which also executes SIMD instructions) is optimized for lower performance but better power consumption and smaller area compared to the desktop Ryzen 9000 processors. The desktop versions have full 512-bit units with twice the raw computing performance.

For the mobile version of Zen 5 in Ryzen AI 300, AMD used a version of the core that is said to still use 256-bit units, as in Zen 4. The core still supports the same 512-bit AVX-512 instructions, but executes them in two passes, like Zen 4. These narrower 256-bit units apply to both the Zen 5 and Zen 5c cores used in Strix Point, so it’s not a specific feature of Zen 5c, but a choice made for the entire APU. In contrast, the server Zen 5c will apparently be 512-bit.

Performance will therefore improve through the various other architectural improvements in Zen 5, but there will not be the 2× increase in raw maximum theoretical compute performance over Zen 4 that desktop Zen 5 delivers. We’ll see exactly what impact this has in real-world applications only once both processors are reviewed.
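The gap in raw SIMD throughput can be sketched with simple arithmetic. The sketch below assumes two FMA pipes per core, counts an FMA as 2 FLOPs, and uses an illustrative 5.0 GHz clock; these are common conventions for peak-throughput estimates, not AMD-confirmed figures.

```python
# Back-of-the-envelope peak FP32 throughput per core.
# Assumptions (illustrative, not official specs): 2 FMA pipes per core,
# an FMA counted as 2 FLOPs, and a 5.0 GHz clock for both variants.
def peak_fp32_gflops(datapath_bits, fma_pipes=2, ghz=5.0):
    lanes = datapath_bits // 32          # FP32 lanes per SIMD pipe
    return lanes * fma_pipes * 2 * ghz   # 2 FLOPs per FMA

desktop = peak_fp32_gflops(512)  # full 512-bit units (desktop Zen 5)
mobile = peak_fp32_gflops(256)   # 256-bit units, AVX-512 in two passes

print(desktop, mobile, desktop / mobile)  # → 320.0 160.0 2.0
```

Whatever the absolute numbers, the ratio is the point: halving the datapath width halves the theoretical SIMD peak, which is exactly the 2× difference the article describes.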

Like with Zen 4c, AMD says the efficient Zen 5c cores should have the same IPC (i.e., same performance per 1 MHz) as the big Zen 5 and support all the instructions and technology of the big cores. They differ only by optimizations that lead to a smaller area on the chip (e.g. less use of custom logic macros, higher density due to the core being divided into fewer partitions). These optimizations reduce the maximum clock speed of the core at the same time.

Presentation of the Zen 5 architecture, Granite Ridge and Strix Point (Author: AMD, via: ComputerBase)

The way it should work is that the big (classic) cores are used for single-threaded and low thread count programs and games, while the little (dense) cores join in multi-threaded tasks that use up all the cores. A laptop processor, which tends to be constrained by a relatively low TDP, will run at significantly lowered clock speeds in such applications in the first place, so the lower clock speed ceiling of the Zen 5c cores may not matter much. And, unlike Intel’s approach, whose small E-Cores have a completely different architecture, there shouldn’t be any serious differences that need to be specially handled, like when Intel had to disable AVX-512 support.

In the Strix Point chip, the Zen 5c cores are about 25% smaller than the large Zen 5 cores, according to AMD. So they are not as area-efficient as Intel’s E-Cores. But the company says their use is aimed at improving power efficiency as much as at reducing area. At least in laptops, the processor should prefer to run tasks on the Zen 5c cores, moving a task to the classic Zen 5 cores only when high performance is needed.

This scheduling is handled by the Windows operating system. The Ryzen AI 300 processor uses a special interface through which its system management unit gives Windows feedback to help allocate tasks more efficiently to one or the other core type. The system should therefore be able to intelligently adapt to the situation and the nature of the running application. This technology is different from what Intel processors use (the Thread Director technology), but it performs more or less the same role. In general, however, scheduling is a much simpler task on the Ryzen AI 300 processors, as both core types have the same architecture, including equivalent SMT support, and the same performance per 1 MHz (except for differences in L3 cache capacity, which can have a real impact); they differ only in maximum clock speed and in the slightly better power efficiency of the Zen 5c cores.


Return to two CCXs

To go with its 12 (4+8) cores, the Strix Point chip has a 24MB L3 cache, compared to just 16MB in the previous generation APUs. However, it’s more complicated than that: AMD has split the cores into two CCX blocks. The four Zen 5 cores form one CCX, which is given a 16 MB L3 cache (i.e. 4 MB per core, as in the desktop version of Zen 5). The efficient Zen 5c cores form a second CCX with its own separate L3 cache block of only 8 MB. This gives a total of 24 MB, but a program running on a particular core does not have full access to the entire capacity; the portion of the L3 cache belonging to the other CCX is accessed indirectly via the processor’s interconnect logic.

Incidentally, this is the first case where AMD processors have such an asymmetric CCX design and asymmetric L3 cache capacities per core. Theoretically, it is possible that by tweaking such a configuration in the Strix Point chip, a path to deploying similar hybrid designs in desktops and servers will be opened, so that, for example, some future powerful Ryzen could have one chiplet with “classic” cores and one chiplet with power-efficient dense cores. Or the chiplets could even be directly manufactured with a mix of cores (for example, eight Zen 6 and eight Zen 6c cores). But nothing is known about such future desktop processors yet, so it may be something that only comes after the AM5 platform is superseded by a next-generation successor.


Reduced connectivity optimized for laptops

In addition to the hybrid mix of cores, Strix Point contains one more optimization for use in laptops, which will make its deployment in a desktop version for the AM5 socket a bit more complicated. AMD has cut back on the PCI Express connectivity the chip provides for the sake of die area and also supposedly to improve power efficiency. It has only 16 PCIe 4.0 lanes, while the previous APU had 20 lanes. So in laptops, there will be ×8 lanes for an additional GPU, ×4 PCIe 4.0 lanes for an SSD, and four lanes for other purposes (Wi-Fi, LAN, second SSD).

It is not clear whether AMD is preparing a version of these processors for the AM5 desktop socket, something like a Ryzen 9000G. Such a chip would have to reserve four lanes for the chipset (if it’s not possible to use only two) and more lanes for the two M.2 slots that the AM5 platform counts on. This means that a desktop Strix Point processor could realistically spare only four lanes for the ×16 slot for a discrete graphics card, or else it would have to cut connectivity for the SSDs or the chipset.

However, it is possible that AMD will not want to productise a desktop Strix Point APU at all, the large die area of the Strix Point chip making it uneconomical, and an AM5 socket version will never be created. But AMD otherwise states that the reduction in the number of lanes was a decision made specifically for this particular generation, and it’s possible that future generations will have room for more lanes again. So desktops may perhaps get a next-generation APU with Zen 6 even if Strix Point is skipped.

In addition to PCI Express, the chip provides native 40Gbps USB4 support (functionally equivalent to Thunderbolt) directly on the chip; the SoC supports two such ports. Thus, it will be possible to connect external GPUs in docks via USB4. A trio of 10Gb/s USB (3.2 Gen 2) ports and three USB 2.0 ports (these will typically be used for internal laptop peripherals) are also provided.

New graphics with RDNA 3.5 architecture

Strix Point / Ryzen AI 300 also brings a significant improvement in the graphical performance it can provide for gaming. Its iGPU has been upgraded to 16 CU blocks (or 8 WGPs) providing 1024 shaders, 16 ray accelerators for raytracing graphics, and 4 Render Backends with 32 ROPs. This is integrated into a single Shader Engine and the GPU has 2MB of L2 cache available. AMD states that you can play games with “quality comparable to gaming consoles” on a laptop with this integrated GPU.

The GPU uses the RDNA 3.5 architecture, which is not entirely new, but contains some improvements over the RDNA 3 used in previous generations. In particular, RDNA 3.5 is designed to achieve better power efficiency. AMD claims that this is what allowed the GPU to grow to 1024 shaders (a third more) within the same power “budget” as the previous generation of processors, whose GPU had 768 shaders (12 CUs) of the RDNA 3 architecture.

The maximum clock speed of the GPU is roughly the same at 2900 MHz, and the GPU delivers a theoretical performance of over 11 TFLOPS. According to AMD, the GPU has roughly 30% better performance, the vast majority of which comes from the extra compute units.
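The “over 11 TFLOPS” figure can be reproduced from the published specs, assuming the usual marketing conventions: an FMA counted as 2 FLOPs and the RDNA 3/3.5 dual-issue capability included in the peak.

```python
# Theoretical FP32 peak of the Radeon 890M iGPU from its published specs.
# Counting conventions (assumed, as is usual for such figures): an FMA
# counts as 2 FLOPs and dual-issue doubles the per-clock rate.
shaders = 1024       # 16 CUs x 64 ALUs
dual_issue = 2       # RDNA 3/3.5 dual-issue
flops_per_fma = 2
clock_ghz = 2.9

tflops = shaders * dual_issue * flops_per_fma * clock_ghz / 1000
print(round(tflops, 1))  # → 11.9
```

The result, about 11.9 TFLOPS, matches the “over 11 TFLOPS” AMD quotes; without counting dual-issue, the peak would be half that.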

Architectural improvements, more efficient compression

In addition to improved efficiency, RDNA 3.5 has some architectural improvements that enhance GPU capabilities. Texture sampling has doubled in performance and accelerates the use of point-sampling. Output from texture units can now be shared by multiple ALUs. Additionally, the ROP units have received some optimizations that improve data locality, which in turn can improve efficiency in subsequent operations as memory accesses are reduced.


The compute units have had their dual-issue capability improved: some operations that RDNA 3 could only execute once per cycle can now be performed twice per cycle (this should cover some comparisons and interpolations). Additionally, the scalar ALUs within the CU have gained support for floating-point operations, and the shaders can also eliminate write-once writes to the GPU’s general-purpose registers in some cases.

The performance of the GPU is indirectly boosted by the bandwidth of LPDDR5(X) memory. The Ryzen AI 300 processors support LPDDR5X memory at speeds up to 7500 MHz (effective speed), which, with a 128-bit memory bus, delivers throughput of up to 120 GB/s. DDR5 memory is supported at 5600 MHz.

However, real-world performance may be slightly better, because the RDNA 3.5 graphics also compresses data written to and read from memory more efficiently, so it can extract a little more effective performance from a given physical RAM bandwidth. This compression improvement comes down to the GPU being able to use larger blocks of data for compression more often, where the compression ratio is better than with smaller blocks.

8K monitor support, DisplayPort 2.1 UHBR 10

This integrated GPU supports driving displays with resolutions up to 8K (7680 × 4320 pixels) at 60 Hz or 4K (3840 × 2160) at up to 240 Hz. Strix Point can also decode and encode 8K video in the AV1 and HEVC (H.265) formats. Video outputs supported are HDMI 2.1 and DisplayPort 2.1 at UHBR 10 (this is the slowest DisplayPort 2.1 speed class, but still has 50% higher throughput than DisplayPort 1.4a). In total, the integrated GPU supports the connection of up to four monitors or screens at the same time.
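The “50% higher throughput” claim checks out if you compare effective payload bandwidth after line encoding, which is the fairer comparison since DisplayPort 2.1 switched from 8b/10b to the much lighter 128b/132b coding:

```python
# Effective (post-encoding) payload bandwidth of a 4-lane DisplayPort
# link, in Gbps. Link rates and encodings are the published spec values.
hbr3 = 4 * 8.1 * (8 / 10)        # DP 1.4a HBR3: 8.1 Gbps/lane, 8b/10b
uhbr10 = 4 * 10.0 * (128 / 132)  # DP 2.1 UHBR 10: 10 Gbps/lane, 128b/132b

print(round(hbr3, 2), round(uhbr10, 2), round(uhbr10 / hbr3, 2))
# → 25.92 38.79 1.5
```

Raw link rate alone (40 vs. 32.4 Gbps) would only be about 23% higher; the more efficient encoding is what brings the effective gain to roughly 50%.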

NPU: XDNA 2 unit for AI

As the new processor naming implies, one of the focuses of the Ryzen AI 300 is the AI acceleration provided by the NPU for the needs of Copilot+ PCs, but possibly for other purposes as well. Microsoft requires an NPU with at least 40 TOPS of performance, which Strix Point meets. It should even have the most powerful NPU among its competitors at the moment, at least according to the officially stated performance: a spec of 50–55 TOPS at INT8 precision.

The NPU in Strix Point uses the new XDNA 2 architecture, which comes from Xilinx technology, with 32 tiles (2× more than the previous generation). The NPU has 60% more integrated working memory and also twice the performance when running multiple AI applications simultaneously: it can handle twice as many simultaneous spatial streams. It also supports a so-called 50% sparsity feature, which essentially doubles usable performance by eliminating some of the coefficients (an optimization also used by Nvidia since the Ampere generation of GPUs). Support for more complex functions like tanh and exp within the NPU has also been improved over the previous generation NPUs in the Ryzen 7040 and 8040.
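To illustrate the sparsity idea, here is a toy sketch of the Nvidia-style 2:4 structured sparsity the article alludes to (AMD has not published the exact scheme XDNA 2 uses, so treat this as the general principle, not the actual implementation): in every group of four weights, only the two with the largest magnitude are kept, so the hardware only has to perform half the multiplies.

```python
# Toy 2:4 structured sparsity: keep the 2 largest-magnitude weights in
# every group of 4, zero the rest. This halves the multiplies a sparse-
# aware MAC array must perform, which is where the ~2x speedup comes from.
# (Illustrative only; AMD hasn't detailed XDNA 2's exact sparsity format.)
def prune_2_of_4(weights):
    pruned = []
    for i in range(0, len(weights), 4):
        group = weights[i:i + 4]
        # indices of the 2 largest-magnitude weights in this group
        keep = sorted(range(len(group)), key=lambda j: abs(group[j]))[-2:]
        pruned.extend(w if j in keep else 0.0 for j, w in enumerate(group))
    return pruned

w = [0.9, -0.1, 0.05, -0.7, 0.3, 0.2, -0.8, 0.01]
print(prune_2_of_4(w))  # → [0.9, 0.0, 0.0, -0.7, 0.3, 0.0, -0.8, 0.0]
```

In practice the model has to be trained or fine-tuned to tolerate the pruning, which is why the doubled figure is quoted as usable rather than guaranteed performance.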

Support for the Block FP16 data format should potentially be very useful, delivering the same performance as INT8 operations but with the precision of FP16, which would normally have 2× lower performance (i.e. 25–27.5 TFLOPS for the NPU in Strix Point), while Block FP16 performance is 50–55 TFLOPS. According to AMD, using Block FP16 should achieve better results in AI applications without the need for quantization, but with the same performance that INT8 gives in other NPUs.
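The general idea behind block floating-point formats is that a whole block of values shares a single exponent, so each element only needs a small integer mantissa, roughly INT8-sized, while retaining much of FP16’s dynamic range. The sketch below illustrates that principle; the exact layout of AMD’s Block FP16 is not public, so the field widths here are assumptions.

```python
import math

# Illustrative block floating-point quantization: one shared exponent per
# block plus small integer mantissas per element. This shows the general
# principle only; AMD's actual Block FP16 layout is an assumption here.
def quantize_block(values, mantissa_bits=8):
    # shared exponent taken from the largest-magnitude value in the block
    shared_exp = max(math.frexp(v)[1] for v in values)
    scale = 2.0 ** (shared_exp - mantissa_bits)
    mantissas = [round(v / scale) for v in values]  # ~INT8-sized integers
    return shared_exp, mantissas

def dequantize_block(shared_exp, mantissas, mantissa_bits=8):
    scale = 2.0 ** (shared_exp - mantissa_bits)
    return [m * scale for m in mantissas]

exp, ms = quantize_block([0.52, -0.13, 0.25, 0.031])
print(dequantize_block(exp, ms))  # values close to the originals
```

Because the per-element math operates on the small integer mantissas, the MAC array can run at its INT8 rate, which is how Block FP16 reaches INT8-class throughput with near-FP16 precision.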


Compared to the previous generation, the XDNA 2-based NPU should have twice the power efficiency. AMD also states that it is possible to use only part of its blocks and suspend the other parts to conserve energy, which improves power efficiency. This makes it possible to keep running some AI models as background tasks without significantly degrading battery life.

Models

AMD initially announced only two configurations of these processors that will be available in laptops, but later leaks revealed a third SKU, which is now also officially confirmed. The top models in the range are the Ryzen AI 9 HX 370 and the new Ryzen AI 9 HX 375.

These models have a fully enabled configuration with 12 cores, 24 threads and 24 MB of L3 cache. The base clock speed is 2.0 GHz, while the Zen 5 cores are capable of boosting up to 5.1 GHz in single-threaded applications. Single-threaded performance will therefore be lower than in the desktop models, which go up to 5.7 GHz.

The integrated GPU is called Radeon 890M and has all 16 CUs / 1024 shaders enabled, with a clock speed of 2900 MHz. The TDP of the processors is adjustable in the range of 15 to 54 W, thus covering the previously separate U (15 W) and H (45 W) segments and the P (28 W) segment used by Intel. The default TDP is 28 W, but each laptop manufacturer will be able to choose what power target to use with the processor.

There’s no difference in CPU or GPU configuration between the two models, so the newly added Ryzen AI 9 HX 375 isn’t strictly speaking a higher-end model in that regard. What does differ is the performance of the NPU. While the other SKUs have an NPU with 50 TOPS of performance, the Ryzen AI 9 HX 375 has the NPU’s clock speed increased, giving it up to 55 TOPS. AI applications accelerated on this unit will thus process requests a bit faster. In terms of raw performance, this NPU is the fastest of the Copilot+ PC processors so far – Qualcomm offers 45 TOPS, and Intel’s Lunar Lake processors offer from 40 to 48 TOPS depending on the model.

Three AMD Ryzen AI 300 processor models

The cheapest and final SKU for now is called the Ryzen AI 9 365. It has only 10 cores and 20 threads, with two of the dense Zen 5c cores cut while the big cores retain their full count, making the configuration 4+6. The base clock speed of the CPU cores is 2.0 GHz, with a maximum boost of 5.0 GHz.

The cache is kept at the full 24MB, but the GPU (labelled Radeon 880M) is cut down to just 12 CUs, so it has 768 shaders. However, the clock speed is the same 2900 MHz. This processor also has a default TDP of 28 W, but with an option to adjust from 15 to 54 W. NPU performance is also 50 TOPS for this model.

Information has also leaked on the internet that an even more stripped-down (and cheaper) Ryzen AI 7 360 SKU with eight cores could also appear, possibly with a 3+5 configuration. This SKU is not yet confirmed. It may only become available later, if it exists.

Market launch on July 28

These processors, and with them the laptops, were officially released on Sunday, complete with reviews. That means they are available on the market before the desktop version. The desktop Ryzen 9000 processors have suffered a slight delay and will come out on the 8th and 15th of August – first the Ryzen 5 and Ryzen 7 SKUs, then the Ryzen 9 a week later.

Tip: Ryzen 9000 delayed due to last-minute problems

Source: AMD, ComputerBase

English translation and edit by Jozef Dudáš

