Battlemage: Details of Intel Xe2 GPU architecture [Analysis]

Intel has unveiled its new Arc graphics cards, dubbed Battlemage or “B-Series”, featuring a vastly improved architecture that gives Arc a second chance to gain favour with gamers – though it will have to do this mainly through low prices, as the newly released GPUs only compete in the bracket of cheaper mainstream graphics. We’ve covered the cards themselves in a separate article; now we’ll have a look at the architecture itself.

The Battlemage cards are based on chips manufactured using TSMC’s 5nm process node (the N5 process according to the specifications) with Xe2 architecture. Sometimes the Xe2 HPG designation is also used to distinguish it from the Xe2 LPG version already integrated in Lunar Lake processors. The architecture should deliver noticeably better performance than the original Xe HPG in the Alchemist cards at the same number of compute units.

Xe2 is quite an improvement over the Xe1 architecture used in the first Arc cards. Intel says that in that generation, the focus was on “scaling up” the GPU architecture, which originally came from lower-performance integrated graphics cores, to a larger “width” of compute units and memory for the first time ever. Xe2 is already much more of a “native” standalone GPU architecture, and there was an opportunity to better optimize it for higher-performance discrete graphics cards. Hopefully this also means, for example, reduced power consumption when idle and better compatibility (the A-Series Arcs had a significant issue of not working well without PCIe Resizable BAR support), which we suppose is something to look at when reviews hit.

Xe2 GPU architecture in Intel Arc generation Battlemage graphics cards: Xe2 is claimed to have improved efficiency at many levels of gaming graphics processing

The Xe2 architecture is supposed to incur less software overhead in the drivers, so it will require less CPU performance. Games running on it are also supposed to make better use of the units provided by the chip and distribute the work more efficiently between them, whereas in the previous Alchemist GPUs (Xe HPG architecture) the use of hardware resources was less efficient.

Xe2 GPU architecture in Intel Arc Battlemage graphics cards

The architecture is said to be optimized to improve the latency of individual operations and reduce stalls and under-utilization when processing. According to Intel, performance per GPU “core” (Xe Core) is up to 70% better with Xe2 compared to the Xe1/Alchemist architecture, and power efficiency (performance/power consumption ratio) is up to 50% better.

Render slice of Xe2 Battlemage architecture

The basic building blocks of the architecture are Xe Cores and the so-called Render slice, which contains four Xe Cores. One Xe Core provides 128 shaders in eight XVE vector units – these natively execute operations with SIMD16 width, improving efficiency over the SIMD8 previously used. The Xe Cores have dedicated L1 caches and there are also XMX units attached to them for matrix operations (AI acceleration, these units are basically the equivalent of Nvidia’s tensor cores). One Xe Core has 8 XMX units (with a total width of 2048 bits), so one Render slice contains 32 in total.

On XVE and XMX units, it is possible to perform matrix operations with the FP16, BFloat16, INT8, INT4 and INT2 data types, whereas FP32 and FP64 calculations (although probably with reduced performance) and more complex mathematical operations (Sin, Cos, Log, Exp) are supported only on general-purpose shader units (the XVEs). Within Xe Cores, it should be possible to co-issue XMX unit ops and general-purpose shader ops, and in addition it should be possible to co-issue integer and floating-point operations in XVEs.

Powerful ray tracing accelerators

There is also one RTU for ray tracing acceleration present per Xe Core within each Render Slice (4 per Render Slice). The ray tracing acceleration is improved in the Xe2 architecture compared to the Alchemist generation, where it was already at a good level. Each RTU has a 16 kB cache for BVH elements and three traversal pipelines (versus two in Alchemist), with which it can handle a total of 18 intersections with auxiliary BVH boxes per clock cycle (50% more than in Alchemist) and two intersections with triangles per cycle. For comparison – in AMD RDNA 2 and RDNA 3 it’s four boxes and one triangle per cycle (RDNA 4 should hopefully be able to do double that, but it’s not confirmed yet), and in the Ada Lovelace architecture in Nvidia GeForce graphics cards it’s four boxes and four triangles per cycle. In any case, Battlemage/Xe2 has very generously provisioned ray tracing acceleration, at least on paper.
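The per-cycle box figures above can be checked with a little arithmetic. The following is only a sketch: the value of six box tests per traversal pipeline is our assumption, obtained by dividing the quoted 18 boxes by the three pipelines, and the names used are illustrative.

```python
# Rough check of the ray/box intersection figures quoted in the text.
# ASSUMPTION: each traversal pipeline tests 6 boxes per clock (18 / 3);
# this per-pipeline split is inferred, not an official figure.
BOXES_PER_PIPELINE = 6


def box_tests_per_clock(traversal_pipelines: int) -> int:
    """Box intersections one RTU can test per clock under the assumption above."""
    return traversal_pipelines * BOXES_PER_PIPELINE


xe2_boxes = box_tests_per_clock(3)        # Xe2 / Battlemage: three pipelines
alchemist_boxes = box_tests_per_clock(2)  # Xe HPG / Alchemist: two pipelines
uplift_percent = (xe2_boxes / alchemist_boxes - 1) * 100

if __name__ == "__main__":
    print(f"Xe2 RTU:       {xe2_boxes} box tests per clock")
    print(f"Alchemist RTU: {alchemist_boxes} box tests per clock")
    print(f"Uplift:        {uplift_percent:.0f}%")
```

Under that assumption, the third pipeline alone accounts for the quoted 50% increase in box tests per RTU.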

Intel has also beefed up the geometry engine, which, along with the samplers and rasterizer, sits outside the individual Render slices. In geometry, Battlemage can handle 3× more vertex fetches than Alchemist, and the performance of mesh shaders is 3× higher. The new architecture also has 2× better blending performance and 2× better performance in non-filtered texturing. Texture sampling is done in an out-of-order style.

The Xe2 slice also has a one-third larger pixel color cache and a 50% larger HiZ/Z/Stencil cache. It supports prefetching of render targets and has improved graphics primitive culling within HiZ, which avoids wasting compute cycles on objects that are not visible in a scene. Data compression in the GPU’s L2 cache should also be improved. The Command front end that allocates work to compute units has seen improvements as well, now natively supporting Execute indirect.

Two upcoming GPUs?

So far, two graphics cards have been revealed that are based on the BMG-G21 chip. It contains five “Render Slices”, i.e. 20 Xe Cores, 20 RTUs and 160 XMX units, 20 texture units and 10 ROPs (pixel backends). The GPU has a 192-bit memory bus with GDDR6 memory, and the efficiency of memory operations is supported by an 18 MB L2 cache.

The GPU also includes two independent multimedia engines with support for H.264, H.265 (HEVC), AV1 (including compression), VP9, and XAVC-H (Sony’s professional format) acceleration. Unlike Lunar Lake, VVC format acceleration is not present.

According to unofficial leaks, Intel was previously planning three Xe2-based GPUs. There was supposed to be another lower-tier BMG-G10 chip, which would probably feature a 128-bit (or 96-bit?) memory bus, but this die has reportedly been scrapped with no plans to put it into production – it would have covered a relatively low price spectrum with limited sales and margin opportunities. Above the G21, on the other hand, there was originally meant to be a third, larger BMG-G31 die, which was to provide 32 Xe Cores (8 Render slices) with 4096 shaders and a 256-bit memory bus. So this GPU could be perhaps 50% faster.
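The unit counts quoted for the BMG-G21 and the rumored BMG-G31 follow from the per-slice and per-core figures given earlier. Here is a minimal sketch of that arithmetic; the class and constant names are our own, and the G31 configuration is based on unofficial leaks.

```python
from dataclasses import dataclass

# Per-core and per-slice figures as stated in the article.
XE_CORES_PER_SLICE = 4
XVE_PER_XE_CORE = 8
SIMD_LANES_PER_XVE = 16   # SIMD16, so 8 * 16 = 128 shaders per Xe Core
XMX_PER_XE_CORE = 8
RTU_PER_XE_CORE = 1


@dataclass(frozen=True)
class Xe2Config:
    """Hypothetical helper that derives unit totals from the slice count."""
    name: str
    render_slices: int

    @property
    def xe_cores(self) -> int:
        return self.render_slices * XE_CORES_PER_SLICE

    @property
    def shaders(self) -> int:
        return self.xe_cores * XVE_PER_XE_CORE * SIMD_LANES_PER_XVE

    @property
    def xmx_units(self) -> int:
        return self.xe_cores * XMX_PER_XE_CORE

    @property
    def rtus(self) -> int:
        return self.xe_cores * RTU_PER_XE_CORE


g21 = Xe2Config("BMG-G21", render_slices=5)
g31 = Xe2Config("BMG-G31 (rumored)", render_slices=8)
unit_ratio = g31.xe_cores / g21.xe_cores  # ratio of compute units, not of performance

if __name__ == "__main__":
    for chip in (g21, g31):
        print(f"{chip.name}: {chip.xe_cores} Xe Cores, {chip.shaders} shaders, "
              f"{chip.xmx_units} XMX units, {chip.rtus} RTUs")
    print(f"G31 has {unit_ratio:.1f}x the Xe Cores of G21")
```

The larger die would have 1.6× the compute units. Performance rarely scales linearly with unit count, so this ratio is best read as an upper bound, which fits the estimate of roughly 50% higher performance.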

DisplayPort 2.1 and HDMI 2.1, but without PCIe 5.0 and VVC

The GPU connects to the system via a PCI Express 4.0 ×8 interface, so it uses a narrower interface (for cost savings). Recently there were reports that PCI Express 5.0 might already be supported, but this turned out to not be the case for at least the BMG-G21 and the Arc B580 and B570 cards, as they officially use Gen4 PCIe. This is unlikely to be a problematic limitation for this graphics card in practice. The BMG-G31 is likely to have the full 16 lanes (and we do not know whether it might already have PCIe 5.0 enabled).
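To put the ×8 link in perspective, the theoretical bandwidth can be estimated from the standard PCIe per-lane transfer rates (16 GT/s for Gen4, 32 GT/s for Gen5) and 128b/130b encoding. This is a rough sketch that ignores packet and protocol overhead, so real-world throughput will be lower; the function name is illustrative.

```python
# Theoretical one-direction PCIe link bandwidth, ignoring protocol overhead.
TRANSFER_RATE_GT_S = {"4.0": 16.0, "5.0": 32.0}  # per lane
ENCODING_EFFICIENCY = 128 / 130                  # 128b/130b line coding


def link_bandwidth_gb_s(generation: str, lanes: int) -> float:
    """Approximate usable bandwidth in GB/s for a PCIe generation and lane count."""
    gigabits_per_s = TRANSFER_RATE_GT_S[generation] * lanes * ENCODING_EFFICIENCY
    return gigabits_per_s / 8  # bits to bytes


if __name__ == "__main__":
    for gen, lanes in (("4.0", 8), ("4.0", 16), ("5.0", 8)):
        print(f"PCIe {gen} x{lanes}: {link_bandwidth_gb_s(gen, lanes):.2f} GB/s")
```

By this estimate, the Gen4 ×8 link of the BMG-G21 offers about 15.75 GB/s in each direction, half of what a Gen4 ×16 or Gen5 ×8 link would provide.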

On the other hand, the BMG-G21 chip is capable of the latest DisplayPort 2.1 display output, which is supported at UHBR 13.5 speed (just like on Radeon RX 7700 XT and above, while the cheaper Radeon RX 7600 and 7600 XT that these Intel graphics cards will actually compete with are only capable of the slower DP 2.1 UHBR 10 version). If the more powerful BMG-G31 does arrive, it might be capable of DP 2.1 at the top speed of UHBR 20. The G21 chip can handle up to four outputs – a trio of DP 2.1 and one HDMI 2.1.

Updated: DisplayPort 2.1 UHBR 13.5 is only supported on one output interface, the other two ports only support DisplayPort 2.1 UHBR 10 (as in the Alchemist generation). We have covered the different DP 2.1 levels in a separate article if you need details. In general, it should be noted that even the UHBR 10 speed provides about 50% better bandwidth for image data versus DP 1.4a on older GPUs.
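The roughly 50% figure can be reproduced from the published link rates. The sketch below counts only channel coding overhead (8b/10b for DP 1.4a HBR3, 128b/132b for the UHBR modes) across four lanes; officially quoted payload rates are slightly lower because of additional link-layer overhead, and the names in the code are illustrative.

```python
# Approximate DisplayPort payload bandwidth, counting only channel coding overhead.
# Each entry: (per-lane rate in Gbit/s, coding efficiency)
LINK_MODES = {
    "DP 1.4a HBR3": (8.1, 8 / 10),          # 8b/10b coding
    "DP 2.1 UHBR 10": (10.0, 128 / 132),    # 128b/132b coding
    "DP 2.1 UHBR 13.5": (13.5, 128 / 132),
    "DP 2.1 UHBR 20": (20.0, 128 / 132),
}
LANES = 4


def payload_gbit_s(mode: str) -> float:
    """Payload bandwidth in Gbit/s for a four-lane link in the given mode."""
    lane_rate, efficiency = LINK_MODES[mode]
    return lane_rate * LANES * efficiency


hbr3 = payload_gbit_s("DP 1.4a HBR3")
uhbr10 = payload_gbit_s("DP 2.1 UHBR 10")
gain_percent = (uhbr10 / hbr3 - 1) * 100

if __name__ == "__main__":
    for mode in LINK_MODES:
        print(f"{mode}: {payload_gbit_s(mode):.2f} Gbit/s")
    print(f"UHBR 10 vs HBR3: +{gain_percent:.0f}%")
```

The gain of UHBR 10 over HBR3 comes from both the higher lane rate (10 versus 8.1 Gbit/s) and the more efficient line coding.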

Almost up against RDNA 4 and Blackwell…

Graphics cards based on the BMG-G21 chip are coming out next week on 12/13 (Arc B580) and next month on 1/16 (the cheaper Arc B570). It is possible that further models with this chip will appear in the future, either for the gaming market or for the professional “workstation” segment. It is not yet confirmed when any graphics cards with the BMG-G31 die will come to market (or whether they are coming at all). If that bigger GPU is still in the pipeline, it will probably come out a bit later. However, this means that it will probably have to face the new generation of architectures from AMD (RDNA 4) and Nvidia (Blackwell), which are expected to be released in the first quarter of 2025, possibly as early as January.

This is a weakness that Xe2 and Battlemage suffer from. As with the previous Alchemist generation, Intel is releasing these graphics cards with a considerable delay: they are launching at a point when their intended generational competitors from Nvidia and AMD already have two years on the market behind them and are just about to be replaced by new generations that will raise the bar again…

Source: Intel

English translation and edit by Jozef Dudáš
