Intel unveiled the Core Ultra 300 “Panther Lake” processors last week, which we have covered here. What we left aside is that Panther Lake also debuts a new GPU architecture, Xe3. Aside from the move to a 1.8 nm node, this is arguably the biggest change Panther Lake introduces, as the CPU cores appear mostly unchanged. Xe3 is Intel’s most advanced GPU architecture to date, and in this article we take a closer look at what it brings.
New Arc GPU architecture: The coming of Celestial?
Panther Lake is the first product to feature the third generation of Intel GPU architectures, counting from the point when the Arc GPUs launched and offered something that could finally be considered truly “gaming” graphics (as opposed to the many older, less advanced architectures the company produced previously). The first (Xe1) architecture was introduced by the discrete Arc “Alchemist” graphics cards (or “A-Series”), and it is also used in Meteor Lake and Arrow Lake processors. The second (Xe2) arrived in Lunar Lake processors and the discrete Battlemage graphics cards (Arc “B-Series”, i.e., the B570 and B580 cards).
Panther Lake processors integrate the Xe3 architecture, which constitutes a full-fledged new generation according to Intel. However, the company does not yet designate it as the Celestial generation (C-Series). Intel stated that the Celestial name will be used for a future enhanced version of Xe3, which carries the internal designation Xe3P. Despite Xe3 likely deserving to be considered a new architecture, the GPUs in Panther Lake processors will formally be classified under the previous Battlemage generation.
The reasons are probably not technical. One can assume that Intel wants to launch Celestial with greater fanfare only when it has some more powerful GPUs of this architecture available. Or it could be that the company still plans to release further discrete GPUs that belong to the Xe2 / Battlemage family—namely, the Arc B770 cards with the BMG-G31 chip. It wouldn’t serve the marketing of such cards if Intel released them under the Battlemage brand but already had integrated GPUs from the Celestial family alongside them, which would be perceived as a generation-newer product.

Intel presented its GPU architecture roadmap during the Panther Lake unveiling, which officially promises the Xe3P architecture. According to it, Xe3P is slated to come out as the “next generation of Arc,” which doesn’t explicitly say it will be labeled Celestial, but after “B-Series,” the next in line is “C-Series,” which mostly rules out anything else. The roadmap, however, does not say when Celestial/Xe3P GPUs will come to market—it could be next year or the year after, we don’t know. And Intel has not yet publicly confirmed that there will even be any discrete GPUs in the Celestial generation. In theory, the company could produce only integrated Celestial GPUs, with discrete Arc graphics cards being canceled.
Perhaps we should add that this terminological charade around Arc GPUs is not some desperate last-minute maneuvering. It seems this was planned at least a year in advance, if not longer. Already last year, leaker Kepler_L2 pointed out that discrete Celestial GPUs would not use the Xe3 architecture and that it would only appear in the integrated GPUs of Panther Lake processors. This may even have fueled the various rumors circulating on the internet that discrete Celestial GPUs were canceled, when in reality they would just use an architecture one generation newer. Back then we thought it would be Xe4; however, this old information was evidently referring to what has now been officially shown in the roadmap as the Xe3P architecture.
Xe3 architecture analysis
After this introductory digression, let’s move directly to what Xe3 brings to the table now (or will bring in a few months, when the processors are released). The overall scheme looks similar to Xe2; the counts and ratios of units in the basic building block, the Xe Core, haven’t changed. But Intel has improved their capabilities.
Each Xe Core, as the basic unit, contains eight XVE (Xe Vector Engine) vector units, which house the actual shaders for graphics and general-purpose computing. One XVE unit processes 16 values at once (these are SIMD16 units, thus narrower than those used by Nvidia and AMD GPUs, that have SIMD32 units as standard today). So, in total, one Xe Core provides 128 shaders. The XVE units further include XMX (Xe Matrix Extensions) units for accelerating artificial intelligence matrix ops. There is always one XMX per XVE unit, so 8 XMX per Xe Core.
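As a quick sanity check on the unit math above, a minimal Python sketch (using only the ratios quoted in this article) totals the shaders and XMX units per Xe Core:

```python
# Unit ratios per Xe Core as described above (shared by Xe2 and Xe3).
XVE_PER_XE_CORE = 8   # Xe Vector Engines per Xe Core
SIMD_WIDTH = 16       # each XVE processes 16 values at once (SIMD16)
XMX_PER_XVE = 1       # one matrix unit per vector unit

shaders_per_core = XVE_PER_XE_CORE * SIMD_WIDTH
xmx_per_core = XVE_PER_XE_CORE * XMX_PER_XVE

print(shaders_per_core)  # 128 shaders (FP32 lanes) per Xe Core
print(xmx_per_core)      # 8 XMX units per Xe Core
```

For comparison, an Nvidia or AMD compute unit built from SIMD32 units would reach the same 128 lanes with only four vector units.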

Each Xe Core is associated with one set of texturing units with 8 TMUs (called a Sampler) and one RTU (Ray Tracing Unit) for ray tracing graphics. The ROP units are grouped into so-called Pixel Backends, where each should apparently contain 8 ROPs.
Better utilization of shaders
While the unit counts (and thus the theoretical performance per shader at a given clock speed) don’t change, the architecture should extract better real-world performance from the same theoretical FLOPS.

The improvements in Xe3 over Xe2 in the general-purpose compute units (shaders within the XVE) lie in the capability for variable (i.e., more flexible) register allocation. This should improve the utilization (“occupancy”) of compute resources over time, in other words, get more real performance out of them, as the units don’t have to “spill” registers to memory as often. According to graphs shown by Intel, variable register allocation is one of the main sources of improvement in the Xe3 architecture. It seems Intel took a similar direction as the RDNA 4 architecture from AMD, where dynamic register allocation was a new feature.

Running tasks will also be able to utilize a larger cache—the L1 Cache and Shared Local Memory (SLM), which are always shared per Xe Core (i.e., for 8 XVE and 8 XMX), have grown by 33% compared to the Xe2 architecture.
The compute unit should be capable of maintaining more threads running simultaneously—up to 25% more. GPUs generally need to have more active threads to keep the units busy, for example, when data is missing due to waiting for memory reads, so this should help GPU utilization.
AI acceleration
The XMX units don’t seem to have major changes. They support operations with the TF32 32-bit data type, FP16 and BFloat16 16-bit data types, and INT8, INT4, and INT2 integer values. The raw performance seems unchanged—one Xe Core (8 XMX units) can perform 1024 computations on 32-bit, 2048 computations on 16-bit, and 4096 computations on 8-bit data types. The performance with INT4 and INT2 is 8192 operations per cycle (thus, 4-bit data types still double the performance, but using the 2-bit INT2 data type does not, its effect is only saving memory capacity occupied by the AI model).
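The per-cycle throughput figures above follow a simple pattern (each halving of precision doubles throughput, except the final INT4-to-INT2 step); a short Python sketch restates the numbers quoted for one Xe Core (8 XMX units) and checks that pattern:

```python
# Matrix ops per cycle per Xe Core (8 XMX units), figures quoted above.
ops_per_cycle = {
    "TF32": 1024,
    "FP16/BF16": 2048,
    "INT8": 4096,
    "INT4": 8192,
    "INT2": 8192,  # no further doubling; INT2 only saves model memory
}

# Each step down in precision doubles throughput, until INT2.
assert ops_per_cycle["FP16/BF16"] == 2 * ops_per_cycle["TF32"]
assert ops_per_cycle["INT8"] == 2 * ops_per_cycle["FP16/BF16"]
assert ops_per_cycle["INT4"] == 2 * ops_per_cycle["INT8"]
assert ops_per_cycle["INT2"] == ops_per_cycle["INT4"]
print("throughput pattern holds")
```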

It seems that support for FP8 computations is still missing on the XMX units, which is something AMD introduced in the RDNA 4 architecture of the Radeon RX 9000 graphics cards and a feature already used for the FSR4 AI upscaling technology. Nvidia supports FP8 on GeForce RTX 4000 / Ada Lovelace architecture and GeForce RTX 5000 / Blackwell architecture. Intel states that Xe3 supports FP8 dequantization, but the actual matrix units probably do not support the data type yet. However, INT8 ops can be used, which will have the same performance, though integer values have a certain disadvantage in the output quality the AI produces because they lack the dynamic range of floating-point numbers.
Ray tracing
The RTU units have been improved, despite their basic raw performance remaining the same. They have three pipelines for analyzing ray-BVH intersections, which together can handle up to 18 intersections per cycle (a very high number), and two pipelines for processing triangle intersections, which should probably be able to process two triangles per cycle.
Intel states that the Xe3 architecture should achieve 2x more triangle intersections in some kind of microbenchmark, but it’s not clear what is meant by this. It could mean that each pipeline can now handle two triangles per cycle (so the RTU could perform four as a whole), or it could just be referring to results of measurement in some real workload, showing that the actual achieved performance improves, even though the theoretical capacity of the units is still the same 2 ray/triangle intersections per cycle.

Intel has added the capability for dynamic management of rays for use in asynchronous ray tracing processing. So yet again the improvements lie in enabling the GPU to extract more practical performance out of the same theoretical performance.
Other improvements
Performance of various fixed function parts of the graphics pipeline has also been improved. Xe3 has up to 2x better performance in anisotropic texture filtering and up to 2x better performance for stencil test operations. These are likely areas where analysis showed they were limiting overall gaming performance in previous architectures.
Intel has also added a new URB (Unified Return Buffer) manager to the architecture. URB is a block that processes the results of graphics operations and passes them to other parts of the GPU. The improved design allows the data in this buffer to be updated partially without requiring the buffer to be flushed of previous data before every context switch. This was a potential performance limit in the communication between different parts of the GPU, which should now be eliminated or mitigated.

Various software-side improvements are also coming. Intel, for example, is introducing the ability for its drivers to download ready-made pre-compiled game shaders from the cloud (so this is only available if you are online), instead of the game having to compile them locally on the computer. This will speed up game loading on first launch or after driver and game updates.

XeSS 3 with multi-frame generation
Intel also announced a new generation of its upscaling and frame generation technology—XeSS 3—alongside Panther Lake and Xe3 GPUs. It can now generate multiple frames, following the example of Nvidia’s DLSS 4. Intel confirmed XeSS 3 will be available in all games that already support single-frame generation via the previous XeSS 2 technology. In such games, the driver will be able to automatically update the game to use XeSS 3 as well. We wrote about this technology here:
Base version of the GPU: 4 Xe Cores
As we already wrote in the previous article, Intel has designed two different integrated GPUs with the Xe3 architecture for Panther Lake processors. In cheaper processors and those intended to be paired with a discrete GPU, a chiplet (tile) manufactured on Intel’s 3nm process with 4 Xe Cores will be used.
This variant is composed of two so-called Render Slices, where each contains two Xe Cores. So in total, this GPU has 512 shaders, 32 XMX for AI acceleration, and 4 RTUs for ray tracing. There are 32 texturing units (four Samplers) and 16 ROP units (two Pixel Backends), with one geometry pipeline. This GPU has 4MB of L2 cache for its needs.

According to some estimates, this GPU might also be used next year in most Nova Lake processors (including desktop ones). Again, these would be the processor models that don’t need a more powerful GPU, so “recycling” this tile for them makes sense.
Performance version of the GPU: 12 Xe Cores
The more powerful version of the integrated GPUs, which will be interesting for gaming laptops and handhelds, is also composed of two Render Slices, but each contains 6 cores, for a total of 12 Xe Cores. This gives the GPU 1536 shaders (96 XVE), 96 XMX, and 12 RTUs. There are 96 texturing units (12 Samplers). Intel also gave this GPU a pair of geometry pipelines and four Pixel Backends, which should mean 32 ROP units.
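The totals for both Panther Lake GPU variants can be derived from the slice and core counts above; a small Python sketch (using only the per-unit ratios given in this article) computes them:

```python
def xe3_gpu_totals(render_slices, cores_per_slice, pixel_backends):
    """Derive unit totals from the configuration, per the ratios above."""
    cores = render_slices * cores_per_slice
    return {
        "xe_cores": cores,
        "shaders": cores * 8 * 16,  # 8 XVE per core, SIMD16 each
        "xmx": cores * 8,           # 1 XMX per XVE
        "rtu": cores,               # 1 ray tracing unit per Xe Core
        "tmu": cores * 8,           # one 8-TMU Sampler per Xe Core
        "rop": pixel_backends * 8,  # 8 ROPs per Pixel Backend
    }

small = xe3_gpu_totals(2, 2, 2)  # 4 Xe Core tile (Intel 3nm process)
large = xe3_gpu_totals(2, 6, 4)  # 12 Xe Core tile (TSMC N3E)
print(small)  # 512 shaders, 32 XMX, 4 RTUs, 32 TMUs, 16 ROPs
print(large)  # 1536 shaders, 96 XMX, 12 RTUs, 96 TMUs, 32 ROPs
```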

This GPU version uses TSMC’s 3nm process (N3E), which is likely better than the Intel 3nm technology used in the cheaper version (it will achieve better power efficiency and clocks).
The GPU with 12 Xe Cores has a fairly large 16MB L2 cache, which is probably the largest capacity seen in an integrated GPU so far (with the exception of AMD’s Strix Halo). This should be one of the potential trump cards. Powerful integrated GPUs often hit a wall with system memory bandwidth, which they use instead of dedicated VRAM and which prevents them from achieving higher performance. A larger cache capacity can mitigate this deficit by covering accesses to the most frequently used data.
Incidentally, Lunar Lake had an 8 MB L2 cache for its GPU with 1024 shaders (8 Xe Cores), so Panther Lake strengthens the cache beyond simple proportional scaling with the number of compute units (scaling Lunar Lake’s 8 MB up to 12 Xe Cores would give only 12 MB, not 16 MB).
L2 cache against memory bandwidth bottleneck
A reminder: the large Panther Lake GPU uses LPDDR5X-9600 memory, which provides a bandwidth of 153.6 GB/s. According to Intel’s testing, the 16MB L2 cache of this GPU can reduce the demand on main memory bandwidth by up to 36% compared to a hypothetical alternative with just 8MB of cache.
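The 153.6 GB/s figure is consistent with a 128-bit memory bus (the bus width is our assumption here, not stated above); a one-line check in Python:

```python
# LPDDR5X-9600: 9600 million transfers per second.
# Assuming a 128-bit bus, each transfer moves 16 bytes.
transfers_per_s = 9600e6
bus_bytes = 128 // 8
bandwidth_gb_s = transfers_per_s * bus_bytes / 1e9
print(bandwidth_gb_s)  # 153.6
```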

This number is from the Black Myth: Wukong game; in Cyberpunk 2077 with ray tracing, the reduction in bandwidth demands is 19%, or 27% when both ray tracing and XeSS upscaling are enabled simultaneously. In 3DMark Steel Nomad, the benefit is a bit lower, a 17% reduction in consumed bandwidth.
Clock speeds could see an increase
The clock speeds that the Xe3 architecture will achieve are not yet clear. Intel mentions that Panther Lake—probably the variant with 12 Xe Cores—should deliver up to 120 TOPS of AI performance (likely a figure for INT8 precision computations) on the XMX units. Intel quoted 67 TOPS for Lunar Lake with its Xe2 GPU of 1024 shaders at a clock of 2.05 GHz. For the Panther Lake GPU with 1536 shaders, 120 TOPS should correspond to a clock speed of approx. 2.45 GHz.
This might indicate that Intel managed to increase clock speeds (which would be good, because previous architectures lagged behind the competition in performance per unit of silicon area, and higher clocks can improve this)—unless we have an error in our estimation somewhere. We can’t rule out the explanation that this is just an effect of the particular combination of TDP and clock speed that Intel selected to write down in the specifications. The Xe2 architecture is also capable of higher clock speeds than in Lunar Lake when implemented in a discrete GPU (the gaming clock of the Arc B580 graphics card is around 2.85 GHz).
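Our clock estimate can be reproduced from the XMX throughput figures quoted earlier; a short Python sketch (assuming both the 67 TOPS and 120 TOPS figures refer to INT8 operations, as hedged above):

```python
# INT8 matrix ops per cycle per Xe Core (8 XMX units), per the figures above.
INT8_OPS_PER_CORE = 4096

def xmx_int8_tops(xe_cores, clock_ghz):
    """Theoretical INT8 TOPS = cores * ops/cycle * clock."""
    return xe_cores * INT8_OPS_PER_CORE * clock_ghz * 1e9 / 1e12

# Lunar Lake: 8 Xe Cores at 2.05 GHz, close to Intel's quoted 67 TOPS.
print(round(xmx_int8_tops(8, 2.05), 1))  # 67.2

# Invert the formula for Panther Lake: 12 Xe Cores quoted at 120 TOPS.
clock = 120e12 / (12 * INT8_OPS_PER_CORE * 1e9)
print(round(clock, 2))  # 2.44 GHz, matching the ~2.45 GHz estimate above
```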

Highest performance among (mainstream) iGPUs?
Intel so far only shows a very basic indication of performance. In the following graph, you can see a comparison of curves showing performance at a given power consumption for Xe1 / Alchemist (Arrow Lake-H processor with 1024 shaders in the GPU) integrated GPU, Xe2 / Battlemage (Lunar Lake processor with 1024 shaders in the GPU) integrated GPU, and the Xe3 integrated GPUs in Panther Lake, which should be the GPU with 1536 shaders.
Xe3 should achieve significantly higher absolute performance than both older generations. While it’s visible that the maximum power consumption is higher than that of Arrow Lake-H, not to mention the efficient Lunar Lake, the performance-per-watt ratio is better (so performance should increase by a larger percentage). According to Intel, this should hold true across the entire curve, meaning for every specific power setting.

For example, the maximum performance of the GPU in Arrow Lake-H can be achieved with the Xe3 GPU (when underclocking) with a power consumption 40% lower. However, this is with a better process: the Arrow Lake-H GPU chiplets are manufactured on TSMC’s 5nm process, while the Panther Lake GPU chiplet with 12 Xe Cores uses TSMC’s 3nm process (in the improved N3E version).
The highest absolute performance of the GPU in Panther Lake is then supposed to be more than 50% higher than that of the GPU in Lunar Lake at the highest point on the curve. This means that the version with 12 Xe Cores, for which these figures are given, should also comfortably beat the integrated GPU in AMD’s Ryzen AI 300 “Strix Point”, which has only 1024 shaders (of the RDNA 3.5 architecture). Lunar Lake is already very close to Strix Point (if the drivers work well, it even tends to be slightly faster), so the 12 Xe Cores of the Xe3 architecture are almost certain to be substantially ahead of Strix Point in performance comparisons.

The performance likely won’t reach the level of Strix Halo processors (Ryzen AI Max), but those are already a different category both in performance and complexity with their 256-bit memory, which in the PC world is still somewhat experimental, and AMD has a problem convincing laptop manufacturers to use these “Halo” chiplet APUs.
Panther Lake with 12 Xe Cores could still be a relatively mainstream processor that could make it into a larger number of laptops and is more comparable to regular Ryzen AI 300 “Strix Point” APUs. In this category of x86 processors, Panther Lake will likely be the processor with the most powerful integrated GPU when it is released at the beginning of 2026. AMD will have a chance to respond only with the release of the next generation of laptop processors based on Zen 6—the chiplet Medusa Point APUs. Those, however, will most likely come out no earlier than the second half of the year, perhaps even at the beginning of 2027. And even their graphics performance is not very certain yet, as according to some information, they might not yet use a newer GPU architecture, instead still relying on RDNA 3.5.
Sources: Intel, ComputerBase, Tom’s Hardware
English translation and edit by Jozef Dudáš






