Jim Keller’s new firm plans RISC-V CPUs with Apple-like wide cores

Tenstorrent Ascalon: Extremely wide core can take RISC-V architecture to the highest performance segment

RISC-V processors have yet to move beyond the embedded sector, but with the current developments around ARM, they may be closer to that goal than we think. The ISA might even reach the highest-performance processor segment, currently ruled by Intel and AMD, which ARM itself (with the exception of Apple) is still just trying to crack. Tenstorrent, led by Jim Keller, is now developing processors that could come close to those from Apple.

A lot has been written about Jim Keller. This engineer and manager, with experience at a large number of influential companies, was one of those who helped take AMD from its Bulldozer era to Zen, but then moved on to Tesla and (much to the shock of many) Intel. However, he didn’t stay at Intel for long, and in 2020 he landed at the startup Tenstorrent. This January, he became its CEO, so it looks like this time he doesn’t plan to stay for only his typical two or three years, but wants to be at the company for longer.

Tenstorrent bounced from AI to CPUs

Tenstorrent started as an AI accelerator startup, but now seems to be focusing on RISC-V processors as well. And given that Keller (but also other important players) has been involved in hugely successful architectures like AMD’s K8, Zen and Apple’s processors, this could turn out to be interesting.

Tenstorrent has announced that it is working on powerful processors with the RISC-V instruction set that will use the company’s own microarchitecture, called Ascalon, and compete with x86 and ARM in servers. The Ascalon architecture will be a very wide out-of-order core, promising high IPC. It thus follows a path similar to that of Apple’s architectures, which have been successful in mobile SoCs and in the M1/M2 processors.

Jim Keller (Source: Intel)

Tenstorrent actually develops several different cores, apparently based on a common foundation. These cores differ in size, complexity and, of course, performance, which makes them suitable for different purposes. The weakest version has 2-wide instruction decoding, then there are 3-wide, 4-wide, 6-wide (Alastor architecture) and finally there’s the 8-wide Ascalon core with the ability to process (decode) eight instructions per cycle. Although the company is mainly talking about server applications, the Ascalon core could also be usable in HPC and laptop processors, according to the slides.

Tenstorrent RISC-V cores (source: Tom’s Hardware)

Ascalon has, as mentioned, 8 instruction decoders, so it can process eight RISC-V instructions per clock, which is the same decode capability as in Apple’s current large cores. The execution backend will also be just as wide as Apple’s. Ascalon has six integer ALUs and two branch execution units. The load/store part for memory reads and writes is slightly weaker: there will be three pipelines compared to Apple’s four (so the core can handle a combination of three reads or writes per clock; the exact distribution of load and store units is unknown).

The core then has two FPU pipelines for floating-point computations, which also serve as vector (SIMD) units. The core has a 64-bit RV64ACDHFMV instruction set architecture – so it supports vector instruction set extensions, which were long delayed in the RISC-V architecture camp, as well as virtualization.
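The letters in that ISA string each stand for a standard RISC-V extension. As a rough illustration, here is a minimal Python sketch that splits the string into its parts. The `decode_isa` helper and the lookup table are our own illustration, not anything published by Tenstorrent, and the string is taken exactly as quoted above.

```python
# Standard single-letter RISC-V extension names (as defined in the RISC-V ISA manual).
# The table and helper below are illustrative, not part of any official tool.
EXTENSIONS = {
    "I": "base integer instructions",
    "M": "integer multiply/divide",
    "A": "atomic memory operations",
    "F": "single-precision floating point",
    "D": "double-precision floating point",
    "C": "compressed (16-bit) instructions",
    "H": "hypervisor (virtualization)",
    "V": "vector (SIMD) operations",
}


def decode_isa(isa: str) -> tuple[int, list[str]]:
    """Split a string such as 'RV64ACDHFMV' into (register width, extension names)."""
    isa = isa.upper()
    if not isa.startswith("RV"):
        raise ValueError(f"not a RISC-V ISA string: {isa!r}")
    body = isa[2:]
    digits = "".join(ch for ch in body if ch.isdigit())
    letters = [ch for ch in body if ch.isalpha()]
    names = [EXTENSIONS.get(ch, f"unknown extension {ch!r}") for ch in letters]
    return int(digits), names


xlen, extensions = decode_isa("RV64ACDHFMV")
print(f"{xlen}-bit core with:")
for name in extensions:
    print(" -", name)
```

The two letters that matter most for the server ambitions are V (vector) and H (hypervisor), which correspond to the SIMD and virtualization support mentioned above.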

The two SIMD units of the Ascalon core have a width of 256 bits (like x86 AVX/AVX2), so in theory they can equal the compute throughput of the four 128-bit units of Apple’s cores in fully optimized code, but Apple’s solution is more flexible and will probably have an advantage in practice. In any case, SIMD performance is still not as high as what the best current x86 architectures offer.
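The arithmetic behind that comparison can be sketched in a few lines of Python. This is a deliberately simplified model of our own: it assumes each unit starts one operation per cycle and that an operation wider than a unit can be split across units without loss, which real hardware will not match exactly.

```python
def useful_bits_per_cycle(units: int, unit_width_bits: int, op_width_bits: int) -> int:
    """Bits of SIMD work per cycle in a simplified model.

    Assumption (ours): every unit issues one operation per cycle, and a unit
    cannot use more of its width than the operation actually needs.
    """
    return units * min(unit_width_bits, op_width_bits)


# Unit counts and widths as described in the article.
ASCALON = {"units": 2, "unit_width_bits": 256}
APPLE = {"units": 4, "unit_width_bits": 128}

for op_width in (256, 128, 64):
    a = useful_bits_per_cycle(op_width_bits=op_width, **ASCALON)
    b = useful_bits_per_cycle(op_width_bits=op_width, **APPLE)
    print(f"{op_width:>3}-bit operations: Ascalon {a} bits/cycle, Apple {b} bits/cycle")
```

With full 256-bit operations both layouts reach 512 bits per cycle, but with narrower operations the four smaller units keep more of their width busy, which is one way to read the “more flexible” remark.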

The core is said to have advanced TAGE-type branch predictors (these are a must for powerful CPUs), but we’ll see if the company can come close to the leaders with decades of experience in this. We don’t know all the cache parameters yet, but the L1 data cache will apparently have a very large capacity (128KB, 8-way associativity), again similar to Apple’s cores. Fetch from the instruction cache is supposed to be 32 bytes per cycle, and the processor will of course support various kinds of prefetching.
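The 32-byte fetch figure lines up neatly with the decode width. A quick sanity check in Python, using the standard RISC-V instruction sizes (4 bytes for regular instructions, 2 bytes for those from the compressed C extension); the constant names are our own:

```python
FETCH_BYTES_PER_CYCLE = 32   # instruction fetch width quoted for Ascalon
DECODE_WIDTH = 8             # instructions decoded per cycle
STANDARD_INSN_BYTES = 4      # regular 32-bit RISC-V instruction
COMPRESSED_INSN_BYTES = 2    # 16-bit instruction from the C extension

standard_per_fetch = FETCH_BYTES_PER_CYCLE // STANDARD_INSN_BYTES
compressed_per_fetch = FETCH_BYTES_PER_CYCLE // COMPRESSED_INSN_BYTES

print("standard instructions per fetch:  ", standard_per_fetch)
print("compressed instructions per fetch:", compressed_per_fetch)
print("fetch can feed all decoders:      ", standard_per_fetch >= DECODE_WIDTH)
```

In other words, one fetch delivers exactly enough regular instructions to keep all eight decoders busy, and more than enough when the code uses compressed instructions.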

Tenstorrent’s Ascalon CPU architecture schematic (source: Tom’s Hardware)

According to Tenstorrent, the load/store units will have deep queues, but the values were not disclosed. We also don’t know how deep the reorder buffer will be, but for an architecture with such an ambitious “width” we can probably expect it to be very deep as well (in Apple’s processors it is around 600 instructions), which would then allow for very high IPC. Judging by the L1 cache capacity and the core width, Tenstorrent seems to intend to follow the same recipe as Apple. In contrast, there is no mention of SMT capability anywhere.

The absolute performance of this core will depend on what clock speeds can be achieved, as IPC alone is not enough. High clocks may not work out in the first generation, but the next generations can then gradually increase the clock speeds. Tenstorrent seems to have ambitions to really compete with the most powerful microarchitectures in the current processor market. In servers, even relatively lower clock speeds (around 3 GHz) might just be enough, as there is not really a need for high single-threaded boosts.
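The trade-off between IPC and clock speed is simple multiplication, which a short Python sketch makes concrete. The numbers below are purely hypothetical and chosen only for illustration; they are not measurements or disclosed figures for any real core.

```python
def relative_performance(ipc: float, clock_ghz: float) -> float:
    """Single-thread throughput in billions of instructions per second."""
    return ipc * clock_ghz


def clock_to_match(target_performance: float, ipc: float) -> float:
    """Clock (GHz) a core with the given IPC needs to reach the target throughput."""
    return target_performance / ipc


# HYPOTHETICAL values, normalized so the high-clock core has an IPC of 1.0.
high_clock_core = relative_performance(ipc=1.0, clock_ghz=5.0)
wide_core = relative_performance(ipc=1.5, clock_ghz=3.0)

print("high-clock core:", high_clock_core)
print("wide core:      ", wide_core)
print("clock the wide core needs to match:",
      round(clock_to_match(high_clock_core, ipc=1.5), 2), "GHz")
```

Under these made-up numbers, a core with 50% higher IPC running at 3 GHz still trails a 5 GHz core slightly and would need a bit over 3.3 GHz to catch up, which is why the achievable clock matters as much as the width.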

Lots of veterans on board, now including Raja Koduri

With startups there is always a good chance that the big plans will not come to fruition, but Tenstorrent seems to have a pretty good foundation: it has been active for a while, and besides Jim Keller, the other executives are highly experienced too. Ascalon’s chief architect is Wei-Han Lien, who went through NexGen, the x86 manufacturer that AMD bought, then AMD itself (the K6 architecture came from NexGen’s team when AMD acquired the company), and then went with Keller through PA-Semi and Apple, where he worked on the A6, A7 and perhaps even the M1 chips.

Another name you’ll be familiar with has just joined Tenstorrent: Raja Koduri, who quit Intel not long ago. Based on his statements, he is planning his own startup using AI for computer game graphics, but at the same time he has now also become one of Tenstorrent’s board members. However, this is not a position in which he works directly on products, nor is it a full-time job.


Tenstorrent’s cores should be licensable, so at least the smaller versions could be used by other companies in embedded as well as client processors, where they would compete with IP cores from SiFive and from the many other companies that develop RISC-V microarchitecture IP. Tenstorrent might even become the leading player in the RISC-V IP market, but that remains to be seen.

Sources: Tom’s Hardware (1, 2)

English translation and edit by Jozef Dudáš
