Jim Keller’s new firm plans RISC-V CPUs with Apple-like wide cores

Tenstorrent Ascalon: Extremely wide core can take RISC-V architecture to the highest performance segment

RISC-V processors have yet to move beyond the embedded sector, but with the current developments around ARM, they may be closer to that goal than we think. The ISA might even reach the highest-performance processor segment, currently ruled by Intel and AMD, which ARM itself (with the exception of Apple) is still trying to crack. Tenstorrent, led by Jim Keller, is now developing processors that could come close to those from Apple.

A lot has been written about Jim Keller. This engineer and manager, with experience from a large number of influential companies, was one of those who helped bridge AMD from its Bulldozer era to Zen, but then moved on to Tesla and (much to the shock of many) Intel. However, he didn’t stay at Intel for long, and in 2020 he found himself at the startup Tenstorrent. This January, he became its CEO, so it looks like this time he doesn’t plan to stay for only his typical two or three years, but wants to be at the company for longer.

Tenstorrent bounced from AI to CPUs

Tenstorrent started as an AI accelerator startup, but now seems to be focusing on RISC-V processors as well. Given that Keller (along with other key people at the company) has been involved in hugely successful architectures like AMD’s K8 and Zen and Apple’s processors, this could turn out to be interesting.

Tenstorrent has announced that it is working on powerful processors with the RISC-V instruction set that will use its own microarchitecture called Ascalon and compete with x86 and ARM in servers. The Ascalon architecture will be a very wide out-of-order core, promising high IPC. It thus follows a path similar to that of Apple’s architectures, which have been successful in mobile SoCs and the M1/M2 processors.

Jim Keller (Source: Intel)

Tenstorrent actually develops several different cores, apparently based on a common foundation. These cores differ in size, complexity and, of course, performance, which makes them suitable for different purposes. The weakest version has 2-wide instruction decoding, then there are 3-wide, 4-wide, 6-wide (Alastor architecture) and finally there’s the 8-wide Ascalon core with the ability to process (decode) eight instructions per cycle. Although the company is mainly talking about server applications, the Ascalon core could also be usable in HPC and laptop processors, according to the slides.

Tenstorrent RISC-V cores (source: Tom’s Hardware)

Ascalon has, as mentioned, 8 instruction decoders, so it can process eight RISC-V instructions per clock, which matches the decode capability of Apple’s current large cores. The execution backend will also be just as wide as Apple’s: Ascalon has six integer ALUs and two branch execution units. The load/store part, which handles memory reads and writes, is slightly weaker, with three pipelines compared to Apple’s four. It can thus handle a combination of three reads or writes per clock; the exact split between load and store units is unknown.

The core then has two FPU pipelines for floating-point computations, which also serve as vector (SIMD) units. The core has a 64-bit RV64ACDHFMV instruction set architecture – so it supports vector instruction set extensions, which were long delayed in the RISC-V architecture camp, as well as virtualization.
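To make the ISA string easier to read, here is a minimal Python sketch that splits it into the register width and the individual extension letters. The letter meanings are the standard single-letter RISC-V extension names; the function name `parse_isa_string` and the lookup table are purely illustrative, and the sketch only handles single-letter extensions exactly as they appear in the string quoted above.

```python
import re

# Meanings of the standard single-letter RISC-V extension names.
EXTENSION_NAMES = {
    "M": "integer multiplication and division",
    "A": "atomic memory operations",
    "F": "single-precision floating point",
    "D": "double-precision floating point",
    "C": "compressed 16-bit instruction encodings",
    "V": "vector (SIMD) operations",
    "H": "hypervisor support for virtualization",
}


def parse_isa_string(isa):
    """Split e.g. 'RV64ACDHFMV' into (register width, list of extension letters)."""
    match = re.fullmatch(r"RV(\d+)([A-Z]+)", isa)
    if match is None:
        raise ValueError(f"unrecognized ISA string: {isa!r}")
    return int(match.group(1)), list(match.group(2))


# The ISA string as quoted in the article.
xlen, letters = parse_isa_string("RV64ACDHFMV")
print(f"{xlen}-bit base ISA")
for letter in letters:
    print(f"  {letter}: {EXTENSION_NAMES.get(letter, 'not in this table')}")
```

The two letters the article singles out are V (vector instructions) and H (the hypervisor extension used for virtualization).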

The two SIMD units of the Ascalon core have a width of 256 bits (like x86 AVX/AVX2), so in theory they can equal the compute throughput of the four 128-bit units of Apple’s cores in fully optimized code, but Apple’s solution is more flexible and will probably have an advantage in practice. In any case, SIMD performance still falls short of what the best current x86 architectures offer.
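The arithmetic behind this comparison can be sketched in a few lines of Python. The unit counts and widths come from the paragraph above; the assumption that every unit can start one operation per cycle is a simplification for illustration, not a disclosed property of either core.

```python
def peak_simd_bits_per_cycle(units, width_bits):
    """Peak number of vector bits processed per cycle across all units."""
    return units * width_bits


def lanes_per_cycle(units, width_bits, element_bits):
    """Peak number of elements (e.g. FP32 values) processed per cycle."""
    return units * (width_bits // element_bits)


# Figures from the article: 2 x 256-bit (Ascalon) vs. 4 x 128-bit (Apple).
ascalon_peak = peak_simd_bits_per_cycle(units=2, width_bits=256)
apple_peak = peak_simd_bits_per_cycle(units=4, width_bits=128)
print(ascalon_peak, apple_peak)

ascalon_fp32 = lanes_per_cycle(2, 256, 32)
apple_fp32 = lanes_per_cycle(4, 128, 32)
print(ascalon_fp32, apple_fp32)

# Assumption: one operation per unit per cycle. With narrow or scalar
# operations that cannot fill a 256-bit register, the number of units
# becomes the limit rather than the total width.
ascalon_narrow_ops = 2
apple_narrow_ops = 4
print(ascalon_narrow_ops, apple_narrow_ops)
```

Peak throughput is identical when code fills the full vector width, while four narrower units can start more independent operations per cycle, which is the flexibility advantage mentioned above.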

The core is said to have advanced TAGE-type branch predictors (a must for powerful CPUs), but we’ll see whether the company can come close to the leaders, who have decades of experience in this area. We don’t know all the cache parameters yet, but the L1 data cache will apparently have a very large capacity (128 KB, 8-way associative), again similar to Apple’s cores. Fetch from the instruction cache is supposed to be 32 bytes per cycle, and the processor will of course support various kinds of prefetching.
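A quick back-of-the-envelope calculation shows how these numbers fit together. The fetch width, cache capacity and associativity are taken from the paragraph above, and the 4-byte and 2-byte instruction sizes are the standard and compressed RISC-V encodings. The 64-byte cache line is an assumption made only for this sketch, since the line size has not been disclosed.

```python
FETCH_BYTES_PER_CYCLE = 32      # from the article
STANDARD_INSN_BYTES = 4         # standard RISC-V instruction encoding
COMPRESSED_INSN_BYTES = 2       # compressed (C extension) encoding

L1D_BYTES = 128 * 1024          # 128 KB, from the article
L1D_WAYS = 8                    # 8-way associative, from the article
ASSUMED_LINE_BYTES = 64         # ASSUMPTION: line size was not disclosed

# How many instructions fit into one 32-byte fetch block.
standard_per_fetch = FETCH_BYTES_PER_CYCLE // STANDARD_INSN_BYTES
compressed_per_fetch = FETCH_BYTES_PER_CYCLE // COMPRESSED_INSN_BYTES

# Capacity of a single way, and number of sets under the assumed line size.
bytes_per_way = L1D_BYTES // L1D_WAYS
assumed_sets = L1D_BYTES // (L1D_WAYS * ASSUMED_LINE_BYTES)

print(standard_per_fetch, compressed_per_fetch)
print(bytes_per_way, assumed_sets)
```

With 4-byte instructions, a 32-byte fetch delivers exactly the eight instructions per cycle that the decoders can consume; compressed code would fit up to sixteen into the same block.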

Tenstorrent’s Ascalon CPU architecture schematic (source: Tom’s Hardware)

According to Tenstorrent, load/store units will have deep queues, but values were not disclosed. We also don’t know how deep the reorder buffer will be, but for an architecture with such an ambitious “width” we can probably expect it to be very deep as well (Apple’s processors are at around 600 instructions), which would then allow for very high IPC. Judging by the L1 cache capacity and the core width, Tenstorrent seems to intend to follow the same recipe as Apple. In contrast, there is no mention of SMT capability anywhere.

The absolute performance of this core will depend on what clock speeds can be achieved, as IPC alone is not enough. High clocks may not work out in the first generation, but the next generations can then gradually increase the clock speeds. Tenstorrent seems to have ambitions to really compete with the most powerful microarchitectures in the current processor market. In servers, even relatively lower clock speeds (around 3 GHz) might just be enough, as there is not really a need for high single-threaded boosts.
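The trade-off between IPC and clock speed can be illustrated with a simple first-order model in which single-thread performance is the product of the two. In the sketch below, only the 3 GHz figure comes from the text; the 5 GHz baseline clock and the relative IPC of 1.0 are hypothetical values chosen for the example, and the model ignores memory latency and other effects that do not scale with clock speed.

```python
def relative_performance(relative_ipc, clock_ghz):
    """First-order model: performance scales with IPC times clock speed."""
    return relative_ipc * clock_ghz


# HYPOTHETICAL baseline: a narrower core with relative IPC 1.0 at 5.0 GHz.
baseline = relative_performance(relative_ipc=1.0, clock_ghz=5.0)

# A wide core running at around 3 GHz (the server clock mentioned above).
wide_core_clock_ghz = 3.0

# IPC the wide core would need in order to match the baseline.
required_ipc = baseline / wide_core_clock_ghz
print(f"required relative IPC: {required_ipc:.2f}")

matched = relative_performance(required_ipc, wide_core_clock_ghz)
print(f"baseline {baseline:.1f} vs. wide core {matched:.1f}")
```

Under these made-up numbers, the wide core would need roughly two thirds more IPC to make up for the lower clock, which is why clock speed matters alongside width.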

Lots of veterans on board, now including Raja Koduri

With startups, there is always a good chance that big plans will not come to fruition, but Tenstorrent seems to have a pretty good foundation: it has been active for a while, and besides Jim Keller, its other executives are highly experienced too. Ascalon’s chief architect is Wei-Han Lien, who started at NexGen, the x86 manufacturer that AMD bought, and then worked at AMD itself (the K6 architecture came from NexGen’s team after AMD acquired the company). He then went with Keller through PA-Semi and Apple, where he worked on the A6, A7 and perhaps even the M1 chips.

Another familiar name has just joined Tenstorrent: Raja Koduri, who left Intel not long ago. Based on his statements, he is planning his own startup using AI for computer game graphics, but he has now also become a member of Tenstorrent’s board. However, this is not a position in which he works directly on products, nor is it a full-time job.


Tenstorrent’s cores should be licensable, so at least the smaller versions could be used by other companies in embedded and client processors. There they will compete with IP cores from SiFive and the many other companies that develop RISC-V microarchitecture IP. Tenstorrent might even become the leading player in the RISC-V IP market, but that remains to be seen.

Sources: Tom’s Hardware (1, 2)

English translation and edit by Jozef Dudáš

