| |
| ▲ | newpavlov 5 hours ago | parent | next [-] | | In some cases the RISC-V ISA spec is definitely the one to blame: 1) https://github.com/llvm/llvm-project/issues/150263 2) https://github.com/llvm/llvm-project/issues/141488 Another example is the hard-coded 4 KiB page size, which effectively kneecaps the ISA when compared against ARM. | | |
| ▲ | weebull 4 hours ago | parent | next [-] | | All of those things are solved with modern extensions. It's like comparing pre-MMX x86 code with modern x86. Misaligned loads and stores are Zicclsm, bit manipulation is Zb[abcs], atomic memory operations are made mandatory in Ziccamoa. All of these extensions are mandatory in the RVA22 and RVA23 profiles and so will be implemented on any up to date RISC-V core. It's definitely worth setting your compiler target appropriately before making comparisons. | | |
| ▲ | cmovq 4 minutes ago | parent | next [-] | | But RISC-V is a _new_ ISA. Why did we start out with the wrong design that now needs a bunch of extensions? | | | |
| ▲ | sidewndr46 25 minutes ago | parent | prev | next [-] | | You're correct but I guess my thoughts are if we're going to wind up with a mess of extensions, why not just use x86-64? | | |
| ▲ | whaleofatw2022 11 minutes ago | parent [-] | | Because the ISA is not encumbered the way other ISAs are legally, and there are use cases where the minimal profile is fine for the sake of embedded whatever vs the cost to implement the extensions |
| |
| ▲ | LeFantome 3 hours ago | parent | prev | next [-] | | Ubuntu being RVA23 is looking smarter and smarter. The RISC-V ecosystem being handicapped by backwards compatibility does not make sense at this point. Every new RISC-V board is going to be RVA23 capable. Now is the time to draw a line in the sand. | |
| ▲ | edflsafoiewq 3 hours ago | parent | prev | next [-] | | What about page size? | | |
| ▲ | ori_b an hour ago | parent [-] | | It's 4k on x86 as well. Doesn't seem to hurt so bad -- at least, not enough to explain the risc-v performance gap. | | |
| |
| ▲ | newpavlov 3 hours ago | parent | prev [-] | | >Misaligned loads and stores are Zicclsm
Nope. See https://github.com/llvm/llvm-project/issues/110454 which was linked in the first issue. The spec authors have managed to make a mess even here. Now they want to introduce yet another (sic!) extension Oilsm... It maaaaaay become part of RVA30, so in the best case scenario it will be decades before we will be able to rely on it widely (especially considering that RVA23 is likely to become heavily entrenched as "the default"). IMO the spec authors should've mandated that the base load/store instructions work only with aligned pointers and introduced misaligned instructions in a separate early extension. (After all, passing a misaligned pointer where your code does not expect it is a correctness issue.) But I would've been fine as well if they had mandated that misaligned pointers should always be accepted. Instead we have to deal with the terrible middle ground.
>atomic memory operations are made mandatory in Ziccamoa
In other words, forget about potential performance advantages of load-link/store-conditional instructions. `compare_exchange` and `compare_exchange_weak` will always compile into the same instructions. And I guess you are fine with the page size part. I know there are huge-page-like proposals, but they do not resolve the fundamental issue. I have other minor performance-related nits, such as the `seed` CSR being allowed to produce poor quality entropy, which means that we have to bring a whole CSPRNG if we want to generate a cryptographic key or nonce on a low-powered micro-controller. By no means do I consider myself a RISC-V expert; if anything, my familiarity with the ISA as a systems language programmer is quite shallow, but the number of accumulated disappointments even from such shallow familiarity has cooled my enthusiasm for RISC-V quite significantly. |
| |
| ▲ | adastra22 5 hours ago | parent | prev [-] | | Also the bit manipulation extension wasn't part of the core. So things like bit rotation are slow for no good reason, if you want portable code. Why? Who knows. | | |
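For a sense of what is lost without Zbb, here is a minimal Python sketch (not RISC-V code; registers are modelled as 64-bit masked integers and every function name is made up for illustration). It builds a rotate-left out of the negate, shift and OR steps a compiler would have to emit for the base ISA, and compares the result with what a single Zbb `rol` computes.

```python
MASK64 = (1 << 64) - 1  # model a 64-bit register


def sll(x, n):
    """Shift left logical; like RV64 `sll`, only the low 6 bits of n count."""
    return (x << (n & 63)) & MASK64


def srl(x, n):
    """Shift right logical; like RV64 `srl`, only the low 6 bits of n count."""
    return (x & MASK64) >> (n & 63)


def rotl_base_isa(x, n):
    """Rotate left using only base-ISA style operations.

    Roughly: neg t0, n ; sll t1, x, n ; srl t2, x, t0 ; or rd, t1, t2
    """
    neg_n = (-n) & MASK64  # neg: the shifter ignores everything above bit 5
    return sll(x, n) | srl(x, neg_n)


def rotl_reference(x, n):
    """What one Zbb `rol` instruction computes, written out directly."""
    x &= MASK64
    n &= 63
    if n == 0:
        return x
    return ((x << n) | (x >> (64 - n))) & MASK64
```

Under these assumptions a variable rotate costs about four base-ISA instructions against one with Zbb, which is the gap the comment is pointing at.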
| ▲ | adgjlsfhk1 4 hours ago | parent | next [-] | | > Also the bit manipulation extension wasn't part of the core. This is primarily because the core is mainly a teaching ISA. One of the best parts about RiscV is that you can teach a freshman level architecture class or a senior level chip building project with an ISA that is actually used. Anything powerful enough to run Linux (other than a manually built-from-source one) will support a profile that bundles all the instructions commonly needed to be fast. | | |
| ▲ | jacquesm 4 hours ago | parent | next [-] | | Bit manipulation instructions are part and parcel of any curriculum that teaches CPU architecture. They are the basic building blocks for many more complex instructions. https://five-embeddev.com/riscv-bitmanip/1.0.0/bitmanip.html I can see quite a few items on that list that imnsho should have been included in the core and for the life of me I can't see the rationale behind leaving them out. Even the most basic 8 bit CPU had various shifts and rolls baked in. | | |
| ▲ | rwmj 4 hours ago | parent | next [-] | | This is the reason behind the profiles like RVA23 which include bitmanip, vector and a large number of other extensions. Real chips coming very soon will all be RVA23. | | | |
| ▲ | kevin_thibedeau 4 hours ago | parent | prev [-] | | 32-bit barrel shifters consume significant area and RISC-V was developed to support resource constrained low cost embedded hardware in a minimal ISA implementation. | | |
| ▲ | adgjlsfhk1 3 hours ago | parent | next [-] | | IIUC this is a lot less true in the modern era. Even with 24nm transistors (the cheapest transistor last time I checked), modern microcontrollers have a fairly big transistor budget for the core (since 80+% of the transistors are going to sram anyway). | |
| ▲ | pezezin an hour ago | parent | prev | next [-] | | The 32-bit ARM architecture included a barrel shifter as part of its basic design, as in every instruction had a shift field. If a CPU built in 1985 with a grand total of 26 000 transistors could afford it, I am pretty sure that anything built in this century could afford it too. | | |
| ▲ | snvzz an hour ago | parent [-] | | 26k is a lot of transistors for an embedded MCU. You'd be excluding many small CPUs which exist within other chips running very specialized code. As profiles mandate these instructions anyway, there's no good reason to complicate the most basic RISC-V possible. RISC-V is the ISA for everything, from the smallest such CPUs to supercomputers. | | |
| ▲ | wk_end 8 minutes ago | parent [-] | | What MCUs are you thinking of? To the best of my knowledge (and Google-fu), 26K really isn't a lot of transistors for an embedded MCU - at least not a fully-featured 32-bit one comparable to a minimal RISC-V core. An ARM Cortex M0, which is pretty much the smallest thing out there, is around 10K gates => around 40K transistors. This is also around the same size as a minimal RISC-V core AFAICT. The ARM core has a shifter, though. | | |
| ▲ | snvzz 3 minutes ago | parent [-] | | There's a reason RV32E and RV64E, with half the registers, are a thing. RV32I/RV64I isn't small enough. There are many chips in the market that do embed 8051s for janitorial tasks, because it is small. Some chips have several non-exposed tiny embedded CPUs within. RISC-V is replacing many of these. There are even open source designs like SERV that fit in a corner of an already small FPGA, leaving room in the FPGA for other purposes. |
|
|
| |
| ▲ | jacquesm 3 hours ago | parent | prev [-] | | You can save a lot of silicon by doing 8 or 16 bit shifters and then doing the rest at the code generation level. Not having any seems really anemic to me. |
|
| |
| ▲ | hackyhacky 4 hours ago | parent | prev [-] | | > One of the best parts about RiscV is that you can teach a freshman level architecture class or a senior level chip building project with an ISA that is actually used. Same could be said of MIPS. My understanding is the RISC-V raison d'etre is rather avoidance of patented/copyrighted designs. | | |
| ▲ | adgjlsfhk1 4 hours ago | parent [-] | | the avoidance of patent/copyright is critical for (legally) having students design their own chips. MIPS was pretty good (and widely used) for teaching assembly, but pretty bad for teaching a class where students design chips |
|
| |
| ▲ | fidotron 5 hours ago | parent | prev [-] | | The fact the Hazard3 designer ended up creating an extension to resolve related oddities was kind of astonishing. Why did it fall to them to do it? Impressive that he did, but it shouldn't have been necessary. | | |
|
| |
| ▲ | fidotron 5 hours ago | parent | prev | next [-] | | > RISC-V will get there, eventually. Not trolling: I legitimately don't see why this is assumed to be true. It is one of those things that is true only once it has been achieved. Otherwise we would be able to create super high performance Sparc or SuperH processors, and we don't. As you note, Arm once was fast, then slow, then fast. RISC-V has never actually been fast. It has enabled surprisingly good implementations by small numbers of people, but competing at the high end (mobile, desktop or server) it is not. | | |
| ▲ | lizknope 4 hours ago | parent | next [-] | | I think the bigger question is does RISC-V need to be fast? Who wants to make it fast? I'm a chip designer and I see people using RISC-V as small processor cores for things like PCIE link training or various bookkeeping tasks. These don't need to be fast, they need to be small and low power which means they will be relatively slow. Most people on tech review sites only care about desktop / laptop / server performance. They may know about some of the ARM Cortex A series CPUs that have MMUs and can run desktop or smartphone Linux versions. They generally don't care about the ARM Cortex M or R versions for embedded and real time use. Those are the areas where you don't need high performance and where RISC-V is already replacing ARM. EDIT: I'll add that there are companies that COULD make a fast RISC-V implementation. Intel, AMD, Apple, Qualcomm, or Nvidia could redirect their existing teams to design a high performance RISC-V CPU. But why should they? They are heavily invested in their existing x86 and ARM CPU lines. Amazon and Google are using licensed ARM cores in their server CPUs. What is the incentive for any of them to make a high performance RISC-V CPU? The only reason I can think of is that Softbank keeps raising ARM licensing costs and it gets high enough that it is more profitable to hire a team and design your own RISC-V CPU. | | |
| ▲ | adgjlsfhk1 3 hours ago | parent [-] | | Of your list, Qualcomm and Nvidia are fairly likely to make high perf Riscv cpus. Qualcomm because Arm sued them to try and stop them from designing their own arm chips without paying a lot more money, and Nvidia because they already have a lot of teams making riscv chips, so it seems likely that they will try to unify on the one that doesn't require licensing. | | |
| ▲ | lizknope an hour ago | parent [-] | | Yeah, they could but then what is the market? Qualcomm wants to sell smartphone chips and Android can run on RISC-V and most Android Java apps could in theory run. But if you look at the Intel x86 smartphone chips from about 10 years ago they had to make an ARM to x86 emulator because even the Java apps contained native ARM instructions for performance reasons. Qualcomm is trying to push their ARM Snapdragon chips in Windows laptops but I don't think they are selling well. Nvidia could also make RISC-V based chips but where would they go? Nvidia is moving further away from the consumer space to the data center space. So even if Nvidia made a really fast RISC-V CPU it would probably be for the server / data center market and they may not even sell it to ordinary consumers. Or if they did it could be like the Ampere ARM chips for servers. Yeah you can buy one as an ordinary consumer but they were in the $4,000 range last time I looked. How many people are going to buy that? |
|
| |
| ▲ | rwmj 5 hours ago | parent | prev | next [-] | | RISC-V doesn't have the pitfalls of Sparc (register windows, branch delay slots), largely because we learned from that. It's in fact a very "boring" architecture. There's no one that expects it'll be hard to optimize for. There are at least 2 designs that have taped out in small runs and have high end performance. | | |
| ▲ | adrian_b 4 hours ago | parent | next [-] | | RISC-V does not have the pitfalls of experimental ISAs from 45 years ago, but it has other pitfalls that have not existed in almost any ISA since the first vacuum-tube computers, like the lack of means for integer overflow detection and the lack of indexed addressing. Especially the lack of integer overflow detection is a choice of great stupidity, for which there exists no excuse. Detecting integer overflow in hardware is extremely cheap, its cost is absolutely negligible. On the other hand, detecting integer overflow in software is extremely expensive, increasing both the program size and the execution time considerably, because each arithmetic operation must be replaced by multiple operations. Because of the unacceptable cost, normal RISC-V programs choose to ignore the risk of overflows, which makes them unreliable. The highest performance implementations of RISC-V from previous years were forced to introduce custom extensions for indexed addressing, but those used inefficient encodings, because something like indexed addressing must be in the base ISA, not in an extension. | | |
| ▲ | adgjlsfhk1 4 hours ago | parent | next [-] | | > On the other hand, detecting integer overflow in software is extremely expensive this just isn't true. both addition and multiplication can check for overflow in <2 instructions. | | |
| ▲ | nine_k 3 hours ago | parent | next [-] | | Fewer than two is exactly one instruction. Which? | | | |
| ▲ | adrian_b 3 hours ago | parent | prev [-] | | You are delusional. Read the RISC-V documentation to see that you need at least 3 instructions. For addition, overflow happens when you add 2 positive numbers and the result is negative, or when you add 2 negative numbers and the result is positive. After using an instruction to do the addition, how can you detect this complex condition with a single instruction in the impoverished RISC-V ISA? EDIT:
Someone has downvoted this, presumably because I have not been polite enough. That may be true, but I consider much more impolite the kind of misleading false information that has been written by the poster to whom I have replied. It is difficult to read that kind of b*s*t and be cool about it. | | |
| ▲ | burntoutgray 2 hours ago | parent [-] | | +1 -- misinformation is best corrected quickly. If not, AI will propagate it and many will believe the erroneous information. I guess that would be viral hallucinations. |
|
| |
| ▲ | hackyhacky 4 hours ago | parent | prev [-] | | > On the other hand, detecting integer overflow in software is extremely expensive, increasing both the program size and the execution time considerably,
Most languages don't care about integer overflow. Your typical C program will happily wrap around. If I really want to detect overflow, I can do this:
    add t0, a0, a1
    blt t0, a0, overflow
Which is one more instruction, which is not great, not terrible. | | |
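One detail worth checking in the pair above is the branch: the unsigned idiom in the unprivileged spec uses `bltu`, while `blt` compares as signed. The Python sketch below models both variants on 64-bit masked integers so the difference can be tried out. It is a model under my own assumptions, not RISC-V code, and the helper names are made up.

```python
MASK64 = (1 << 64) - 1  # model a 64-bit register


def to_signed(x):
    """Reinterpret a 64-bit register value as a two's-complement integer."""
    x &= MASK64
    return x - (1 << 64) if x >> 63 else x


def add(a, b):
    """RV64 `add`: wrap-around 64-bit addition."""
    return (a + b) & MASK64


def check_bltu(a, b):
    """add t0, a, b ; bltu t0, a, overflow   (unsigned compare: carry out)."""
    t0 = add(a, b)
    return t0 < (a & MASK64)


def check_blt(a, b):
    """add t0, a, b ; blt t0, a, overflow    (signed compare)."""
    t0 = add(a, b)
    return to_signed(t0) < to_signed(a)


# All-ones plus one wraps to zero: an unsigned carry that only bltu reports.
wraps_unsigned = (MASK64, 1)
# Largest positive plus one: a signed overflow that only blt reports.
wraps_signed = ((1 << 63) - 1, 1)
# 5 + (-1): no overflow of either kind in signed terms, yet blt fires
# because the second operand is negative.
negative_addend = (5, MASK64)
```

In this model the `blt` form behaves as a signed check only when the second operand is known to be non-negative, while the `bltu` form reports carry out of an unsigned add.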
| ▲ | sitharus 3 hours ago | parent | next [-] | | Because the other commenter wasn’t posting the actual answer, I went to find the documentation about checking for integer overflow and it’s right here: https://docs.riscv.org/reference/isa/unpriv/rv32.html#2-1-4-...
And what did I find? Yep, that code is right from the manual for unsigned integer overflow. For signed addition, if you know one of the signs (e.g. it’s a compile-time constant), the manual says:
    addi t0, t1, +imm
    blt t0, t1, overflow
But the general case for signed addition, if you need to check for overflow and don’t have knowledge of the signs, is:
    add t0, t1, t2
    slti t3, t2, 0
    slt t4, t0, t1
    bne t3, t4, overflow
From what I’ve read, most natively compiled code doesn’t really check for overflows in optimised builds, but this is more of an issue for JavaScript et al., where they may detect the overflow and switch the underlying type? I’m definitely no expert on this. | | |
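The general signed sequence quoted above can be sanity-checked with a small Python model. This is only a sketch: registers are 64-bit masked integers, the helper names are invented, and the reference check simply uses Python's unbounded integers.

```python
MASK64 = (1 << 64) - 1
INT64_MIN = -(1 << 63)
INT64_MAX = (1 << 63) - 1


def to_reg(v):
    """Encode a Python int as a 64-bit two's-complement register value."""
    return v & MASK64


def to_signed(x):
    """Decode a 64-bit register value as a signed integer."""
    x &= MASK64
    return x - (1 << 64) if x >> 63 else x


def signed_add_overflows(t1, t2):
    """Model of: add t0,t1,t2 ; slti t3,t2,0 ; slt t4,t0,t1 ; bne t3,t4,overflow"""
    t0 = (t1 + t2) & MASK64                          # add  t0, t1, t2
    t3 = 1 if to_signed(t2) < 0 else 0               # slti t3, t2, 0
    t4 = 1 if to_signed(t0) < to_signed(t1) else 0   # slt  t4, t0, t1
    return t3 != t4                                  # bne  t3, t4, overflow


def reference_overflows(a, b):
    """Ground truth: does the exact sum fall outside the int64 range?"""
    return not (INT64_MIN <= a + b <= INT64_MAX)
```

The idea is that the wrapped sum should be smaller than `t1` exactly when `t2` is negative; any disagreement between those two facts means the result wrapped.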
| ▲ | sitharus 39 minutes ago | parent [-] | | A bit more reading shows there's a three-instruction general-case version for 32-bit additions on the 64-bit RISC-V ISA. I'm not familiar with RISC-V assembly and they didn't provide an example, but I _think_ it's as easy as this, since a 64-bit add wouldn't match the 32-bit overflowed add:
    add t0, t1, t2
    addw t3, t1, t2
    bne t0, t3, overflow
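A quick Python model suggests the guess holds, under one assumption that the comment does not state: both inputs are 32-bit values already sign-extended into the 64-bit registers, as is conventional on RV64. The helper names below are illustrative only.

```python
MASK64 = (1 << 64) - 1
MASK32 = (1 << 32) - 1


def reg32(v):
    """Place a signed 32-bit Python int into a 64-bit register, sign-extended."""
    assert -(1 << 31) <= v < (1 << 31)
    return v & MASK64


def sext32(x):
    """Sign-extend the low 32 bits of x to a 64-bit register value."""
    x &= MASK32
    return x | (MASK64 ^ MASK32) if x >> 31 else x


def add(a, b):
    """RV64 `add`: full 64-bit wrap-around addition."""
    return (a + b) & MASK64


def addw(a, b):
    """RV64 `addw`: add the low 32 bits, then sign-extend the 32-bit result."""
    return sext32(a + b)


def add32_overflows(a, b):
    """Model of: add t0,t1,t2 ; addw t3,t1,t2 ; bne t0,t3,overflow"""
    return add(a, b) != addw(a, b)
```

With sign-extended inputs the 64-bit sum cannot itself overflow, so it only differs from the `addw` result when the 32-bit sum does not fit in 32 bits.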
|
| |
| ▲ | adrian_b 4 hours ago | parent | prev | next [-] | | That is not the correct way to test for integer overflow. The correct sequence of instructions is given in the RISC-V documentation and it needs more instructions. "Integer overflow" means "overflow in operations with signed integers". It does not mean "overflow in operations with non-negative integers". The latter is normally referred to as "carry". The 2 instructions given above detect carry, not overflow. Carry is needed for multi-word operations, and these are also painful on RISC-V, but overflow detection is required much more frequently, i.e. it is needed at any arithmetic operation, unless it can be proven by static program analysis that overflow is impossible at that operation. | |
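To illustrate the multi-word point, here is a hedged Python sketch of a 128-bit addition built from 64-bit limbs the way an ISA without a carry flag has to do it: the carry is recomputed with an unsigned compare (`sltu`) instead of being read from a flag. Registers are modelled as masked integers and all names are invented for the example.

```python
MASK64 = (1 << 64) - 1


def add(a, b):
    """RV64 `add`: wrap-around 64-bit addition."""
    return (a + b) & MASK64


def sltu(a, b):
    """RV64 `sltu`: 1 if a < b as unsigned values, else 0."""
    return 1 if (a & MASK64) < (b & MASK64) else 0


def add128(a_lo, a_hi, b_lo, b_hi):
    """128-bit add from two 64-bit limbs, without a carry flag.

    Roughly: add lo,a_lo,b_lo ; sltu c,lo,a_lo ; add hi,a_hi,b_hi ; add hi,hi,c
    """
    lo = add(a_lo, b_lo)
    carry = sltu(lo, a_lo)      # the low sum wrapped iff it is below an input
    hi = add(add(a_hi, b_hi), carry)
    return lo, hi


def split(v):
    """Split a 128-bit Python int into (low, high) 64-bit limbs."""
    return v & MASK64, (v >> 64) & MASK64
```

That is four instructions for two limbs in this model, where an ISA with a flags register typically needs an add followed by an add-with-carry. Longer chains also have to track the carry out of each middle limb, which costs extra compares.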
| ▲ | refulgentis 4 hours ago | parent | prev [-] | | I have no idea or practical experience with anything this low-level, so idk how much following matters, it's just someone from the crowd offering unvarnished impressions: It's easy to believe you're replying to something that has an element of hyperbole. It's hard to believe "just do 2x as many instructions" and "ehhh who cares [i.e. your typical C program doesn't check for overflow]", coupled to a seemingly self-conscious repetition of a quip from the television series Chernobyl that is meant to reference sticking your head in the sand, retire the issue from discussion. | | |
| ▲ | adrian_b 4 hours ago | parent [-] | | There was no hyperbole in what I have said. The sequence of instructions given above is incorrect, it does not detect integer overflow (i.e. signed integer overflow). It detects carry, which is something else. The correct sequence, which can be found in the official RISC-V documentation, requires more instructions. Not checking for overflow in C programs is a serious mistake. All decent C compilers have compilation options for enabling checking for overflow. Such options should always be used, with the exception of the functions that have been analyzed carefully by the programmer and the conclusion has been that integer overflow cannot happen. For example with operations involving counters or indices, overflow cannot normally happen, so in such places overflow checking may be disabled. |
|
|
| |
| ▲ | classichasclass 4 hours ago | parent | prev | next [-] | | As a counterexample, I point to another relatively boring RISC, PA-RISC. It took off not (just) because the architecture was straightforward, but because HP poured cash into making it quick, and PA-RISC continued to be a very competitive architecture until the mass insanity of Itanic arrived. I don't see RISC-V vendors making that level of investment, either because they won't (selling to cheap markets) or can't (no capacity or funding), and a cynical take would say they hide them behind NDAs so no one can look behind the curtain. I know this is a very negative take. I don't try to hide my pro-Power ISA bias, but that doesn't mean I wouldn't like another choice. So far, however, I've been repeatedly disappointed by RISC-V. It's always "five or six years" from getting there. | | |
| ▲ | adrian_b 3 hours ago | parent [-] | | I would not call PA-RISC boring. Already at launch there was no doubt that it was a better ISA than SPARC or MIPS, and later it was improved. At the time when PA-RISC 2.0 was replaced by Itanium it was not at all clear which of the 2 ISAs was better. The later failures to design high-performance Itanium CPUs make it plausible that if HP had kept PA-RISC 2.0 they might have had more competitive CPUs than with Itanium. SPARC (formerly called Berkeley RISC) and MIPS were pioneers that experimented with various features or lack of features, but they were inferior from many points of view to the earlier IBM 801. The RISC ISAs developed later, including ARM, HP PA-RISC and IBM POWER, have avoided some of the mistakes of SPARC and MIPS, while also taking some features from IBM 801 (e.g. its addressing modes), so they were better. | | |
| ▲ | burntoutgray an hour ago | parent | next [-] | | ISAs fail to gain traction when the sufficiently smart compilers don't eventuate. The x86-64 is a dog's breakfast of features. But due to its widespread use, compiler writers make the effort to create compilers that optimize for its quirks. Itanium hardware designers were expecting the compiler writers to cater for its unique design. Intel is a semi company. As good as some of their compilers are, internally they invested more in their biggest seller and the Itanium never got the level of support that was anticipated at the outset. | |
| ▲ | classichasclass an hour ago | parent | prev [-] | | I mean "boring" in the sense that its ISA was relatively straightforward, no performance-entangling kinks like delay slots, a good set of typical non-windowed GPRs, no wild or exotic operations. And POWER/PowerPC and PA-RISC weren't a lot later than SPARC or MIPS, either. |
|
| |
| ▲ | fidotron 5 hours ago | parent | prev [-] | | > RISC-V doesn't have the pitfalls of Sparc (register windows, branch delay slots), You're saying ISA design does have implementation performance implications then? ;) > There's no one that expects it'll be hard to optimize for [Raises hand] > There are at least 2 designs that have taped out in small runs and have high end performance. Are these public? Edit: I should add, I'm well aware of the cultural mismatch between HN and the semi industry, and have been caught in it more than a few times, but I also know the semi industry well enough to not trust anything they say. (Everything from well meaning but optimistic through to outright malicious depending on the company). | | |
| ▲ | rwmj 5 hours ago | parent [-] | | The 2 designs I'm thinking of are (tiresomely) under NDA, although I'm sure others will be able to say what they are. Last November I had a sample of one of them in my hand and played with the silicon at their labs, running a bunch of AI workloads. They didn't let me take notes or photographs. > There's no one that expects it'll be hard to optimize for No one who is an expert in the field, and we (at Red Hat) talk to them routinely. |
|
| |
| ▲ | Findecanor 3 hours ago | parent | prev | next [-] | | Because today, getting a fast CPU out isn't as much an engineering issue as it is about getting the investment for hiring a world-class fab. The most promising RISC-V companies today have not set out to compete directly with Intel, AMD, Apple or Samsung, but are targeting a niche such as AI, HPC and/or high-end embedded such as automotive. And you can bet that Qualcomm has RISC-V designs in-house, but is only making ARM chips right now because ARM is where the market for smartphone and desktop SoCs is.
Once Google starts allowing RVA23 on Android / ChromeOS, the flood gates will open. | | |
| ▲ | adgjlsfhk1 3 hours ago | parent [-] | | It's very much both. You need millions of dollars for the fab, but you also need ~5 years to get 3 generations of cpus out (to fix all the performance bugs you find in the first two) |
| |
| ▲ | gt0 5 hours ago | parent | prev | next [-] | | I don't think anybody suggests Oracle couldn't make faster SPARC processors, it's just that development of SPARC ended almost 10 years ago. At the time SPARC was abandoned, it was very competitive. | | |
| ▲ | twoodfin 3 hours ago | parent | next [-] | | In single-threaded performance? That’s not how I remember it: Sun was pushing parallel throughput over everything else, with designs like the T-Series & Rock. | | |
| ▲ | gt0 2 hours ago | parent [-] | | Perhaps not single thread, but Rock was a dead end a while before Oracle pulled the plug, and Sun/Oracle's core market of course was always servers not workstations. We used Niagara machines at my work around the T2 era, a long time ago, but they were very competitive if you could saturate the cores and had the RAM to back it up. | | |
| ▲ | twoodfin an hour ago | parent [-] | | Sure, my work got a few of the Niagaras too and they were tremendous build machines for Solaris software. But if you’re judging an ISA by performance scalability, you generally want to look at single-threaded performance. |
|
| |
| ▲ | icedchai an hour ago | parent | prev [-] | | Sparc stopped being competitive in the early 2000’s. |
| |
| ▲ | snvzz an hour ago | parent | prev [-] | | Fast, RVA23-compatible microarchitectures already exist. Everything high performance seems to be based on RVA23, which is the current application profile and comparable to ARMv9 and x86-64v4. However, it takes time from microarchitecture to chips, and from chips to products on shelves. The very first RVA23-compatible chips to show up will likely be the spacemiT K3 SoC, due in development boards in April (i.e. next month). More of them, more performant, such as a development board with the Tenstorrent Ascalon CPU in the form of the Tenstorrent Atlantis SoC, are coming this summer. It is even possible such designs will show up in products aimed at the general public within the present year. |
| |
| ▲ | rwmj 5 hours ago | parent | prev | next [-] | | Marcin is working with us on RISC-V enablement for Fedora and RHEL, he's well aware of the problem with current implementations. We're hopeful that this'll be pretty much resolved by the end of the year. | | |
| ▲ | LeFantome 3 hours ago | parent [-] | | If he expects it to be resolved by the end of the year (and I agree it likely will be), why is he writing a post like this? Is this because Fedora 44 is going to beta? |
| |
| ▲ | Dwedit 5 hours ago | parent | prev | next [-] | | There's the ARM video from LowSpecGamer, where they talk about how they forgot to connect power to the chip, and it was still executing code anyway. According to Steve Furber, the chip was accidentally being powered from the protection diodes alone. So ARM was incredibly power efficient from the very beginning. | |
| ▲ | cogman10 5 hours ago | parent | prev | next [-] | | > AND the software with no architecture-specific optimisations The optimizations that'd be applied to ARM and MIPS would be equally applicable to RISC-V. I do not believe this is a lack of software optimization issue. We are well past the days where hand written assembly gives much benefit, and modern compilers like gcc and llvm do nearly identical work right up until it comes to instruction emissions (including determining where SIMD instructions could be placed). Unless these chips have very very weird performance characteristics (like the weirdness around x86's lea instruction being used for arithmetic) there's just not going to be a lot of missed heuristics. | | |
| ▲ | hrmtst93837 5 hours ago | parent | next [-] | | One thing compilers still struggle with is exploiting weird microarchitectural quirks or timing behaviors that aren't obvious from the ISA spec, especially with memory, cache and pipeline tuning. If a new RISC-V core doesn't expose the same prefetching tricks or has odd branch prediction you won't get parity just by porting the same backend. If you want peak numbers sometimes you do still need to tune libraries or even sprinkle in a bit of inline asm despite all the "let the compiler handle it" dogma. | | |
| ▲ | cogman10 5 hours ago | parent | next [-] | | While true, it's typically not going to be impactful on system performance. There's a reason, for example, why the linux distros all target a generic x86 architecture rather than a specific architecture. | | |
| ▲ | spockz 4 hours ago | parent | next [-] | | Not all. CachyOS has specific builds for v3, v4, and AMD Zen4/5: https://wiki.cachyos.org/features/optimized_repos/ | |
| ▲ | thesuperbigfrog 2 hours ago | parent | prev | next [-] | | Ubuntu recently added a more specific target for AMD64v3: https://discourse.ubuntu.com/t/introducing-architecture-vari... | |
| ▲ | adrian_b 4 hours ago | parent | prev [-] | | Some applications may target a generic x86 architecture without any impact on performance. However, other applications which must do cryptographic operations, audio/video processing, scientific/technical/engineering computing, etc. may have wildly different performances when compiled for different x86-64 ISA versions, for which dedicated assembly-language functions exist. | | |
| ▲ | cogman10 3 hours ago | parent | next [-] | | Granted, these applications do exist. They are simply becoming more and more rare. I'd also say that there's been a pretty steady dedicated effort toward abstracting the assembly. It's still pretty low level, as in you are caring about the specific instructions being used, but it's also not quite assembly in both C++/rust. Java, interestingly enough, is somewhat leading the way here with their Vector API. I think they actually have one of the better setups for allowing someone to write fast code that is platform independent. C++ is also diving into this realm. C++26 has just merged in SIMD support. That is the bulk of the benefit of diving down into assembly. https://en.cppreference.com/w/cpp/numeric/simd.html | | |
| ▲ | adrian_b 3 hours ago | parent [-] | | I would not say that such applications are becoming more and more rare. Most of the applications whose performance matters for me, because I must wait a non-negligible time for them to do their job, are dependent on assembly implementation for certain functions invoked inside critical loops. I do not see any sign of replacements for them. On the contrary, Intel, AMD and Arm continue to introduce special instructions that are useful in certain niche applications and taking advantage of them will require additional assembly language functions, not less. For me, there is only one application that I use and which consumes non-negligible computer time and which does not depend on SIMD optimizations, which is the compilation of software projects. |
| |
| ▲ | CyberDildonics 43 minutes ago | parent | prev [-] | | > audio/video processing, scientific/technical/engineering computing, etc. may have wildly different performances when compiled for different x86-64 ISA versions This is pretty vague and makes it sound like there are big differences in instruction sets. In actuality it comes down to memory access first, which has nothing to do with instructions. After that it comes down to simple SIMD/AVX instructions and not some exotic entirely different instruction set. |
|
| |
| ▲ | CyberDildonics an hour ago | parent | prev [-] | | The things you are talking about are taken care of by out of order execution and the CPU itself being smart about how it executes. Putting in prefetch instructions rarely beats the actual prefetcher itself. Compilers didn't end up generating perfect pentium asm either. OOO execution is what changed the game in not needing perfect compiler output any more. |
| |
| ▲ | bobmcnamara 5 hours ago | parent | prev [-] | | > The optimizations that'd be applied to ARM and MIPS would be equally applicable to RISC-V. There's no carry bit, and no widening multiply(or MAC) | | |
| ▲ | Findecanor 2 hours ago | parent [-] | | RISC-V splits widening multiply out into two instructions: one for the high bits and one for the low. Just like 64-bit ARM does. Integer MAC doesn't exist, and is also hindered by a design decision not to require more than two source operands, so as to allow simple implementations to stay simple.
The same reason also prevents RISC-V from having a true conditional move instruction: there is one but the second operand is hard-coded zero. FMAC exists, but only because it is in the IEEE 754 spec ... and it requires significant op-code space. |
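
To make the points above concrete, here is a minimal Python model of the idioms being discussed: a 64x64 widening multiply split into a low and a high instruction, recovering a carry with an unsigned compare instead of a flags register, and building a full conditional select from two "conditional zero" operations. It assumes the conditional instruction referred to is Zicond's `czero.eqz`/`czero.nez`; the Python function names are illustrative only, and this is a sketch of the semantics, not of any particular compiler's output.

```python
MASK64 = (1 << 64) - 1  # model 64-bit registers with Python integers


def mul(a, b):
    """Low 64 bits of the product, like RISC-V `mul`."""
    return (a * b) & MASK64


def mulhu(a, b):
    """High 64 bits of the unsigned product, like RISC-V `mulhu`."""
    return ((a * b) >> 64) & MASK64


def add_with_carry(a, b):
    """Add without a carry flag: the carry is recovered with an
    unsigned compare of the wrapped sum against one input (`sltu`)."""
    total = (a + b) & MASK64
    carry = 1 if total < a else 0
    return total, carry


def czero_eqz(value, cond):
    """Zicond `czero.eqz`: result is 0 if cond == 0, else value."""
    return 0 if cond == 0 else value


def czero_nez(value, cond):
    """Zicond `czero.nez`: result is 0 if cond != 0, else value."""
    return 0 if cond != 0 else value


def select(cond, if_true, if_false):
    """A full conditional move takes three instructions:
    two conditional zeroes plus an OR."""
    return czero_eqz(if_true, cond) | czero_nez(if_false, cond)


# 128-bit product of two 64-bit values, assembled from the two halves.
a, b = MASK64, 2
product = (mulhu(a, b) << 64) | mul(a, b)
```

The two-source-operand limit is visible in `select`: each step reads at most two registers, so what would be a single conditional move elsewhere becomes a short sequence here.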
|
| |
| ▲ | bsder 4 hours ago | parent | prev | next [-] | | > Don't blame the ISA - blame the silicon implementations That's true, but tautological. The issue is that the RISC-V core is the easy part of the problem, and nobody seems to even be able to generate a chip that gets that right without weirdness and quirks. The more fundamental technical problem is that things like the cache organization and DDR interface and PCI interface and ... cannot just be synthesized. They require analog/RF VLSI designers doing things like clock forwarding and signal integrity analysis. If you get them wrong, your performance tanks, and, so far, everybody has gotten them wrong in various ways. The business problem is the fact that everybody wants to be the "performance" RISC-V vendor, but nobody wants to be the "embedded" RISC-V vendor. This is a problem because practically anybody who is willing to cough up for a "performance" processor is almost completely insensitive to any cost premium that ARM demands. The embedded space is hugely sensitive to cost, but nobody is willing to step into it because that requires that you do icky ecosystem things like marketing, software, debugging tools, inventory distribution, etc. This leads to the US business problem which is the fact that everybody wants to be an IP vendor and nobody wants to ship a damn chip. Consequently, if I want actual RISC-V hardware, I'm stuck dealing with Chinese vendors of various levels of dodginess. | |
| ▲ | api 6 hours ago | parent | prev | next [-] | | A pattern I've noticed for a very long time: A lot of times the path to the highest performing CPU seems to be to optimize for power first, then speed, then repeat. That's because power and heat are a major design constraint that limits speed. I first noticed this way back with the Pentium 4 "Netburst" architecture vs. the smaller x86 cores that became the ancestor of the Core architecture. Intel eventually ran into a wall with P4 and then branched high performance cores off those lower-power ones and that's what gave us the venerable Core architecture that made Intel the dominant CPU maker for over a decade. ARM's history is another example. | | |
| ▲ | cpgxiii 5 hours ago | parent | next [-] | | I think the story is a bit more complicated. Core succeeded precisely because Intel had both the low-power experience with Pentium-M and the high-power experience with Netburst. The P4 architecture told them a lot about what was and wasn't viable and at what complexity. When you look at the successor generations from Core, what you see are a lot of more complex P4-like features being re-added, but with the benefits of improved microarch and fab processes. Obviously we will never know, but I don't think you would get to Haswell or Skylake in the form they were without the learning experience of the P4. In comparison, I think Arm is actually a very strong cautionary tale that focusing on power will not get you to performance. Arm processors remained pretty poor performance until designers from other CPU families entirely (PowerPC and Intel) took it on at Apple and basically dragged Arm to the performance level they are today. | | |
| ▲ | maximilianburke 2 hours ago | parent [-] | | And not just any PowerPC architects either, but the people from PA Semi. Motorola couldn't get the speed up and IBM couldn't get the power down. |
| |
| ▲ | jnovek 5 hours ago | parent | prev | next [-] | | I don’t have a micro architecture background so I apologize if this is obvious — What do power and speed mean in this context? | | |
| ▲ | McP 5 hours ago | parent | next [-] | | Power - how many Watts does it need?
Speed - how quickly can it perform operations? | | |
| ▲ | wmf 4 hours ago | parent [-] | | You can get low power with a simple design at a low clock. This definitely will not help achieve high performance later. | | |
| ▲ | weebull 3 hours ago | parent [-] | | Clock rate isn't the only factor. A design can be power hungry at a low clock rate if designed badly, and if it is... you're never getting that thing running fast. |
|
| |
| ▲ | unethical_ban 5 hours ago | parent | prev [-] | | One could say "Optimize for efficiency first, then performance". |
| |
| ▲ | cptskippy 5 hours ago | parent | prev | next [-] | | Core evolved from the Banias (Centrino) CPU core which was based on P3, not P4. Banias used the front-side bus from P4 but not the cores. Banias was hyper optimized for power; the mantra was to get done quickly and go to sleep to save power. Somewhere along the line someone said "hey what happens if we don't go to sleep?" and Core was born. | |
| ▲ | jauntywundrkind 5 hours ago | parent | prev [-] | | Parallels to code design, where optimizing data or code size can end up having fantastic performance benefits (sometimes). |
| |
| ▲ | dmitrygr 6 hours ago | parent | prev | next [-] | | IF you care to read the article, they indeed do not blame the architecture but the available silicon implementations. | | |
| ▲ | topspin 6 hours ago | parent | next [-] | | I keep checking in on Tenstorrent every few months thinking Keller is going to rock our world... losing hope. At this point the most likely place for truly competitive RISC-V to appear is China. | | |
| ▲ | Findecanor 2 hours ago | parent | next [-] | | Tenstorrent is supposedly taping out 8-wide Ascalon processors as we speak, with devboards projected to be available in Q2/Q3 this year. BTW. Keller is also on the board of AheadComputing — founded by former Intel engineers behind the fabled "Royal Core". | | |
| ▲ | snvzz 41 minutes ago | parent [-] | | >Ascalon tape out Supposedly happened earlier this year. Now we just wait. |
| |
| ▲ | rbanffy 6 hours ago | parent | prev [-] | | > At this point the most likely place for fast RISC-V to appear is China. Or we just adopt Loongson. | | |
| ▲ | balou23 6 hours ago | parent | next [-] | | TBH I still don't really get how it's different from MIPS. As far as I can tell... Loongson seems to be really just MIPS, while LoongArch is MIPS with some extra instructions. | | |
| ▲ | bonzini 5 hours ago | parent | next [-] | | LoongArch is, on a first approximation, an almost RISC-V user space instruction set together with MIPS-like privileged instructions and registers. | | | |
| ▲ | pantalaimon 5 hours ago | parent | prev | next [-] | | They did get rid of the delay slots and some other MIPS oddities | |
| ▲ | mananaysiempre 5 hours ago | parent | prev [-] | | But legally distinct! I guess calling it M○PS was not enough for plausible deniability. | | |
| |
| ▲ | throawayonthe 5 hours ago | parent | prev [-] | | (purely on vibes) loongson feels to me like an intermediate step/backup strategy rather than a longterm target (though they'll probably power govt equipment for decades of legacy either way :p) |
|
| |
| ▲ | rbanffy 6 hours ago | parent | prev | next [-] | | I did read it. A Banana Pi is not the fastest developer platform. The title is misleading. BTW, it's quite impressive how the s390x is so fast per core compared to the others. I mean, of course it's fast - we all knew that. And don't let IBM legal see this can be considered a published benchmark, because they are very shy about s390x performance numbers. | | |
| ▲ | Aurornis 5 hours ago | parent | next [-] | | > A Banana Pi is not the fastest developer platform. What is the current fastest platform that isn’t exorbitantly expensive? Not upcoming releases, but something I can actually buy. I check in every 3-6 months but the situation hasn’t changed significantly yet. | | |
| ▲ | adgjlsfhk1 4 hours ago | parent | next [-] | | A P550-based board is the best you can get for now (~2-3x faster than the Banana Pi). In 2-3 months there should be a number of SpacemiT K3 chips that are ~4-6x faster than the Banana Pi and somewhat reasonably priced (~200-300). By the end of the year, however, you should be able to get an Ascalon chip which should be way, way faster than that (roughly Apple M1/Zen 3 speed) | |
| ▲ | cestith 5 hours ago | parent | prev [-] | | What is the current fastest ppc64le implementation that isn’t exorbitantly expensive? How about the s390x? |
| |
| ▲ | gt0 6 hours ago | parent | prev | next [-] | | I was really surprised by the s390x performance, but I also don't really understand why there are build times listed by architecture, not the actual processors. | | |
| ▲ | kpil 4 hours ago | parent | next [-] | | What's fast on Z platforms is typically IO rather than raw CPU - the platform can push a lot of parallel data. This is typically the bottleneck when compiling. The cores are in my experience moderately fast at most. Note that there are a lot of licencing options and I think some are speed-capped - but I don't think that applies to IFL - a standard CPU licence-restricted to only run Linux. | | | |
| ▲ | rbanffy 5 hours ago | parent | prev | next [-] | | Probably because that's just the infrastructure they have. | |
| ▲ | pantalaimon 5 hours ago | parent | prev [-] | | i686 builds even faster |
| |
| ▲ | snvzz 38 minutes ago | parent | prev | next [-] | | >I did read it. A Banana Pi is not the fastest developer platform. The title is misleading. Ironically, its SoC (spacemiT K1) is slower than the JH7110 used in the first mass-produced RISC-V SBC, VisionFive 2. But unlike JH7110, it has vector 1.0, making it a very popular target. Of course, none of these pre-RVA23 boards will be relevant anymore, once the first development boards with RVA23-compatible K3 ship next month. | |
| ▲ | menaerus 6 hours ago | parent | prev [-] | | Which risc-v implementation is considered fast? | | |
| ▲ | LeFantome 3 hours ago | parent | next [-] | | > Which risc-v implementation is considered fast? SpacemiT K3 is 2010 Macbook performance single-core, 2019 Macbook Air multi-core, and better than M4 Apple Silicon for AI. So I guess it depends on what you are going to do with it. | |
| ▲ | patchnull 6 hours ago | parent | prev | next [-] | | Nothing shipping today is really competitive with modern ARM or x86. The SiFive P870 and Tenstorrent Ascalon (Jim Keller's team) are the most anticipated high-performance designs, but neither is widely available. What you can actually buy today tops out around Cortex-A76 class single-thread performance at best, which is roughly where ARM was five or six years ago. | | |
| ▲ | menaerus 5 hours ago | parent | next [-] | | I remember taking down some notes wrt SiFive P870 specs, comparing them to x86_64, and reaching the same conclusion. Narrower core width (4-wide vs 8-wide), lower clock frequency (peaks at 3GHz) and no turbo (?), limited support for vector execution (128-bit vs 512-bit), limited L1 bandwidth (1x 128-bit load/cycle?), limited FP compute (2x 128-bit vs 2x 512-bit), load queue is also inconveniently small with 48 entries (affecting already limited load bandwidth), unclear system memory bandwidth and how it scales wrt the number of cores (L3 contention) although for the latter they seem to use what AMD is doing (exclusive L3 cache per chiplet). | |
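
The width figures in the comment above can be turned into a rough per-cycle comparison. The sketch below uses only the numbers quoted there (2x 128-bit FP pipes for the P870 versus 2x 512-bit for the x86_64 comparison point), several of which the commenter marked as uncertain. It gives peak per-cycle figures only and ignores clock speed, turbo, memory bandwidth and whether the pipes can actually be kept fed; the helper names are hypothetical.

```python
def bits_per_cycle(pipes, width_bits):
    """Peak vector bits processed per cycle across all pipes."""
    return pipes * width_bits


def fp32_lanes_per_cycle(pipes, width_bits):
    """Peak number of 32-bit float lanes processed per cycle."""
    return bits_per_cycle(pipes, width_bits) // 32


# Figures as quoted in the comment, not independently verified.
p870_bits = bits_per_cycle(pipes=2, width_bits=128)
x86_bits = bits_per_cycle(pipes=2, width_bits=512)

p870_lanes = fp32_lanes_per_cycle(2, 128)
x86_lanes = fp32_lanes_per_cycle(2, 512)

# Peak per-cycle gap before clock frequency is taken into account.
ratio = x86_bits // p870_bits

print(f"P870: {p870_bits} bits/cycle, {p870_lanes} FP32 lanes")
print(f"x86_64: {x86_bits} bits/cycle, {x86_lanes} FP32 lanes")
print(f"peak per-cycle ratio: {ratio}x")
```

On these numbers the per-cycle FP gap alone is 4x, before the lower clock frequency is considered.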
| ▲ | LeFantome 3 hours ago | parent | prev [-] | | SpacemiT K3 is about the same performance as a Rockchip RK3588. So, 4 years ago? Except the K3 kills it on AI (60 TOPS). |
| |
| ▲ | NooneAtAll3 5 hours ago | parent | prev [-] | | DC-ROMA 2 is on the Raspberry Pi 4 level of performance last I heard |
|
| |
| ▲ | tromp 6 hours ago | parent | prev | next [-] | | But they didn't reflect that in a title like "current RISC-V silicon Is Sloooow" ... | |
| ▲ | spiderice 6 hours ago | parent | prev [-] | | Then how do you justify the title? |
| |
| ▲ | crest 3 hours ago | parent | prev [-] | | RISC-V lacks a bunch of really useful relatively easy to implement instructions and most extensions are truly optional so you can't rely on them. That's the problem if you let a bunch of academics turn your ISA into a paper mill. In theory you can spend a lot of effort to make a flawed ISA perform, but it will be neither easy nor pretty e.g. real world Linux distros can't distribute optimised packages for every uarch from dual-issue in-order RV64GC to 8-wide OoO RV64 with all the bells and whistles. Only in (deeply) embedded systems can you retarget the toolchain and optimise for each damn architecture subset you encounter. |
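
One common answer to the "can't ship a package per uarch" problem is to ship a few variants and choose between them at run time based on which extensions are present. The sketch below shows the idea on an ISA string in the underscore-separated form (for example `rv64imafdc_zba_zbb`). It is a simplified model: it assumes every multi-letter extension is preceded by an underscore, it ignores version suffixes, and the variant names and their required extension sets are hypothetical.

```python
def parse_isa(isa):
    """Split an ISA string into (xlen, set of extension names).

    Simplifying assumption: single-letter extensions follow the width
    directly, and each multi-letter extension is preceded by '_'.
    """
    isa = isa.lower()
    if not isa.startswith("rv"):
        raise ValueError(f"not a RISC-V ISA string: {isa!r}")
    base, *multi = isa.split("_")

    # Read the register width (32, 64, ...) that follows "rv".
    i = 2
    while i < len(base) and base[i].isdigit():
        i += 1
    xlen = int(base[2:i])

    exts = set(base[i:])
    if "g" in exts:
        # 'g' is shorthand for IMAFD plus Zicsr and Zifencei.
        exts.discard("g")
        exts |= set("imafd") | {"zicsr", "zifencei"}
    exts |= {name for name in multi if name}
    return xlen, exts


# Hypothetical build variants, most demanding first.
VARIANTS = [
    ("bitmanip+vector", {"v", "zba", "zbb"}),
    ("bitmanip", {"zba", "zbb"}),
]


def pick_variant(isa):
    """Return the first variant whose required extensions are all present."""
    _, exts = parse_isa(isa)
    for name, required in VARIANTS:
        if required <= exts:
            return name
    return "baseline"
```

For example, `pick_variant("rv64gc")` falls back to `"baseline"`, while a core that reports `v`, `zba` and `zbb` gets the most specialised variant. The cost the comment points at is still there: every variant has to be built, tested and shipped.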
|