brucehoult 2 days ago
> That documentation also seems to be the original source of the marketing claims that the K3 provides 60 TOPS of AI compute and can run 30B models at over 10 tokens/s. Well, sort of– as another spoiler, I can share that I hit a hard cap at an effective 3B

I'm running Qwen3-Coder-30B-A3B-Instruct-Q5_K_M.gguf on mine ... picked kind of at random from a web page, as I'm a complete n00b at running local LLMs and have never used anything other than ChatGPT or (mostly) Grok.

I get 6-7 tok/s, which is slow for throwing around program code, but fine for general knowledge queries. It's a reasonable speed to read along as it outputs. Simple queries start to give output in about 2 seconds.

e.g. I dunno ..

> What does auipc do?

`auipc` stands for "Add Upper Immediate to PC" and is a RISC-V instruction that:

## What it does:

- Takes a 20-bit immediate value and adds it to the program counter (PC)
- The immediate is sign-extended and shifted left by 12 bits
- The result is stored in a destination register

## Syntax:

```
auipc rd, imm
```

## Key characteristics:

- PC-relative addressing: Uses the current program counter as a base
- 12-bit offset: The immediate is shifted left by 12 bits, so it's effectively a 32-bit value
- No memory access: Only performs arithmetic, no memory operations
- Used for: Creating 32-bit immediate values, setting up base addresses for other instructions

## Example:

```assembly
auipc x5, 0x1000 # x5 = PC + (0x1000 << 12)
```

## Common uses:

- Setting up base addresses for load/store instructions
- Creating 32-bit immediate values when the standard `li` instruction can't handle the full range
- Jump table setup
- Address calculation for large data structures

The instruction is particularly useful in RISC-V because it allows for efficient 32-bit address generation while maintaining the 16-bit instruction size constraint.

[ Prompt: 20.6 t/s | Generation: 6.5 t/s ]

Output started in about 2 seconds.

> what is the airspeed velocity of an unladen swallow?
The classic answer to this question comes from Monty Python's "Holy Grail" - when asked about the airspeed velocity of an unladen swallow, the character replies "What do you mean? An African or European swallow?"

But in reality, the airspeed velocity depends on the species:

European Swallow (Hirundo rustica):

- Typical flight speed: 15-20 mph (24-32 km/h)
- Maximum speed: around 25 mph (40 km/h)

African Swallow (Hirundo sp.):

- Similar speeds to European swallows, though some species can reach up to 30 mph (48 km/h)

The "unladen" part refers to the swallow not carrying anything, which would slow it down. Swallows are quite agile birds that can fly at impressive speeds while maneuvering through the air.

So while the Python reference is the more famous answer, the real-world airspeed of an unladen swallow is roughly 15-25 mph, depending on the specific species and conditions.

[ Prompt: 25.5 t/s | Generation: 6.6 t/s ]

Again, output starts in about two seconds.

This is offline, no internet, and uses 14W while running all 8 A100 "AI" cores at max.

Is this useful? I mean, for something, right?

I asked it to review https://github.com/brucehoult/trv which is a total of 320 lines of code (I used `/read` on a tar file containing the two code files). It thought for 22 minutes before output started and then spent 8 minutes outputting comments at just over 6.5 tok/s.

Nothing there to scare Claude, but 30 minutes total is still faster than asking a colleague for a code review, and probably more comprehensive too. And it did it on about 0.25 cents of electricity.

> Turns out getting a thread onto the A100 cores requires a two-step handshake:
>
> write the thread’s TID to /proc/set_ai_thread (a kernel interface that unlocks scheduling on cores 8–15 for that specific thread)
> then call sched_setaffinity to pin it.
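For anyone who wants to try the quoted handshake from a higher-level language, here is a minimal Python sketch of it. The `/proc/set_ai_thread` path and the 8–15 core range come from the quote; the function names, the injectable `proc_path` / `set_affinity` parameters, and the assumption that the kernel file accepts the TID as plain decimal text are illustrative guesses, not taken from the kernel source.

```python
import os
import threading

AI_PROC_PATH = "/proc/set_ai_thread"  # interface named in the quoted article
AI_CORES = frozenset(range(8, 16))    # cores 8-15, per the quoted article


def move_to_ai_cores(proc_path=AI_PROC_PATH, cores=AI_CORES, set_affinity=None):
    """Run the two-step handshake for the calling thread; return its TID.

    proc_path and set_affinity are hypothetical hooks so the function can be
    exercised on a machine that is not a K3.
    """
    tid = threading.get_native_id()

    # Step 1: tell the kernel this thread may be scheduled on the A100 cores.
    # Assumption: the file takes the TID as plain decimal text.
    with open(proc_path, "w") as f:
        f.write(str(tid))

    # Step 2: pin the thread to those cores (Linux-only call).
    (set_affinity or os.sched_setaffinity)(tid, cores)
    return tid


def launch_on_ai_cores(argv):
    """Do the handshake, then replace this process with the given command."""
    move_to_ai_cores()
    os.execvp(argv[0], argv)
```

Something like `launch_on_ai_cores(["make", "-j8", "test"])` would then do the handshake and exec the build, assuming the kernel interface behaves as the quote describes.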
If you want to just run arbitrary Linux programs on the A100 cores, I wrote a small assembly language launcher which does the above PID writing and then EXECs the thing you really want.

    # just run a single program on the A100 cores
    ai as hello.s -o hello.o

    # same thing but maybe 1ms faster
    aix /usr/bin/as hello.s -o hello.o

    # run a whole build. All processes started by `make` will run on the A100 cores.
    ai make -j8 test

    # start a shell on the A100 cores. All programs run from it will be run only on the A100 cores
    ai bash

https://github.com/brucehoult/k3_ai

Used as normal CPUs, the eight 2-wide in-order A100 cores (like an A53 or A55 or Pentium or PPC603) add about 40% more scalar processing power on top of the eight X100 cores. That's better than Hyperthreading and well worth using for some additional processing power. Just kick off a background build, or CI or something, there while you do something else on the X100 cores.

If you ignore the special "AI" matrix processing extension, they are just perfectly normal RISC-V RVA23 cores as far as user code is concerned, and in fact significantly faster than the previous generation K1 chip.

A Linux kernel build on just the A100 "AI" cores is faster than on any previous RISC-V SBC under $1000, including the HiFive Premier P550 or Milk-V Megrez. It's several times faster than the VisionFive 2 or Milk-V Jupiter / BPI-F3.

    Build Linux kernel 7503345ac5f5

    14m25.56s   SpacemiT K3 8 X100 cores plus 8x A100 cores distcc
    18m3.871s   SpacemiT K3 8 X100 cores @2.4 GHz
    19m12.787s  i9-13900HX, 24C/32T @5.4 GHz, riscv64/ubuntu docker
    39m23.187s  SpacemiT K3 8 A100 cores @2.0 GHz
    42m12.414s  Milk-V Megrez 4 P550 cores @1.8 GHz
    67m35.189s  VisionFive 2, 4 U74 cores @1.5 GHz
    70m57.001s  LicheePi 3A, 8 X60 cores @1.6 GHz
    88m04.185s  LicheePi 4A, 4 C910 cores @1.85 GHz

The K3 is also faster than using QEMU/Docker on my 24 core i9-13900 laptop, while using 25W instead of 200W.
Note that the fastest time uses one distccd on the X100 cores and another distccd on the A100 cores. This adds a lot of overhead in preprocessing and communication over the network (loopback, but still), yet it still gives a pretty nice boost.

Running independent tasks on each set of cores is more efficient, though. Or teaching `gmake` or `ninja` to distribute to two pools of cores using my `ai` launcher would be even better ...
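As a rough sanity check, the three K3 kernel-build timings from the table above can be turned into relative throughput. This small Python calculation uses only those numbers; the "ideal" figure assumes the two core sets could split the work perfectly with zero overhead, which is an idealisation and not something measured here.

```python
def seconds(minutes, secs):
    """Convert a 'XmY.ZZs' timing into seconds."""
    return minutes * 60 + secs


# Timings copied from the kernel-build table.
x100 = seconds(18, 3.871)   # 8 X100 cores only
a100 = seconds(39, 23.187)  # 8 A100 cores only
both = seconds(14, 25.56)   # X100 + A100 via distcc

# A100 throughput as a fraction of X100 throughput on this workload.
a100_relative = x100 / a100

# Build time if both core sets shared the work with no overhead
# (throughputs add, so the times combine harmonically).
ideal = 1 / (1 / x100 + 1 / a100)

ideal_speedup = x100 / ideal      # equals 1 + a100_relative
measured_speedup = x100 / both    # what distcc actually achieved

print(f"A100 relative throughput: {a100_relative:.2f}")
print(f"ideal combined build:     {ideal:.0f} s ({ideal_speedup:.2f}x)")
print(f"measured with distcc:     {both:.0f} s ({measured_speedup:.2f}x)")
```

On these numbers the A100 cores deliver about 46% of the X100 throughput for this one workload, and the distcc run is about 1.25x faster than the X100 cores alone, so roughly half of the theoretical gain survives the distcc overhead.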