▲	jauntywundrkind 6 hours ago
		Pentium 4 (2000) launched with 3.2GB/s of memory bandwidth and scaled to 6.4GB/s over the years. That was not a chip to be proud of, but it provides a snapshot, a reference point in time to compare against.

		Having 3GB/s memory bandwidth here is... surprising. Given how lopsided the single- vs multi-core scores look, it sure seems likely. Having an "AI" inference chip with such bandwidth is wild.

		Compare the Cix P1 / Orange Pi 6: its ~42 GB/s compares well to the P4's L2 cache speed! Wow. RK3588 real world will show ~22GB/s, RPi 5 17GB/s.

		NVMe reads were faster! (Some interesting potential wins there, assuming you can get data from NVMe onto the core without going through main memory, a feature available since Sandy Bridge-EP (2011) in the form of Data Direct I/O, aka DDIO.) I crack jokes about "PCIe speed ahead", but that's seemingly real here (at huge cost to latency, which CXL promises to remedy).

		There is a non-zero chance the main cores cannot saturate what the memory controller can do, and that the AI cores have some reserved bandwidth to themselves. I doubt it's going to double the memory bandwidth.

		One absolute ecosystem gem from this article that I didn't know before: the Orange Pi 6 uses CrosEC, the embedded controller for Chromebooks (RIP I guess?). I wonder if this is the newer Zephyr-based version (awesome, also underlies Framework's new embedded controllers) or the older legacy version of CrosEC. It's not spoken of flatteringly in this implementation, but the borrowing of firmware from a place I didn't expect is super notable to me! And there's good upstream kernel support, so it makes sense! https://chromium.googlesource.com/chromiumos/platform/ec/+/H...

		One interesting architectural nit I need to dig into: the AI cores appear to share AI units between them. This reminds me a lot of AMD Bulldozer (2011), which had semi-independent CPUs but a shared FPU. It was an interesting chip (still haven't disposed of my old FX-8320 server), but not well loved.

		Really appreciate the dive into the matrix cores. That's going to take more time for me to look at, but: thanks. I notice the architecture diagram says all cores have AI instructions, not just the A100s. Presumably it's the same instruction set/features?

		The memory bandwidth situation here feels so off. We've lived in a world where the battle was for cores, where how many cores one could ship made chip empires rise and fall. Today the memory bandwidth wars are on, and supplies are scarce. This looks like a fascinating board with amazing capabilities, but wow, the lack of memory bandwidth here is most surprising.
	▲	brucehoult 42 minutes ago
		I don't know how they got their 3 GB/s memory bandwidth.

		My own testing shows 5347.7 MB/s on a 64 MiB to 64 MiB `memcpy()` using a basic 7-instruction RVV copy loop on an X100 core. Counting both the read and the write, that's a total of 10.7 GB/s memory bandwidth.

		The A100 "AI" cores do better, with 13225.9 MB/s on the 64 MiB to 64 MiB copy, for a total of 26.5 GB/s memory bandwidth.

		Both core types manage a 25 GB/s `memcpy()`, a total of 50 GB/s, when the data is in cache.

		On X100 cores:

		    bruce@k3:~$ ./test_memcpy
		    Byte size :         ns       Speed
		            0 :        6.3      0.0 MB/s
		            1 :        6.5    147.6 MB/s
		            2 :        6.5    295.7 MB/s
		            4 :        6.3    602.7 MB/s
		            8 :        6.4   1193.6 MB/s
		           16 :        6.4   2402.1 MB/s
		           32 :        6.4   4796.1 MB/s
		           64 :        7.1   8558.1 MB/s
		          128 :        7.1  17313.7 MB/s
		          256 :       12.6  19444.2 MB/s
		          512 :       20.8  23424.8 MB/s
		         1024 :       39.8  24563.3 MB/s
		         2048 :       80.4  24284.2 MB/s
		         4096 :      158.0  24722.1 MB/s
		         8192 :      312.5  24997.6 MB/s
		        16384 :      609.6  25630.4 MB/s
		        32768 :     1287.0  24281.6 MB/s
		        65536 :     2761.8  22630.4 MB/s
		       131072 :     6463.0  19340.9 MB/s
		       262144 :    12897.6  19383.5 MB/s
		       524288 :    25779.1  19395.6 MB/s
		      1048576 :    52356.4  19099.9 MB/s
		      2097152 :   111030.3  18013.1 MB/s
		      4194304 :   569240.2   7026.9 MB/s
		      8388608 :  1468409.2   5448.1 MB/s
		     16777216 :  2905474.6   5506.8 MB/s
		     33554432 :  5769324.2   5546.6 MB/s
		     67108864 : 11967851.6   5347.7 MB/s

		And on A100:

		    bruce@k3:~$ ai ./test_memcpy
		    Byte size :         ns       Speed
		            0 :       21.0      0.0 MB/s
		            1 :       82.7     11.5 MB/s
		            2 :       82.9     23.0 MB/s
		            4 :       82.9     46.0 MB/s
		            8 :       82.8     92.2 MB/s
		           16 :       82.9    184.2 MB/s
		           32 :       82.9    368.2 MB/s
		           64 :       87.2    699.7 MB/s
		          128 :       87.1   1401.7 MB/s
		          256 :       87.2   2799.1 MB/s
		          512 :       77.2   6326.1 MB/s
		         1024 :       82.9  11784.2 MB/s
		         2048 :       98.4  19855.9 MB/s
		         4096 :      193.5  20191.4 MB/s
		         8192 :      313.5  24916.8 MB/s
		        16384 :      627.0  24919.0 MB/s
		        32768 :     1254.2  24915.7 MB/s
		        65536 :     2508.0  24920.1 MB/s
		       131072 :     5017.3  24913.6 MB/s
		       262144 :    10036.5  24909.0 MB/s
		       524288 :    20075.0  24906.6 MB/s
		      1048576 :    62556.9  15985.4 MB/s
		      2097152 :   152324.5  13129.9 MB/s
		      4194304 :   303466.3  13181.0 MB/s
		      8388608 :   610230.0  13109.8 MB/s
		     16777216 :  1186394.5  13486.2 MB/s
		     33554432 :  2317591.8  13807.4 MB/s
		     67108864 :  4838988.3  13225.9 MB/s

		Both runs use the following `memcpy()`:

		        .globl memcpy
		    memcpy:
		        mv      a3, a0                  # keep a0 intact as the return value
		    0:
		        vsetvli a4, a2, e8, m4, ta, ma  # a4 = bytes handled this pass
		        vle8.v  v0, (a1)                # load a4 bytes from src
		        sub     a2, a2, a4              # remaining -= a4
		        add     a1, a1, a4              # src += a4
		        vse8.v  v0, (a3)                # store a4 bytes to dst
		        add     a3, a3, a4              # dst += a4
		        bnez    a2, 0b                  # loop until nothing remains
		        ret