I don't know how they got their 3 GB/s memory bandwidth.
My own testing shows 5347.7 MB/s on a 64 MiB to 64 MiB `memcpy()` using a basic 7-instruction RVV copy loop on an X100 core. Since every byte is both read and written, that's a total of 10.7 GB/s memory bandwidth.
The A100 "AI" cores do better, with 13225.9 MB/s on the same 64 MiB to 64 MiB copy, for a total of 26.5 GB/s memory bandwidth.
In cache, both core types manage about 25 GB/s on `memcpy()`, or about 50 GB/s of combined read and write traffic.
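For anyone checking the arithmetic, here is a minimal C sketch of how the copy figures relate to the totals. The function names are made up for illustration. One assumption is baked in: the figures in the tables only reproduce if "MB" is read as 2^20 bytes (64 MiB in 11967851.6 ns gives 5347.7), and the GB/s totals appear to be that figure doubled and divided by 1000.

```c
#include <assert.h>
#include <stdio.h>

/* Copy speed in units of 2^20 bytes per second, given the number of bytes
   copied and the time per call in nanoseconds.
   Assumption: "MB" means 2^20 bytes, which is what matches the tables. */
double copy_speed(double bytes, double ns)
{
    assert(ns > 0.0);
    return (bytes / (1024.0 * 1024.0)) / (ns * 1e-9);
}

/* memcpy() reads each byte once and writes it once, so the traffic seen
   by the memory system is twice the copy speed. */
double total_traffic(double speed)
{
    return 2.0 * speed;
}

/* Hypothetical helper: print copy speed and doubled total for one row. */
void report(const char *core, double bytes, double ns)
{
    double speed = copy_speed(bytes, ns);
    printf("%s: copy %.1f, read+write %.1f\n", core, speed, total_traffic(speed));
}
```

For example, `copy_speed(67108864, 11967851.6)` comes out at about 5347.7, and doubling it gives about 10695, which is the 10.7 GB/s quoted for the X100.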
On X100 cores:

    bruce@k3:~$ ./test_memcpy
    Byte size :         ns    Speed
            0 :        6.3      0.0 MB/s
            1 :        6.5    147.6 MB/s
            2 :        6.5    295.7 MB/s
            4 :        6.3    602.7 MB/s
            8 :        6.4   1193.6 MB/s
           16 :        6.4   2402.1 MB/s
           32 :        6.4   4796.1 MB/s
           64 :        7.1   8558.1 MB/s
          128 :        7.1  17313.7 MB/s
          256 :       12.6  19444.2 MB/s
          512 :       20.8  23424.8 MB/s
         1024 :       39.8  24563.3 MB/s
         2048 :       80.4  24284.2 MB/s
         4096 :      158.0  24722.1 MB/s
         8192 :      312.5  24997.6 MB/s
        16384 :      609.6  25630.4 MB/s
        32768 :     1287.0  24281.6 MB/s
        65536 :     2761.8  22630.4 MB/s
       131072 :     6463.0  19340.9 MB/s
       262144 :    12897.6  19383.5 MB/s
       524288 :    25779.1  19395.6 MB/s
      1048576 :    52356.4  19099.9 MB/s
      2097152 :   111030.3  18013.1 MB/s
      4194304 :   569240.2   7026.9 MB/s
      8388608 :  1468409.2   5448.1 MB/s
     16777216 :  2905474.6   5506.8 MB/s
     33554432 :  5769324.2   5546.6 MB/s
     67108864 : 11967851.6   5347.7 MB/s

And on A100:

    bruce@k3:~$ ai ./test_memcpy
    Byte size :         ns    Speed
            0 :       21.0      0.0 MB/s
            1 :       82.7     11.5 MB/s
            2 :       82.9     23.0 MB/s
            4 :       82.9     46.0 MB/s
            8 :       82.8     92.2 MB/s
           16 :       82.9    184.2 MB/s
           32 :       82.9    368.2 MB/s
           64 :       87.2    699.7 MB/s
          128 :       87.1   1401.7 MB/s
          256 :       87.2   2799.1 MB/s
          512 :       77.2   6326.1 MB/s
         1024 :       82.9  11784.2 MB/s
         2048 :       98.4  19855.9 MB/s
         4096 :      193.5  20191.4 MB/s
         8192 :      313.5  24916.8 MB/s
        16384 :      627.0  24919.0 MB/s
        32768 :     1254.2  24915.7 MB/s
        65536 :     2508.0  24920.1 MB/s
       131072 :     5017.3  24913.6 MB/s
       262144 :    10036.5  24909.0 MB/s
       524288 :    20075.0  24906.6 MB/s
      1048576 :    62556.9  15985.4 MB/s
      2097152 :   152324.5  13129.9 MB/s
      4194304 :   303466.3  13181.0 MB/s
      8388608 :   610230.0  13109.8 MB/s
     16777216 :  1186394.5  13486.2 MB/s
     33554432 :  2317591.8  13807.4 MB/s
     67108864 :  4838988.3  13225.9 MB/s

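The source of `test_memcpy` isn't shown here, so the following is only a guess at the kind of harness that would print a table in this shape: time many `memcpy()` calls per size with `clock_gettime()` and report the average. The function names, the repetition counts, and the 2^20-byte "MB" are all assumptions, not the actual program.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* for clock_gettime() under strict -std= modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Monotonic wall-clock time in nanoseconds. */
double now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1e9 + (double)ts.tv_nsec;
}

/* Average nanoseconds per memcpy() call of n bytes, over reps calls.
   The volatile function pointer keeps the compiler from inlining or
   removing the calls. */
double time_memcpy_ns(void *dst, const void *src, size_t n, long reps)
{
    void *(*volatile copy)(void *, const void *, size_t) = memcpy;
    assert(reps > 0);
    double t0 = now_ns();
    for (long i = 0; i < reps; i++)
        copy(dst, src, n);
    return (now_ns() - t0) / (double)reps;
}

/* Print one row per size: 0, 1, 2, 4, ... up to max_bytes. */
void print_table(size_t max_bytes)
{
    size_t alloc = max_bytes ? max_bytes : 1;
    unsigned char *src = malloc(alloc);
    unsigned char *dst = malloc(alloc);
    if (!src || !dst) {
        free(src);
        free(dst);
        return;
    }
    memset(src, 0xA5, alloc);   /* touch every page before timing */
    memset(dst, 0, alloc);

    printf("%9s : %10s  %7s\n", "Byte size", "ns", "Speed");
    for (size_t n = 0; n <= max_bytes; n = n ? n * 2 : 1) {
        long reps = 1000000;                 /* arbitrary choice */
        if (n > 64)
            reps = (long)(64000000 / n);     /* fewer reps for big copies */
        if (reps < 4)
            reps = 4;
        double ns = time_memcpy_ns(dst, src, n, reps);
        /* "MB" taken as 2^20 bytes (assumption) */
        double speed = ns > 0.0 ? ((double)n / 1048576.0) / (ns * 1e-9) : 0.0;
        printf("%9zu : %10.1f  %7.1f MB/s\n", n, ns, speed);
    }
    free(src);
    free(dst);
}
```

Calling `print_table(64u << 20)` from `main()` would walk the sizes from 0 to 64 MiB. Because it calls whatever `memcpy` the program is linked against, linking in a different implementation changes what is measured.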
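Both runs use the following `memcpy()`:

            .globl  memcpy
    memcpy:
            mv      a3, a0                      # a3 = running dst; a0 stays intact as the return value
    0:      vsetvli a4, a2, e8, m4, ta, ma      # a4 = bytes this pass (8-bit elements, LMUL=4)
            vle8.v  v0, (a1)                    # load a4 bytes from src into v0..v3
            sub     a2, a2, a4                  # remaining -= a4
            add     a1, a1, a4                  # src += a4
            vse8.v  v0, (a3)                    # store a4 bytes to dst
            add     a3, a3, a4                  # dst += a4
            bnez    a2, 0b                      # loop until nothing remains
            ret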
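To make the loop structure easier to follow, here is a scalar C model of the same strip-mining pattern. It is a sketch, not a drop-in replacement: `memcpy_model`, `loop_iterations`, and the `vlmax_bytes` parameter are made-up names, and it assumes `vsetvli` simply returns min(remaining, VLMAX), which is one behaviour the vector spec permits. With `e8, m4`, VLMAX in bytes is 4 × VLEN / 8.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar model of the RVV copy loop. vlmax_bytes stands in for VLMAX at
   e8,m4 (hypothetical parameter; the real value depends on the core's VLEN). */
void *memcpy_model(void *dst, const void *src, size_t n, size_t vlmax_bytes)
{
    unsigned char *d = dst;                 /* mv a3, a0 */
    const unsigned char *s = src;
    assert(vlmax_bytes > 0);
    do {
        /* vsetvli: assumed to pick min(remaining, VLMAX) */
        size_t vl = n < vlmax_bytes ? n : vlmax_bytes;
        for (size_t i = 0; i < vl; i++)     /* vle8.v followed by vse8.v */
            d[i] = s[i];
        n -= vl;                            /* sub a2, a2, a4 */
        s += vl;                            /* add a1, a1, a4 */
        d += vl;                            /* add a3, a3, a4 */
    } while (n != 0);                       /* bnez a2, 0b */
    return dst;                             /* a0 was never modified */
}

/* Number of passes through the loop under the same assumption.
   The loop body always runs once, even for n == 0. */
size_t loop_iterations(size_t n, size_t vlmax_bytes)
{
    assert(vlmax_bytes > 0);
    return n == 0 ? 1 : (n + vlmax_bytes - 1) / vlmax_bytes;
}
```

With a hypothetical VLEN of 256 bits, `vlmax_bytes` would be 128, so a 1024-byte copy takes 8 passes and a 300-byte copy takes 3 (128 + 128 + 44). A zero-length copy still executes the loop body once, with `vl` equal to 0.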