brucehoult 2 hours ago

I don't know how they got their 3 GB/s memory bandwidth.

My own testing shows 5347.7 MB/s on a 64 MiB to 64 MiB `memcpy()` using a basic 7-instruction RVV copy loop on an X100 core. Counting both the read and the write, that's a total of 10.7 GB/s of memory bandwidth.

The A100 "AI" cores do better, with 13225.9 MB/s on the 64 MiB to 64 MiB copy, for a total of 26.5 GB/s of memory bandwidth.

In cache, both core types do a 25 GB/s `memcpy()`, for a total of 50 GB/s.
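For anyone checking the arithmetic against the tables: the Speed column works out to bytes divided by seconds divided by 2^20, so the tool's "MB/s" appears to be MiB/s. The totals are twice the copy rate, since every byte copied is read once and written once. A minimal sketch in C, where the function names are my own and not taken from `test_memcpy`:

```c
/* Copy rate in MiB/s (2^20 bytes per second) from a byte count and a time in ns. */
static double copy_rate_mib_s(double bytes, double ns)
{
    double seconds = ns * 1e-9;
    return bytes / seconds / (1024.0 * 1024.0);
}

/* A copy reads each byte once and writes it once, so total memory
 * traffic is twice the copy rate (same unit as the input). */
static double total_traffic(double copy_rate)
{
    return 2.0 * copy_rate;
}
```

`copy_rate_mib_s(67108864, 11967851.6)` comes out at about 5347.7 and `copy_rate_mib_s(67108864, 4838988.3)` at about 13225.9, matching the last row of each table. Doubling those gives roughly 10695 and 26452, which is where the 10.7 and 26.5 figures come from.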

On X100 cores:

    bruce@k3:~$ ./test_memcpy 
    Byte size :              ns     Speed
            0 :             6.3       0.0 MB/s
            1 :             6.5     147.6 MB/s
            2 :             6.5     295.7 MB/s
            4 :             6.3     602.7 MB/s
            8 :             6.4    1193.6 MB/s
           16 :             6.4    2402.1 MB/s
           32 :             6.4    4796.1 MB/s
           64 :             7.1    8558.1 MB/s
          128 :             7.1   17313.7 MB/s
          256 :            12.6   19444.2 MB/s
          512 :            20.8   23424.8 MB/s
         1024 :            39.8   24563.3 MB/s
         2048 :            80.4   24284.2 MB/s
         4096 :           158.0   24722.1 MB/s
         8192 :           312.5   24997.6 MB/s
        16384 :           609.6   25630.4 MB/s
        32768 :          1287.0   24281.6 MB/s
        65536 :          2761.8   22630.4 MB/s
       131072 :          6463.0   19340.9 MB/s
       262144 :         12897.6   19383.5 MB/s
       524288 :         25779.1   19395.6 MB/s
      1048576 :         52356.4   19099.9 MB/s
      2097152 :        111030.3   18013.1 MB/s
      4194304 :        569240.2    7026.9 MB/s
      8388608 :       1468409.2    5448.1 MB/s
     16777216 :       2905474.6    5506.8 MB/s
     33554432 :       5769324.2    5546.6 MB/s
     67108864 :      11967851.6    5347.7 MB/s

And on A100:

    bruce@k3:~$ ai ./test_memcpy 
    Byte size :              ns     Speed
            0 :            21.0       0.0 MB/s
            1 :            82.7      11.5 MB/s
            2 :            82.9      23.0 MB/s
            4 :            82.9      46.0 MB/s
            8 :            82.8      92.2 MB/s
           16 :            82.9     184.2 MB/s
           32 :            82.9     368.2 MB/s
           64 :            87.2     699.7 MB/s
          128 :            87.1    1401.7 MB/s
          256 :            87.2    2799.1 MB/s
          512 :            77.2    6326.1 MB/s
         1024 :            82.9   11784.2 MB/s
         2048 :            98.4   19855.9 MB/s
         4096 :           193.5   20191.4 MB/s
         8192 :           313.5   24916.8 MB/s
        16384 :           627.0   24919.0 MB/s
        32768 :          1254.2   24915.7 MB/s
        65536 :          2508.0   24920.1 MB/s
       131072 :          5017.3   24913.6 MB/s
       262144 :         10036.5   24909.0 MB/s
       524288 :         20075.0   24906.6 MB/s
      1048576 :         62556.9   15985.4 MB/s
      2097152 :        152324.5   13129.9 MB/s
      4194304 :        303466.3   13181.0 MB/s
      8388608 :        610230.0   13109.8 MB/s
     16777216 :       1186394.5   13486.2 MB/s
     33554432 :       2317591.8   13807.4 MB/s
     67108864 :       4838988.3   13225.9 MB/s

That's using the following `memcpy()` in both cases.

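    .globl memcpy
    memcpy:
            mv      a3, a0                  # work on a copy of dst; a0 stays as the return value
    0:      vsetvli a4, a2, e8, m4, ta, ma  # a4 = bytes to copy this pass (8-bit elements, LMUL=4)
            vle8.v  v0, (a1)                # load a4 bytes from src
            sub     a2, a2, a4              # remaining -= a4
            add     a1, a1, a4              # src += a4
            vse8.v  v0, (a3)                # store a4 bytes to dst
            add     a3, a3, a4              # dst += a4
            bnez    a2, 0b                  # loop until nothing remains
            ret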
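
The source of `test_memcpy` isn't shown here, so the following is only a sketch of a harness that would print a table of the same shape. Everything in it is an assumption: the function names, the repetition counts, the 256 MiB per-row budget, and the use of `clock_gettime()`. The empty inline `asm` is a GCC/Clang compiler barrier that keeps the copy from being optimized away.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Monotonic clock reading in nanoseconds. */
static double now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1e9 + (double)ts.tv_nsec;
}

/* Average nanoseconds per memcpy() of `size` bytes over `reps` calls. */
static double time_memcpy_ns(void *dst, const void *src, size_t size, long reps)
{
    double start = now_ns();
    for (long i = 0; i < reps; i++) {
        memcpy(dst, src, size);
        /* Compiler barrier: treat memory as read and modified here. */
        __asm__ volatile("" : : "r"(dst) : "memory");
    }
    return (now_ns() - start) / (double)reps;
}

/* Print one row for size 0 and for each power of two up to max_size.
 * Returns the number of rows printed, or -1 if allocation fails. */
static int run_table(size_t max_size)
{
    size_t alloc = max_size ? max_size : 1;
    unsigned char *src = malloc(alloc);
    unsigned char *dst = malloc(alloc);
    if (!src || !dst) {
        free(src);
        free(dst);
        return -1;
    }
    memset(src, 0xA5, alloc);   /* touch every page before timing */
    memset(dst, 0, alloc);

    int rows = 0;
    printf("Byte size :              ns     Speed\n");
    for (size_t size = 0; size <= max_size; size = size ? size * 2 : 1) {
        /* Hypothetical budget: about 256 MiB copied per row, at most 1e6 calls. */
        long reps = size ? (long)(((size_t)256 << 20) / size) : 1000000;
        if (reps > 1000000) reps = 1000000;
        if (reps < 1) reps = 1;

        double ns = time_memcpy_ns(dst, src, size, reps);
        double mib_s = size ? (double)size / (ns * 1e-9) / 1048576.0 : 0.0;
        printf("%13zu : %15.1f %9.1f MiB/s\n", size, ns, mib_s);
        rows++;
    }
    free(src);
    free(dst);
    return rows;
}
```

Calling `run_table((size_t)64 << 20)` from `main()` covers the same 0 to 64 MiB range as the tables. Linking the assembly `memcpy` ahead of libc (or building with `-fno-builtin`) is what makes the harness measure the RVV loop rather than the library version.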