varispeed 5 days ago

So would 40x RPi 5 get 130 tokens/s?

SillyUsername 5 days ago | parent | next [-]

I imagine it would be limited by the number of layers, and at some point you'd also hit diminishing returns caused by network latency.
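
To illustrate the diminishing-returns point, here is a toy model, not a measurement. It assumes per-token compute splits evenly across nodes and that every layer adds one fixed network sync once there is more than one node. The function name and every number (compute_s, layers, sync_s) are made up for illustration.

```python
def tokens_per_second(nodes, compute_s=1.0, layers=32, sync_s=0.001):
    """Toy estimate of throughput for a cluster of `nodes` devices.

    All parameters are hypothetical:
      compute_s - seconds of compute per token on a single node
      layers    - number of model layers, each needing one sync
      sync_s    - network latency per sync, in seconds
    """
    # Compute shrinks as nodes are added; the sync cost does not.
    sync_total = layers * sync_s if nodes > 1 else 0.0
    per_token = compute_s / nodes + sync_total
    return 1.0 / per_token


for n in (1, 2, 4, 8, 16, 32, 64):
    print(n, round(tokens_per_second(n), 2))
```

In this model, throughput can never exceed 1 / (layers * sync_s) no matter how many nodes are added, so each doubling buys less than the one before.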

reilly3000 5 days ago | parent | prev | next [-]

The node count has to be a power of two (2^n), and it's capped at one node per attention head in the model.
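
A quick sketch of what that rule allows, taking it at face value. The helper name and the 32-head example are hypothetical, not taken from any particular model or tool.

```python
def valid_node_counts(num_heads: int) -> list[int]:
    """Powers of two that do not exceed the number of attention heads."""
    counts = []
    n = 1
    while n <= num_heads:
        counts.append(n)
        n *= 2
    return counts


# Hypothetical model with 32 attention heads.
print(valid_node_counts(32))        # [1, 2, 4, 8, 16, 32]
print(40 in valid_node_counts(32))  # False
```

Under that rule 40 nodes would not be a usable count, since 40 is not a power of two.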

VHRanger 5 days ago | parent | prev [-]

Most likely not, because of NUMA bottlenecks.