xaskasdf 9 hours ago
yeah, actually I wanted to see if this was possible at all. I managed to get around 3000 tokens/s on a PS2 with classic transformers, since the Emotion Engine is capable of 32-bit addresses, but it has like 32gb of ram. So I ran into the question of why it was that fast when I couldn't get that speed even with small models, and the deal is that the instructions went straight from memory to the GPU. That's the main difference from when a regular computer does inference: it has to request the instructions from the CPU every time. As I mentioned too, on professional cards you can avoid these problems naturally, since they have instructions precisely for this, but sadly I don't have 30k bucks to spare on a GPU :(
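A rough way to sanity-check a tokens/s figure on bandwidth-bound hardware is weights-read-per-token arithmetic: each generated token has to stream every weight through the core at least once. This is a minimal sketch; the bandwidth and model-size numbers are illustrative assumptions, not measurements from the PS2 project above:

```python
# Back-of-envelope upper bound on decode throughput for a model whose
# weights are streamed from main memory once per generated token:
#   tokens/s <= memory_bandwidth / bytes_read_per_token
# All figures below are assumptions for illustration only.

def tokens_per_second(bandwidth_gb_s: float, model_size_mb: float) -> float:
    """Bandwidth-bound ceiling on tokens/s for a weight-streaming model."""
    bytes_per_second = bandwidth_gb_s * 1e9
    bytes_per_token = model_size_mb * 1e6  # every weight read once per token
    return bytes_per_second / bytes_per_token

# Hypothetical tiny model small enough to fit in a console's main RAM:
print(tokens_per_second(3.2, 10.0))  # 3.2 GB/s over a 10 MB model -> 320.0
```

The point of the estimate is that for very small models the ceiling is set almost entirely by how directly the memory feeds the compute unit, which is consistent with the "straight from memory to the GPU" observation above.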
derstander 8 hours ago
*32MB of RAM (plus 4MB of video RAM and a little sound and IOP memory).
eleventyseven 4 hours ago
> I don't have 30k bucks to spare on a gpu :(

Do you have $2/hr to rent an RTX 6000 96GB, or $5/hr for a B200 180GB, in the cloud?
anoncow 5 hours ago
3000 tokens per second on 32 MB of RAM?