Remix.run Logo
lelanthran 3 hours ago

> any case, loading a gigantic model just to use system RAM is absurdly slow (due to mem bandwidth), like 1-5 t/s, so it's not practical. It'd take a whole day to process one 86k token reques

So don't use it for large requests. Ideal for when you just want to categorise things, for example, "does this task need a shell" or "bucket this email into one of help request, bill due or personal comms".

zozbot234 2 hours ago | parent [-]

The best use is actually for a layer that "almost fits" into VRAM, such that automated offloading to system RAM will be rare enough that it doesn't impact performance.