jazzyjackson 2 hours ago

At 10 tok/s, are you using it interactively, or do you submit a prompt and come back to it later? I always thought it would make sense to just hold conversations over email, asynchronously: the model can take all the time it needs and get back to me when it has an answer.

ls612 an hour ago

10 tok/s is around the borderline for comfortable interactive use. I did the math, and it is mostly bottlenecked by memory bandwidth, so in the future I expect to run a similarly sized model on my 4090 once it gets retired from gaming service and get ~25 tok/s, which will be very usable.
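
For anyone who wants to redo this kind of estimate, here is a rough sketch of the usual back-of-the-envelope model. It assumes decoding is purely memory-bandwidth bound, so each generated token costs one full pass over the weights that have to be read. The function name and the example numbers (1000 GB/s, 40 GB read per token) are illustrative placeholders, not figures from the setup described above, and real throughput will usually land below this bound.

```python
def decode_tokens_per_second(bandwidth_gb_s: float, gb_read_per_token: float) -> float:
    """Rough upper bound on decode speed for a bandwidth-bound model.

    Assumes every generated token requires streaming `gb_read_per_token`
    gigabytes of weights from memory exactly once, and ignores compute,
    KV-cache traffic, and any other overhead.
    """
    if bandwidth_gb_s <= 0 or gb_read_per_token <= 0:
        raise ValueError("both inputs must be positive")
    return bandwidth_gb_s / gb_read_per_token


# Illustrative placeholder values (assumptions, not measurements):
example_bandwidth_gb_s = 1000.0    # memory bandwidth in GB/s
example_gb_read_per_token = 40.0   # weights read per generated token, in GB

estimate = decode_tokens_per_second(example_bandwidth_gb_s, example_gb_read_per_token)
print(f"~{estimate:.0f} tok/s upper bound")
```

Plug in your own card's bandwidth and the size of the weights actually read per token (smaller for quantized or sparse models) to see where you would land.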