6 tokens per second?

Can you put up with that? That seems very slow. I aim for 40t/s on a laptop and choose models that deliver that speed over larger, slower ones.

segmondy 2 hours ago | parent [-]

I have been putting up with it forever. We are spoiled by Mixture-of-Experts models. Folks were delighted to run llama3-70B at such speed. We were happy with 15-20tk/sec with 8b models, and if you could run llama3-405B at 1tk/sec you were a god. To each their own. I can live with 6 high-quality tokens per second. If I could get a Fable-equivalent model, I'd gladly take 2tk/sec if that's what it took to run it locally.

manmal an hour ago | parent | next [-]

But what is it doing for you that you couldn’t do yourself at that speed? I’m really curious and on the fence about going partly local.

	all2 an hour ago | parent [-]
		I think you would use it more like email and less like text messages, so the domain of communication shifts drastically. The other part is that you don't have to run just that model; you can offload a lot of chores to smaller models.

froh 2 hours ago | parent | prev [-]

do you use caveman or similar?