I have been putting up with it forever. We are spoiled by Mixture-of-Experts models. Folks were delighted to run llama3-70B at such speed. We were happy with 15-20 tk/sec with 8B models, and if you could run llama3-405B at 1 tk/sec you were a god. To each their own. I can live with 6 high-quality tokens a second. If I could get a Fable-equivalent model, I'd gladly take 2 tk/sec if that's what it took to run it locally.

manmal 2 hours ago

But what is it doing for you that you couldn't do yourself at that speed? I'm really curious and on the fence about going partly local.

	all2 an hour ago
		I think you would use it more like email and less like text messages, so the domain of communication shifts drastically. The other part is that you don't have to run just that one model; you can offload a lot of chores to smaller models.
	Mashimo 4 minutes ago
		Run one task, while you do another? Or while you sleep / eat / rave?

froh 2 hours ago

Do you use caveman or something similar?