umairnadeem123 5 hours ago

0.2 tok/s is slow for chat but perfectly fine for batch/async workloads. I run automated content generation pipelines where a single job kicks off dozens of LLM calls (script generation, metadata, descriptions) and none of them need to be interactive. The whole job takes 20 minutes anyway because of image generation bottlenecks. Being able to run a 70B model locally for those batch calls instead of paying per-token API costs would be a significant cost reduction, even at this speed.
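The orchestration side is simple. A minimal sketch of what those batch calls look like against a local OpenAI-compatible server (the endpoint, model name, and prompts below are placeholders, not my actual pipeline):

    import requests

    # Hypothetical local server (llama.cpp / Ollama style, OpenAI-compatible API).
    # Endpoint, model name, and prompts are placeholders.
    ENDPOINT = "http://localhost:8080/v1/chat/completions"
    MODEL = "llama-3.1-70b-instruct"

    def generate(prompt: str) -> str:
        # One blocking completion call; latency is irrelevant for batch jobs.
        resp = requests.post(
            ENDPOINT,
            json={
                "model": MODEL,
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": 512,
            },
            timeout=3600,  # at ~0.2 tok/s a long reply can take most of an hour
        )
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]

    # The LLM calls just run back to back; image generation is the real bottleneck.
    tasks = {
        "script": "Write a 60-second video script about ...",
        "metadata": "Generate a title and tags for ...",
        "description": "Write a description for ...",
    }
    results = {name: generate(prompt) for name, prompt in tasks.items()}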

esquire_900 5 hours ago | parent | next [-]

Cost-wise it does not seem very effective. 0.5 tokens/sec (the optimized setup) is 3600 tokens an hour, which costs about 200-300 watts for an active 3090 plus system. Running 3600 tokens on OpenRouter at $0.40 per million tokens for Llama 3.1 (3.3 costs less) is about $0.00144. That money buys you about 2-3 watts (in the Netherlands).

Great achievement for privacy-preserving inference nonetheless.

teo_zero 3 hours ago | parent | next [-]

I think we use different units. In my system there are 3600 seconds per hour, and watts measure power.

IsTom 17 minutes ago | parent [-]

OP probably means watt-hours.
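Redone with consistent units (a rough sketch; the $0.40 per million output tokens and 0.35 EUR/kWh figures are assumptions, not numbers from the article):

    # Back-of-envelope: local 3090 inference vs. OpenRouter, with consistent units.
    tok_per_s = 0.5                   # optimized local rate from the parent comment
    system_power_w = 275              # midpoint of the 200-300 W estimate
    electricity_eur_per_kwh = 0.35    # assumed Dutch household rate
    api_usd_per_mtok = 0.40           # assumed OpenRouter price per million output tokens

    tokens_per_hour = tok_per_s * 3600                      # 1800 tokens/h
    local_cost_per_hour = system_power_w / 1000 * electricity_eur_per_kwh
    api_cost_same_tokens = tokens_per_hour / 1e6 * api_usd_per_mtok

    print(f"local electricity: ~{local_cost_per_hour:.3f} EUR/h for {tokens_per_hour:.0f} tokens")
    print(f"API price:         ~{api_cost_same_tokens:.5f} USD for the same tokens")
    # ~0.096 EUR locally vs. ~0.00072 USD via the API: roughly two orders of
    # magnitude apart, which is the parent's point once the units line up.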

Aerroon 4 hours ago | parent | prev [-]

Something to consider is that input tokens have a cost too. They are typically processed much faster than output tokens. If you have long conversations then input tokens will end up being a significant part of the cost.

It probably won't matter much here though.
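To make that concrete, a toy sketch of why input tokens dominate long chats (the whole history is resent every turn; all numbers are made up):

    # In a chat, the full history is resent each turn, so input tokens grow
    # roughly quadratically while output tokens grow linearly. Made-up numbers.
    tokens_in_per_turn = 200    # new user message each turn
    tokens_out_per_turn = 300   # assistant reply each turn
    turns = 20

    history = total_input = total_output = 0
    for _ in range(turns):
        history += tokens_in_per_turn       # new user message joins the context
        total_input += history              # entire history is sent as input
        total_output += tokens_out_per_turn
        history += tokens_out_per_turn      # the reply also becomes context

    print(total_input, total_output)        # 99000 vs 6000 with these numbers
    # Even at a lower per-token price, input can dominate the bill on long chats.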

eleventyseven 4 hours ago | parent | prev [-]

Are you taking into account energy costs of running a 3090 at 350 watts for a very long time?
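To put a rough number on it (the 0.35 EUR/kWh electricity price is an assumption):

    # Energy per million output tokens at the speeds and power draw discussed here.
    tok_per_s = 0.2            # unoptimized rate from the top comment
    power_w = 350              # full-power 3090 plus system
    eur_per_kwh = 0.35         # assumed electricity price

    hours_per_mtok = 1_000_000 / tok_per_s / 3600      # ~1389 hours (about 58 days)
    kwh_per_mtok = power_w / 1000 * hours_per_mtok     # ~486 kWh
    print(f"~{kwh_per_mtok:.0f} kWh, ~{kwh_per_mtok * eur_per_kwh:.0f} EUR per million tokens")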

ekianjo an hour ago | parent [-]

You can power-limit an RTX 3090 to 250 W with nvidia-smi and still keep most of its performance.
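A minimal sketch of setting that from a script (the -pl flag normally needs root, and the limit resets on reboot):

    import subprocess

    # Cap the card's board power limit with nvidia-smi (requires admin rights).
    GPU_INDEX = "0"       # adjust if the 3090 is not GPU 0
    LIMIT_WATTS = "250"

    subprocess.run(["nvidia-smi", "-i", GPU_INDEX, "-pl", LIMIT_WATTS], check=True)

    # Show current, default, and max power limits afterwards.
    subprocess.run(["nvidia-smi", "-i", GPU_INDEX, "-q", "-d", "POWER"], check=True)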