Cost-wise it does not seem very effective. 0.5 tokens/sec (the optimized one) is 3600 tokens an hour, which costs about 200-300 watts for an active 3090 + system. Running 3600 tokens on OpenRouter at $0.40 per million tokens for Llama 3.1 (3.3 costs less) is about $0.00144. That money buys you about 2-3 watts (in the Netherlands).

Great achievement for private inference nonetheless.

teo_zero 3 hours ago

I think we use different units. In my system there are 3600 seconds per hour, and watts measure power.

	IsTom 17 minutes ago
		OP probably means watt-hours.
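
Reading the figures as energy (watt-hours) rather than power, the comparison can be sketched roughly as below. The 0.5 tokens/sec rate, the 200-300 W draw (250 W taken as a midpoint), and the $0.40 price (assumed to be per million tokens) come from the comment above; the electricity price is a hypothetical placeholder, not a quoted tariff.

```python
SECONDS_PER_HOUR = 3600


def local_energy_wh(tokens, tokens_per_sec, power_watts):
    """Energy in watt-hours needed to generate `tokens` locally."""
    hours = tokens / tokens_per_sec / SECONDS_PER_HOUR
    return power_watts * hours


def api_cost_usd(tokens, usd_per_million_tokens):
    """Cost in USD of generating `tokens` through a hosted API."""
    return tokens * usd_per_million_tokens / 1_000_000


# 0.5 tokens/sec for one hour is 1800 tokens (not 3600).
tokens_per_hour = 0.5 * SECONDS_PER_HOUR

# One hour at an assumed 250 W midpoint is 250 Wh of energy.
energy_wh = local_energy_wh(tokens_per_hour, 0.5, 250)

# The same 1800 tokens at $0.40 per million tokens.
api_usd = api_cost_usd(tokens_per_hour, 0.40)

# Hypothetical electricity price; substitute your own tariff.
ELECTRICITY_USD_PER_KWH = 0.30
local_usd = energy_wh / 1000 * ELECTRICITY_USD_PER_KWH

print(f"{tokens_per_hour:.0f} tokens/hour, {energy_wh:.0f} Wh locally")
print(f"API: ${api_usd:.5f}  local electricity: ${local_usd:.3f}")
```

Under these assumptions the local run costs on the order of a hundred times more in electricity than the API call, which supports the original point even after the units are corrected.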

Aerroon 4 hours ago

Something to consider is that input tokens have a cost too. They are typically processed much faster than output tokens. If you have long conversations then input tokens will end up being a significant part of the cost.

It probably won't matter much here though.
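
A minimal sketch of why input tokens add up in long conversations, assuming each turn re-sends the full history as input (as stateless chat APIs generally require). The function name, turn sizes, and prices are hypothetical and chosen only for illustration.

```python
def conversation_cost_usd(turns, user_tokens, reply_tokens,
                          usd_per_m_input, usd_per_m_output):
    """Return (total input tokens, total output tokens, cost in USD)."""
    history = 0      # tokens accumulated from earlier turns
    total_in = 0
    total_out = 0
    for _ in range(turns):
        prompt = history + user_tokens   # history is re-sent every turn
        total_in += prompt
        total_out += reply_tokens
        history = prompt + reply_tokens  # the reply joins the history
    cost = (total_in * usd_per_m_input
            + total_out * usd_per_m_output) / 1_000_000
    return total_in, total_out, cost


# Hypothetical example: 10 turns, 100-token messages, 200-token replies.
total_in, total_out, cost = conversation_cost_usd(10, 100, 200, 0.40, 0.40)
print(total_in, total_out, round(cost, 6))
```

In this example the input side grows quadratically with the number of turns (14,500 input tokens against 2,000 output tokens), so it dominates the bill even when both are priced the same.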