vova_hn2 10 hours ago

1. Is the given tok/s estimate the total node throughput, or what you can realistically expect to get? Or is it the worst-case throughput if everyone starts using it simultaneously?

2. What if I try to hog all resources of a node by running some large data processing and making multiple queries in parallel? What if I try to resell the access by charging per token?

Edit: sorry if this comment sounds overly critical. I think that pooling money with other developers to collectively rent a server for LLM inference is a really cool idea. I also thought about it, but I never found a satisfactory answer to my second question, so I decided it was infeasible in practice.

jrandolf 10 hours ago | parent [-]

1. It's an average. 2. We have a sophisticated rate limiter.
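The thread doesn't describe how the rate limiter works, but the classic way to stop one user from hogging a shared node (the parent's question 2) is a per-user token bucket: each user can burst up to a capacity, then is throttled to a steady refill rate. A minimal sketch, assuming nothing about the actual implementation (class and parameter names here are hypothetical):

```python
import time


class TokenBucket:
    """Per-user token bucket: allows bursts up to `capacity` requests,
    then throttles to `rate` requests per second as the bucket refills."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # refill rate, tokens per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity    # start full
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# One bucket per user: a burst of 10 requests passes, the 11th is rejected
# until the bucket refills.
bucket = TokenBucket(rate=5.0, capacity=10.0)
results = [bucket.allow() for _ in range(11)]
```

For LLM inference, `cost` would more plausibly be the token count of each request rather than a flat 1.0, so that a few huge prompts count the same as many small ones; reselling access would then just drain the reseller's own bucket.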

poly2it 9 hours ago | parent [-]

Does it take user time zones into account?

jrandolf 9 hours ago | parent [-]

Yes