yieldcrv 5 days ago

the local side of things with a $7,000 - $10,000 machine (512GB of fast memory, plus fast CPU and disk) can almost reach parity for text input, text output, and 'reasoning', but it lags far behind for anything multimodal: audio input, voice output, image input, image output, document input.

there are also no out-of-the-box solutions for running a fleet of models simultaneously or containerized

so the closed-source solutions in the cloud are light years ahead, and it's been this way for 15 months now, with no signs of stopping

omneity 5 days ago | parent

Would running vLLM in docker work for you, or do you have other requirements?

yieldcrv 5 days ago | parent

it's not an image and audio model, so I believe it wouldn't work for me by itself

I would probably need multiple models running in distinct containers, with another process coordinating them
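
The coordinating process described here could be sketched as a simple modality router sitting in front of several per-model containers. This is a hypothetical sketch, assuming each container (e.g. a vLLM instance for text, a separate vision-language or speech server) exposes an OpenAI-compatible HTTP endpoint; the ports and modality names below are illustrative, not a real deployment:

```python
# Hypothetical coordinator: inspect an incoming request and pick which
# model container should handle it. Assumes each container exposes an
# OpenAI-compatible API on its own port (ports are made up for the sketch).

MODEL_ENDPOINTS = {
    "text": "http://localhost:8001/v1",   # e.g. a text LLM served by vLLM
    "image": "http://localhost:8002/v1",  # e.g. a vision-language model
    "audio": "http://localhost:8003/v1",  # e.g. a speech model
}

def route(request: dict) -> str:
    """Return the backend endpoint based on the request's modality.

    Audio takes priority over images, which take priority over plain text;
    a real coordinator would also handle mixed-modality requests, retries,
    and streaming the response back to the caller.
    """
    if request.get("audio"):
        return MODEL_ENDPOINTS["audio"]
    if request.get("images"):
        return MODEL_ENDPOINTS["image"]
    return MODEL_ENDPOINTS["text"]
```

The actual forwarding would then be a plain HTTP POST to the chosen endpoint, which is what makes the containers swappable: any server speaking the same API shape can slot in behind the router.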