djhworld 3 days ago

I watched the video and enjoyed it. The most interesting part to me was running distributed llama.cpp; Jeff mentioned it seems to work in a linear fashion, where processing hops from node to node.
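
(A rough sketch of why that looks linear: with a pipeline-style split like distributed llama.cpp does, each node owns a slice of the model's layers, so a single token's forward pass has to visit the nodes in order. This is a toy numpy illustration of the idea, not llama.cpp's actual code; the node counts and layer sizes are made up.)

    # Toy sketch of pipeline-style splitting: each "node" holds a slice of the
    # layers, so one token's forward pass hops node -> node -> node in order.
    # Not llama.cpp's actual implementation; sizes are invented for illustration.
    import numpy as np

    HIDDEN, LAYERS_PER_NODE, NUM_NODES = 8, 4, 3
    rng = np.random.default_rng(0)

    class Node:
        def __init__(self, name):
            self.name = name
            # This node's private slice of the model weights.
            self.weights = [rng.standard_normal((HIDDEN, HIDDEN)) * 0.1
                            for _ in range(LAYERS_PER_NODE)]

        def forward(self, x):
            # Run only the layers this node owns, then pass activations along.
            for w in self.weights:
                x = np.tanh(x @ w)
            return x

    nodes = [Node(f"node{i}") for i in range(NUM_NODES)]
    x = rng.standard_normal(HIDDEN)     # the current token's embedding
    for node in nodes:                  # the "hop": activations cross the network
        x = node.forward(x)             # only one node is busy at a time
        print(node.name, "done; sending activations to the next node")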

Which got me thinking: how do these frontier AI models work when you (as a user) run a query? Does your query just go to one big box with lots of GPUs attached and run in a similar way, but much faster? Do these AI companies write about how their infra works?

geerlingguy 3 days ago | parent

ServeTheHome has a few videos covering AI servers and interconnects.

And yes, they basically have 1 Tbps+ interconnects and throw tens or hundreds of GPUs at queries. Nvidia was wise to invest so much in their networking side: massive bandwidth between machines plus shared memory means they can run huge models across tons of cards with minimal latency.

It's still not as good as tons of GPUs attached to tons of memory in _one_ machine, but it's far better than the 10, 25, or 40 Gbps networking most small homelabs would run.
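
(To make the bandwidth point concrete: in a tensor-parallel split, each GPU holds a shard of every layer's weights and computes a partial result, and those partials have to be exchanged over the interconnect after every layer, for every token. A toy numpy sketch under those assumptions, with plain arrays standing in for GPUs:)

    # Toy tensor-parallel sketch: numpy arrays stand in for GPUs. Each "GPU" owns
    # a column slice of one layer's weights; after computing its partial output,
    # the slices must be gathered over the interconnect before the next layer can
    # run. That per-layer, per-token exchange is why link bandwidth matters.
    import numpy as np

    HIDDEN, NUM_GPUS = 8, 4
    SHARD = HIDDEN // NUM_GPUS
    rng = np.random.default_rng(0)

    full_weight = rng.standard_normal((HIDDEN, HIDDEN)) * 0.1
    # Column-wise sharding: GPU i owns columns [i*SHARD, (i+1)*SHARD).
    gpu_shards = [full_weight[:, i*SHARD:(i+1)*SHARD] for i in range(NUM_GPUS)]

    x = rng.standard_normal(HIDDEN)
    partials = [x @ w for w in gpu_shards]    # each GPU computes its slice in parallel
    gathered = np.concatenate(partials)       # all-gather across the interconnect

    assert np.allclose(gathered, x @ full_weight)  # same result as the unsharded layer
    print("bytes exchanged for this one layer:", gathered.nbytes)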