spankalee 3 days ago
I'm curious... I'm not a database or AI engineer, and the last time I did GPU work was over a decade ago. What is the point of the "saturate an H100" metric?

I would think that a GPU isn't just sitting there waiting on a process that's in turn waiting for one query to finish before starting the next. Instead, a bunch of parallel queries and scans would be running, fed from many DB and object store servers, keeping the GPUs as utilized as possible. Given how expensive GPUs are, it would seem like a good trade to buy more servers to keep them fed, even if you also want to make the servers and DB/object store reads faster.
otterley 3 days ago | parent
The idea is that in a pipeline of work, throughput is limited by the slowest component. H100 GPUs have enormous memory bandwidth, so the question becomes how to eliminate the bottlenecks between the data store and the GPU's memory.

First is the storage bottleneck: network-attached storage is usually the limit for uncached data. Then there is the CPU work of decoding data. Spiral claims that their table format is ready to load by the GPU, so they can bypass various CPU-bound decoding stages. Once you eliminate the storage and CPU bottlenecks, the remaining bottleneck is usually the PCIe bus between host memory and the GPU, and they can't solve that themselves. (And no amount of parallelization can help once the bus is saturated.) What they can do is use the network, the host bus, and the GPU more efficiently by compressing and packing data with greater mechanical sympathy.

They've left unanswered how they're going to commercialize it. My guess is that they'll maintain a proprietary fork of Vortex with extra performance or features, or offer commercial services or integrations that make it easier to use. The open-source release gives customers a Reason to Believe, in marketing parlance.
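The "slowest component" point can be sketched as a back-of-the-envelope model: end-to-end throughput is the minimum over the per-stage throughputs. The numbers below are illustrative guesses, not measurements of any real system.

```python
# Toy pipeline model: end-to-end throughput = min of per-stage throughputs.
# All figures in GB/s are illustrative assumptions, not benchmarks.
stages_gbps = {
    "object_store_network": 12.5,   # ~100 Gb/s NIC
    "cpu_decode": 5.0,              # hypothetical CPU decode rate
    "pcie_x16": 63.0,               # roughly PCIe 5.0 x16
    "gpu_hbm": 3350.0,              # H100 SXM HBM3 class bandwidth
}

bottleneck = min(stages_gbps, key=stages_gbps.get)
print(f"bottleneck: {bottleneck} at {stages_gbps[bottleneck]} GB/s")

# Eliminating CPU decode (Spiral's claim) shifts the bottleneck to the
# next-slowest stage -- here, the network link to the object store.
without_decode = {k: v for k, v in stages_gbps.items() if k != "cpu_decode"}
print("next bottleneck:", min(without_decode, key=without_decode.get))
```

With these made-up numbers, removing the decode stage moves the limit from 5 GB/s to 12.5 GB/s; the GPU's memory bandwidth never comes close to being the constraint.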
vouwfietsman 3 days ago | parent
My guess is that the raw data size, combined with the physical limitations of your rack unit, makes it hard to keep the GPU fully utilized: you will always be stuck on the CPU (decompressing/interpreting/uploading Parquet) or on bandwidth (transfer from S3) as the bottleneck.

It seems they are targeting a low-to-no-overhead path from S3 bucket to GPU: the same compression ratio with faster random access, streamed decoding of S3 data while it's in flight, and zero-copy transfer to the GPU.

I'm not 100% clear on the details, but I doubt they can actually saturate the CPU/GPU bus. More likely they just saturate GPU utilization, which itself depends on multiple possible bottlenecks but generally not on bus bandwidth. That's not criticism: it literally means you can't do better without improving the GPU utilization of your AI model.
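The "streamed while in flight" idea amounts to pipelining: overlapping fetch, decode, and upload so that steady-state cost per chunk is the slowest stage rather than the sum of all stages. A toy timing model (per-chunk stage times are made up):

```python
# Serial vs. pipelined (overlapped) movement of chunks from object store
# to GPU. Per-chunk stage times in seconds are illustrative assumptions.
fetch_s, decode_s, upload_s = 1.0, 0.8, 0.2
chunks = 10

# Serial: every chunk pays every stage back-to-back.
serial = chunks * (fetch_s + decode_s + upload_s)

# Pipelined: the first chunk pays full latency to fill the pipeline;
# after that, one chunk completes per "slowest stage" interval.
slowest = max(fetch_s, decode_s, upload_s)
pipelined = (fetch_s + decode_s + upload_s) + (chunks - 1) * slowest

print(f"serial:    {serial:.1f}s")
print(f"pipelined: {pipelined:.1f}s (bounded by slowest stage: {slowest}s/chunk)")
```

With these numbers the serial path takes 20s and the overlapped path 11s; shrinking or eliminating the decode stage then leaves fetch bandwidth as the floor, which matches the comment's point that the bus itself is rarely what you end up saturating.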