SillyUsername · 5 hours ago
Also, cheaper... X99 + 8x DDR4 + E5-2696 v4 + 4x Tesla P4s running on llama.cpp. Total cost about $500 including case and a 650W PSU, excluding RAM. Power draw is about 200W typical and 550W peak (everything slammed, though I've never actually seen that; I have an AC power monitor on the socket).

GLM 4.5 Air (60GB Q3-XL), properly tuned, runs at 8.5-10 tokens/second with an 8K context. Throw in a P100 as well and you'll see 11-12.5 t/s (still tuning that one). Performance doesn't drop as much for larger models because inter-GPU communication and DDR4-2400 bandwidth are the limiters, not the GPUs. I've been running this with 96GB of RAM in quad-channel and recently upgraded to 128GB.
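The reported throughput is consistent with a memory-bandwidth ceiling: quad-channel DDR4-2400 gives roughly 4 x 19.2 ≈ 77 GB/s, and GLM 4.5 Air's ~12B active parameters at Q3 (~0.44 bytes/weight) mean roughly 5 GB read per generated token, i.e. a ceiling of ~14-15 t/s before overheads. For anyone wanting to script a similar hybrid CPU/GPU setup, here's a minimal sketch using the llama-cpp-python bindings (the commenter runs llama.cpp directly); the model filename, offload count, and split ratios are illustrative assumptions, not their actual configuration:

    # Minimal sketch with llama-cpp-python (pip install llama-cpp-python,
    # built with CUDA support). Paths and numbers are assumptions, not the
    # commenter's real config.
    from llama_cpp import Llama

    llm = Llama(
        model_path="GLM-4.5-Air-Q3_K_XL.gguf",  # hypothetical filename for the 60GB Q3-XL quant
        n_gpu_layers=24,            # partial offload: 4x P4 is only 32GB VRAM, the rest stays in DDR4
        tensor_split=[1, 1, 1, 1],  # spread the offloaded weights evenly across the four P4s
        n_ctx=8192,                 # the 8K context mentioned above
        n_threads=16,               # leave headroom on the 22-core E5-2696 v4
    )

    out = llm("Explain the Tesla P4 in one sentence.", max_tokens=64)
    print(out["choices"][0]["text"])

tensor_split mirrors llama.cpp's --tensor-split flag; in practice you'd tune n_gpu_layers upward until VRAM is nearly full, since every layer moved off DDR4 raises the effective bandwidth ceiling.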
Aurornis · 4 hours ago
> Also, cheaper... X99 + 8x DDR4 + E5-2696 v4 + 4x Tesla P4s running on llama.cpp. Total cost about $500 including case and a 650W PSU, excluding RAM.

Excluding RAM from your pricing is misleading right now. That's a lot of work and money just to get 10 tokens/sec.