| ▲ | brightball 4 hours ago | |||||||
What’s your setup? | ||||||||
| ▲ | adrian_b 3 hours ago | parent [-] | |||||||
For now, the most suitable computer that I have for running LLMs is an Epyc server with 128 GB DRAM and 2 AMD GPUs with 16 GB of HBM memory each. I have a few other computers with 64 GB DRAM each and with NVIDIA, Intel or AMD GPUs. Fortunately all that memory has been bought long ago, because today I could not afford to buy extra memory. However, a very short time ago, i.e. the previous week, I have started to work at modifying llama.cpp to allow an optimized execution with weights stored in SSDs, e.g. by using a couple of PCIe 5.0 SSDs, in order to be able to use bigger models than those that can fit inside 128 GB, which is the limit to what I have tested until now. By coincidence, this week there have been a few threads on HN that have reported similar work for running locally big models with weights stored in SSDs, so I believe that this will become more common in the near future. The speeds previously achieved for running from SSDs hover around values from a token at a few seconds to a few tokens per second. While such speeds would be low for a chat application, they can be adequate for a coding assistant, if the improved code that is generated compensates the lower speed. | ||||||||
| ||||||||