mechagodzilla | 4 hours ago
I've been running the 'frontier' open-weight LLMs (mainly DeepSeek R1/V3) at home, and I find that they're best for asynchronous interactions. Give it a prompt and come back in 30-45 minutes to read the response. I've been running on a dual-socket 36-core Xeon with 768GB of RAM and it typically gets 1-2 tokens/sec. Great for research questions or coding prompts, not great for text auto-complete while programming.
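(At the quoted 1-2 tokens/sec, the 30-45 minute turnaround works out to responses of a few thousand tokens. A quick back-of-envelope sketch; the response lengths below are illustrative assumptions, not figures from the comment:)

    # Back-of-envelope: response latency at CPU-only decode speed.
    # The 1-2 tok/s range comes from the comment above; the response
    # lengths are assumed examples of a long-form answer.
    def minutes_for(tokens, tokens_per_sec):
        return tokens / tokens_per_sec / 60

    for tokens in (2000, 5000):        # assumed answer sizes
        for tps in (1.0, 2.0):         # quoted throughput range
            print(f"{tokens} tok at {tps} tok/s -> ~{minutes_for(tokens, tps):.0f} min")
    # e.g. 2000 tok at 1 tok/s -> ~33 min; 5000 tok at 2 tok/s -> ~42 min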
|
tyre | 4 hours ago
Given the cost of the system, how long would it take to be less expensive than, for example, a $200/mo Claude Max subscription running Opus?

Workaccount2 | an hour ago
Never; local models are for hobby use and (extreme) privacy concerns. A less paranoid and much more economical approach would be to just lease a server and run the models on that.

mechagodzilla | 3 hours ago
It's not really an apples-to-apples comparison: I enjoy playing around with LLMs, running different models, etc., and I place a relatively high premium on privacy. The computer itself was $2k about two years ago (and my employer reimbursed me for it), and 99% of my usage is research questions, which have relatively high output per input token. Using one as a coding assistant seems like it would run through a very high number of tokens with relatively few of them actually being used for anything. If I wanted a real-time coding assistant, I would probably be using something that fits in 24GB of VRAM, with very different cost/performance tradeoffs.
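(For comparison with the subscription mentioned upthread, the break-even is simple arithmetic. The sketch below uses the $2k and $200/mo figures from the thread; the electricity cost is an assumption, and the employer reimbursement is ignored:)

    # Break-even of a one-time hardware purchase vs. a monthly subscription.
    # $2k and $200/mo are from the thread; the power figure is an assumption.
    hardware_cost = 2000    # USD, one-time
    subscription  = 200     # USD per month (Claude Max tier cited above)
    power         = 30      # USD per month, assumed for an intermittently loaded box
    months = hardware_cost / (subscription - power)
    print(f"Break-even after ~{months:.1f} months")   # ~11.8 months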
|