diggan 3 days ago
> Not to mention, industry consensus is that the "smallest good" models start out at 70-120 billion parameters. At a 64k token window, that easily gets into the 80+ gigabyte of video memory range, which is completely unsustainable for individuals to host themselves.

Worth a tiny addendum: GPT-OSS-120b (at mxfp4 with a 131,072-token context) lands at about ~65GB of VRAM, which is still large but at least less than 80GB. With 2x 32GB GPUs (like the R9700, ~1300 USD each) and a slightly smaller context (or KV-cache quantization), I feel like you could fit it, and it becomes a bit more obtainable for individuals. 120b with reasoning_effort set to high is quite good as far as I've tested it, and blazing fast too.
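Rough back-of-the-envelope numbers, purely as a sketch (the parameter count, bits-per-weight, and attention geometry below are assumptions, not official figures):

    # Rough VRAM estimate; every constant here is an assumption for illustration.
    params = 117e9                           # ~117B parameters (GPT-OSS-120b class)
    bits_per_weight = 4.25                   # mxfp4: 4-bit values plus per-block scales
    weights_gb = params * bits_per_weight / 8 / 1e9
    print(f"weights: ~{weights_gb:.0f} GB")  # ~62 GB

    # KV cache: 2 (K and V) * layers * kv_heads * head_dim * bytes/elem * tokens
    layers, kv_heads, head_dim = 36, 8, 64   # assumed attention geometry
    bytes_per_elem = 2                       # fp16 cache; roughly halved by q8_0 KV quantization
    tokens = 131_072
    kv_gb = 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens / 1e9
    print(f"KV cache @ 131k ctx: ~{kv_gb:.1f} GB")  # ~9.7 GB worst case; a smaller context
                                                     # or KV quantization cuts this down

That's how you land in the mid-60s of GB total, and why trimming the context or quantizing the KV cache is what lets it squeeze into 2x 32GB.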
xena 3 days ago
For what it's worth, I probably should have used "consumers" there. I'll edit it later. | ||||||||||||||
refulgentis 3 days ago
I have to wonder if it's missing the forest for the trees: do you perceive GPT-OSS-120b as an emotionally warm model? (FWIW, this reply sits beneath your comment but isn't necessarily aimed at you; the quoted section jumps over that question too, going straight from "5 isn't warm" to "4o non-reasoning is" to the math on self-hosting a reasoning model.)

Additionally, to the author: I've maintained a llama.cpp-based app on several platforms for a couple of years now, and I'm not sure how to arrive at 4096 tokens = 3 GB; it's off by an order of magnitude AFAICT.
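For reference, a minimal sketch of the usual KV-cache arithmetic; the layer/head numbers are assumed (roughly an 8B-class model), not taken from the article, so treat it as the formula rather than a measurement:

    # KV-cache size at a 4096-token window; all constants assumed for illustration.
    layers, kv_heads, head_dim = 32, 8, 128
    bytes_per_elem = 2                       # fp16 K and V
    tokens = 4096
    kv_bytes = 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens
    print(f"{kv_bytes / 1e9:.2f} GB")        # ~0.54 GB -- well under 3 GB, even before
                                             # any KV quantization

The exact number depends on the model's attention geometry, but it's hard to reach 3 GB for 4096 tokens with anything in the sizes being discussed.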