diggan 3 days ago
> Not to mention, industry consensus is that the "smallest good" models start out at 70-120 billion parameters. At a 64k token window, that easily gets into the 80+ gigabyte of video memory range, which is completely unsustainable for individuals to host themselves.

Worth a tiny addendum: GPT-OSS-120b (at mxfp4 with a 131,072-token context) lands at about ~65GB of VRAM, which is still large but at least less than 80GB. With 2x 32GB GPUs (like the R9700, ~1300 USD each) and a slightly smaller context (or KV-cache quantization), I feel like you could fit it, and it becomes a bit more obtainable for individuals. 120b with reasoning_effort set to high is quite good as far as I've tested it, and blazing fast too.
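Rough back-of-the-envelope numbers, purely as a sketch (the parameter count, bits-per-weight, and attention geometry below are assumptions, not official figures):

    # Rough VRAM estimate; every constant here is an assumption for illustration.
    params = 117e9                           # ~117B parameters (GPT-OSS-120b class)
    bits_per_weight = 4.25                   # mxfp4: 4-bit values plus per-block scales
    weights_gb = params * bits_per_weight / 8 / 1e9
    print(f"weights: ~{weights_gb:.0f} GB")  # ~62 GB

    # KV cache: 2 (K and V) * layers * kv_heads * head_dim * bytes/elem * tokens
    layers, kv_heads, head_dim = 36, 8, 64   # assumed attention geometry
    bytes_per_elem = 2                       # fp16 cache; roughly halved by q8_0 KV quantization
    tokens = 131_072
    kv_gb = 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens / 1e9
    print(f"KV cache @ 131k ctx: ~{kv_gb:.1f} GB")  # ~9.7 GB worst case; a smaller context
                                                     # or KV quantization cuts this down

That's how you land in the mid-60s of GB total, and why trimming the context or quantizing the KV cache is what lets it squeeze into 2x 32GB.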
xena 3 days ago
For what it's worth, I probably should have used "consumers" there. I'll edit it later. | ||||||||||||||
refulgentis 3 days ago
I have to wonder if it's missing the forest for the trees: do you perceive GPT-OSS-120b as an emotionally warm model? (FWIW, this reply sits beneath your comment but isn't necessarily aimed at you; the quoted section jumps over that question too, going straight from "5 isn't warm" to "4o non-reasoning is" to the math on self-hosting a reasoning model.)

Additionally, to the author: I've maintained a llama.cpp-based app on several platforms for a couple of years now, and I'm not sure how to arrive at 4096 tokens = 3 GB; it's off by an order of magnitude AFAICT.
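For reference, a minimal sketch of the usual KV-cache arithmetic; the layer/head numbers are assumed (roughly an 8B-class model), not taken from the article, so treat it as the formula rather than a measurement:

    # KV-cache size at a 4096-token window; all constants assumed for illustration.
    layers, kv_heads, head_dim = 32, 8, 128
    bytes_per_elem = 2                       # fp16 K and V
    tokens = 4096
    kv_bytes = 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens
    print(f"{kv_bytes / 1e9:.2f} GB")        # ~0.54 GB -- well under 3 GB, even before
                                             # any KV quantization

The exact number depends on the model's attention geometry, but it's hard to reach 3 GB for 4096 tokens with anything in the sizes being discussed.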