| ▲ | nowittyusername 2 hours ago |
There are a LOT of misconceptions about LLMs, and the biggest one is that they are not deterministic. They are in fact 100% deterministic, and temperature has nothing to do with it. You WILL get exactly the same result every single time, at ANY temperature, as long as you use the same sampling parameters and the same server config. What causes variance in LLMs is server-side behavior like batch processing and caching, among a few other things, with batching responsible for most of it. The reason that flag is used is that large providers serve multiple customers per GPU, and splitting up the VRAM is tricky and causes drift. If you start llama.cpp, for example, with only one person per slot and batching off, you will get the same result every time, even at temperature 1.2 or whatever other parameters, because each inference call has the GPU to itself, so no funny business there. Most people are unaware of this because they only have experience with the API rather than the actual inference engine itself, so this damned myth keeps spreading. My video for reference, where you can download everything and try it for yourself: https://www.youtube.com/watch?v=EyE5BrUut2o
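A quick sketch of the point about temperature (my own NumPy toy, not the commenter's code): temperature only reshapes the softmax distribution, and the draw from that distribution is fully deterministic once the RNG seed is fixed. Same logits, same temperature, same seed, same token, every run.

```python
import numpy as np

def sample_token(logits, temperature, seed):
    """Temperature sampling: scale logits, softmax, then draw one
    token index from the resulting distribution with a seeded RNG."""
    rng = np.random.default_rng(seed)
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled -= scaled.max()          # subtract max for numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

logits = [2.0, 1.0, 0.5, -1.0]      # made-up logits for a 4-token vocab
draws = [sample_token(logits, temperature=1.2, seed=42) for _ in range(5)]
print(draws)                        # five identical draws: determinism survives temperature
assert len(set(draws)) == 1
```

The randomness that temperature sampling needs comes from a pseudo-random generator, which is itself a deterministic function of its seed, so "sampling" and "deterministic" are not in conflict.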
| ▲ | maplethorpe an hour ago | parent |
Thanks so much for this! I still haven't gotten around to building my own language model, so I'm a bit fuzzy on the details, but in a thought experiment where I did all the math by hand on paper, I couldn't see how I would end up with a different output each time given the same inputs. Finding out that the variance other people see comes from the server/hardware side clears that up. This is a surprisingly annoying question to Google: a lot of articles give the reason that softmax returns a probability distribution, as if the mere presence of the word "probability" means the tokens will differ every time.
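Where the server/hardware variance actually sneaks in, as I understand the parent comment: batched execution can change the order in which floating-point reductions happen, and floating-point addition is not associative, so the same numbers summed in a different grouping can land on slightly different logits. A minimal, assumption-laden illustration of just the non-associativity part:

```python
# Floating-point addition is not associative: summing the same three
# numbers with a different grouping yields different results.
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c    # one reduction order
right = a + (b + c)   # another reduction order
print(left == right)  # False
print(left, right)    # 0.6000000000000001 0.6
```

Tiny differences like this are usually harmless, but if two logits are nearly tied, a last-bit wobble can flip which token wins, and from that point the two generations diverge.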