Der_Einzige 6 hours ago

Another possible explanation, especially if quality degrades at all (e.g., on OpenAI), is aggressive quantization.
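
For a sense of what "aggressive quantization" means mechanically, here is a minimal numpy sketch of symmetric int8 weight quantization. It is illustrative only; real deployments use fancier schemes (FP8, GPTQ, AWQ, ...), and nothing here describes any particular provider.

    import numpy as np

    def quantize_int8(w):
        # Symmetric per-tensor quantization: map floats onto [-127, 127].
        scale = np.abs(w).max() / 127.0
        q = np.round(w / scale).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        return q.astype(np.float32) * scale

    w = np.random.randn(4096).astype(np.float32)   # stand-in weight row
    q, scale = quantize_int8(w)
    err = np.abs(w - dequantize(q, scale)).mean()
    print(f"mean abs rounding error: {err:.6f}")   # small but nonzero -> slight quality loss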

Another possible explanation is speculative decoding, where you trade unused GPU memory for speed (via a draft model).
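
For the curious, here is a toy sketch of the speculative decoding idea, using stand-in functions rather than real models. It shows greedy verification only; production systems check all draft tokens in one batched forward pass of the big model and use a probabilistic accept/reject rule, so treat this as an illustration of the concept, not anyone's implementation.

    import random

    def speculative_step(target_next, draft_next, prefix, k=4):
        # One round of (greedy) speculative decoding.
        # draft_next/target_next map a token prefix to the next token.

        # 1. The cheap draft model guesses k tokens ahead.
        guesses, ctx = [], list(prefix)
        for _ in range(k):
            t = draft_next(ctx)
            guesses.append(t)
            ctx.append(t)

        # 2. The big target model verifies the guesses (shown sequentially
        #    here for clarity; in practice this is a single parallel pass).
        #    Accept until the first mismatch, then emit the target's token.
        accepted, ctx = [], list(prefix)
        for g in guesses:
            t = target_next(ctx)
            accepted.append(t)          # always the target's own choice
            ctx.append(t)
            if t != g:
                break                   # draft diverged; stop this round
        return accepted

    # Toy "models": the draft agrees with the target most of the time.
    target = lambda ctx: (len(ctx) * 7) % 50
    draft = lambda ctx: target(ctx) if random.random() < 0.8 else -1

    print(speculative_step(target, draft, [1, 2, 3]))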

But my money is on the exact two mechanisms the OP proposes.

anonymous908213 5 hours ago | parent [-]

> especially if quality degrades at all

It is worth noting that consumers are completely incapable of detecting quality degradation with any accuracy. That's a given, since the models are already effectively random, but there is also a strong tendency to hallucinate degradations. Having done frontend work for an AI startup, I can say that complaints about us degrading the model were by far the most common, despite the fact that our model did not change, and users could easily verify as much because we expose seeds. A significant portion of complainers kept complaining about model degradation even when shown that they could regenerate from the same seed+input and get the exact same output. Humans, at scale, are essentially incapable of comprehending the concept of randomness.
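
For anyone who wants to see what that check looks like, here is a minimal sketch assuming a Hugging Face transformers stack with gpt2 as a stand-in model (not the commenter's actual setup; determinism also assumes the same weights, hardware, and software config on both runs).

    from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

    model_name = "gpt2"   # stand-in; any causal LM works
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    def generate(prompt, seed):
        set_seed(seed)    # fixes the python/numpy/torch RNG state
        ids = tok(prompt, return_tensors="pt").input_ids
        out = model.generate(ids, do_sample=True, temperature=0.9, top_p=0.95,
                             max_new_tokens=40, pad_token_id=tok.eos_token_id)
        return tok.decode(out[0], skip_special_tokens=True)

    a = generate("The sky is", seed=1234)
    b = generate("The sky is", seed=1234)
    print(a == b)   # True: same seed + same input + same weights -> same output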

x_may 3 hours ago | parent | next [-]

Wait, sorry, how did you use and expose seeds? That's the most interesting part of your post.

anonymous908213 19 minutes ago | parent [-]

We are not a ChatGPT wrapper; we use a fine-tuned open-source model running on our own hardware, so we naturally have full control of the input parameters. I apologize if my language was ambiguous, but by "expose seeds" I simply meant that users can see the seed used for each prompt and enter their own in the UI, not that we are "exposing secrets" of the frontier LLM APIs, if that's what you took it to mean.

Der_Einzige 4 hours ago | parent | prev [-]

You can jiggle sampling settings around without changing the seed. That's effectively the same trick in practice, but even sneakier. (Though it wouldn't speed up inference unless they were dumb enough to have been doing beam search and then turned it off!)
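
To make that concrete, here is a toy numpy sampler (not any provider's code): the seed, and therefore the underlying random draw, is identical on both calls, but changing the temperature reshapes the distribution being sampled, so a fixed seed alone doesn't prove the sampling behaviour is unchanged.

    import numpy as np

    logits = np.array([2.0, 1.5, 0.5, -1.0])   # toy next-token logits

    def next_token(temperature, seed=1234):
        rng = np.random.default_rng(seed)      # seed is identical on every call
        p = np.exp(logits / temperature)
        p /= p.sum()
        print(f"T={temperature}: p={np.round(p, 3)}")
        return int(rng.choice(len(logits), p=p))

    next_token(temperature=0.7)
    next_token(temperature=1.5)   # same seed, but a different distribution is sampled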

Yeah, they can't tell, but there's also a lot of incentive for major LLM providers to deny doing something that would massively cut their inference costs.