simianwords 3 days ago

How does "supporting 1M tokens" really work in practice? Is it a new model? Or did they just remove some hard-coded constraint?

eldenring 3 days ago | parent [-]

Serving a model efficiently at 1M context is difficult and can be much more expensive and numerically tricky. I'm guessing they were working on serving it properly, since it's the same "model" by scores and such.
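To see why, here's a rough back-of-the-envelope for KV-cache memory alone (all model dimensions below are illustrative assumptions, not the actual architecture of any specific model):

    # Rough KV-cache size for one sequence. Layer count, KV heads,
    # and head dim are made-up illustrative values, not a real model's.
    def kv_cache_bytes(n_tokens, n_layers=60, n_kv_heads=8,
                       head_dim=128, bytes_per_elem=2):
        # 2x for separate key and value tensors, fp16 elements.
        return 2 * n_tokens * n_layers * n_kv_heads * head_dim * bytes_per_elem

    for ctx in (128_000, 250_000, 1_000_000):
        gib = kv_cache_bytes(ctx) / 2**30
        print(f"{ctx:>9,} tokens -> {gib:6.1f} GiB of KV cache")

Under these assumptions a single 1M-token sequence needs roughly 229 GiB of KV cache, versus roughly 29 GiB at 128k, and attention compute grows with context length too. So "supporting 1M tokens" is plausibly mostly a serving-infrastructure problem rather than a new model.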

simianwords 3 days ago | parent [-]

Thanks - still not clear what they actually did. Some inference-time hacks?

FergusArgyll 3 days ago | parent | next [-]

That would imply the model always had a 1M-token context but they limited it in the API and app? That's strange, because they could just charge more for every token past 250k (like Google does, I believe; rough pricing sketch below).

But if not, shouldn't it have to be a completely retrained model? It's clearly not that - good question!
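Tiered long-context pricing of that kind is just a piecewise rate. A minimal sketch, with made-up rates and the 250k threshold from above (not any provider's actual prices):

    # Hypothetical tiered prompt pricing. Rates are in $ per 1M tokens;
    # both rates and the threshold are illustrative only.
    def prompt_cost(n_tokens, base_rate=1.25, long_rate=2.50,
                    threshold=250_000):
        cheap = min(n_tokens, threshold)
        expensive = max(n_tokens - threshold, 0)
        return (cheap * base_rate + expensive * long_rate) / 1_000_000

    print(f"${prompt_cost(1_000_000):.2f} for a 1M-token prompt")  # $2.19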

otabdeveloper4 3 days ago | parent | prev | next [-]

Most likely it's still 32k tokens under the hood, but with some context slicing/averaging hacks so inference doesn't error out on arbitrarily long input (sketched below).

(That's what I do locally with llama.cpp)
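For reference, llama.cpp's "context shift" behaves roughly like this. This is a simplified sketch, not the real implementation, which shifts the existing KV cache in place rather than re-evaluating tokens:

    # Simplified sketch of llama.cpp-style context shifting: when the
    # window fills, keep the first n_keep tokens (e.g. the system prompt)
    # and drop the oldest half of everything after them.
    def shift_context(tokens, n_ctx=32_768, n_keep=512):
        if len(tokens) < n_ctx:
            return tokens
        rest = tokens[n_keep:]
        kept_tail = rest[len(rest) // 2:]  # discard the oldest half
        return tokens[:n_keep] + kept_tail

The model never attends past its trained window; dropped tokens are simply gone, which is why this is a workaround rather than true long context.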

Aeolun 3 days ago | parent | prev [-]

They already had a 0.5M context window on the enterprise version.