▲ | mikewarot 8 days ago |
You know what's actually hard to find in all this? The actual dimensions of the arrays in the GPT-OSS-120B model. At least with statically typed languages, you know how big your arrays are at a glance. I'm trying to find it in the GitHub repo[1], and I'm not seeing it. I'm just trying to figure out how wide the datastream through this is, in particular the actual data (not the weights) that flows through all of it: the width of the output stream. How big is a token at the output, before it's reduced with "temperature" to a few bytes? Assume infinitely fast compute in a magic black box, but you have to send the output through gigabit ethernet; what's the maximum number of tokens per second?
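For a rough sense of scale, here's a back-of-envelope sketch in Python. It assumes the per-token output is either the final hidden state (2880 values, the figure from the config cited downthread) or the full logit vector over the vocabulary (~201k entries is an assumption for the o200k tokenizer), each value in bf16 (2 bytes), and it ignores framing overhead on the wire:

    # Back-of-envelope: max tokens/s if the raw per-token output had to cross
    # gigabit ethernet. Widths and dtype below are assumptions, not measurements.
    LINK_BYTES_PER_S = 1_000_000_000 / 8   # 1 Gbit/s, ignoring ethernet/IP framing
    BYTES_PER_VALUE = 2                    # bf16

    for label, width in [("hidden state (2880 values)", 2_880),
                         ("full logits (~201k vocab)", 201_088)]:
        bytes_per_token = width * BYTES_PER_VALUE
        print(f"{label}: {bytes_per_token} bytes/token -> "
              f"{LINK_BYTES_PER_S / bytes_per_token:,.0f} tokens/s")

Under those assumptions that works out to roughly 21,000 tokens/s if you only ship the hidden state, and roughly 300 tokens/s if you ship the full logit vector.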
▲ | amluto 8 days ago | parent | next [-]
What’s the application where you want to stream out the logits for each consecutive token while still sampling each token according to the usual rule? Keep in mind that if you are doing the usual clever tricks, like restricting the next sampled token to something that satisfies a grammar, you need to process the logits, sample a token, and return it before running the next round of inference.
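A minimal sketch of that loop, with hypothetical model_step and grammar.allowed_token_mask helpers standing in for a real inference engine and grammar checker, showing why each step's logits must be masked and sampled into a concrete token before the next forward pass can start:

    import numpy as np

    def decode_with_grammar(model_step, grammar, prompt_ids, max_new=64, temperature=1.0):
        """Sequential constrained decoding: sample step t before computing step t+1."""
        ids = list(prompt_ids)
        for _ in range(max_new):
            logits = model_step(ids)                  # hypothetical forward pass -> [vocab] logits
            mask = grammar.allowed_token_mask(ids)    # hypothetical: bool mask of grammar-legal tokens
            logits = np.where(mask, logits, -np.inf)  # forbid everything the grammar rejects
            scaled = logits / temperature
            probs = np.exp(scaled - scaled[np.isfinite(scaled)].max())
            probs /= probs.sum()
            next_id = int(np.random.choice(len(probs), p=probs))
            ids.append(next_id)                       # only now can the next forward pass run
        return ids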
▲ | steeve 8 days ago | parent | prev [-]
According to https://huggingface.co/openai/gpt-oss-120b/blob/main/config...., that's 2880 values (so multiply by the dtype size).
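If you just want those fields without digging through the repo, something like this should work, assuming a transformers release recent enough to know the gpt_oss model type (a sketch, not verified against a specific version):

    from transformers import AutoConfig

    cfg = AutoConfig.from_pretrained("openai/gpt-oss-120b")
    print("hidden_size:", cfg.hidden_size)  # the 2880 figure above
    print("vocab_size:", cfg.vocab_size)    # width of the raw per-token logit vector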