mikewarot 8 days ago

You know what's actually hard to find in all this? The actual dimensions of the arrays in the GPT-OSS-120B model. At least with statically typed languages, you know how big your arrays are at a glance. I'm trying to find them in the GitHub repo[1], and I'm not seeing them.

I'm just trying to figure out how wide the data stream through this is, in particular the actual data (not the weights) that flows through all of it: the width of the output stream. Just how big is a token at the output, before temperature sampling reduces it to a few bytes?

Assume infinitely fast compute in a magic black box, but you have to send the output through gigabit ethernet... what's the maximum number of tokens per second?

[1] https://github.com/openai/gpt-oss/tree/main/gpt_oss
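
A hedged aside: those dimensions don't appear in the Python source as declarations, but the published config carries them. A minimal sketch, assuming the standard Hugging Face config keys ("hidden_size", "vocab_size") hold for this model:

  # Sketch, not from the repo: pull the published config.json and read
  # the dimensions directly. Key names are the standard Hugging Face
  # ones and assumed to hold for gpt-oss-120b.
  import json
  import urllib.request

  url = "https://huggingface.co/openai/gpt-oss-120b/raw/main/config.json"
  with urllib.request.urlopen(url) as f:
      cfg = json.load(f)

  print("hidden_size:", cfg.get("hidden_size"))  # residual-stream width per token
  print("vocab_size:", cfg.get("vocab_size"))    # logit-vector width per output token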

amluto 8 days ago | parent | next [-]

What's the application where you want to stream out the logits for each consecutive token while still sampling each token according to the usual rule? Keep in mind that, if you're doing the usual clever tricks like restricting the next token to something that satisfies a grammar, you need to process the logits, sample a token, and feed it back before running the next round of inference.
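
To picture that serialization constraint, here's a toy sketch of the masking-and-sampling step (the allowed_ids set is hypothetical, standing in for whatever the grammar permits next):

  # Toy sketch of grammar-constrained sampling: mask the logits to the
  # grammar-legal set, then sample. The next inference step can't start
  # until this finishes, so logits can't simply be streamed out raw.
  import torch

  def sample_constrained(logits, allowed_ids, temperature=1.0):
      mask = torch.full_like(logits, float("-inf"))
      mask[allowed_ids] = 0.0                       # keep only legal tokens
      probs = torch.softmax((logits + mask) / temperature, dim=-1)
      return torch.multinomial(probs, num_samples=1).item()

  logits = torch.randn(50_000)                      # one raw logit vector
  print(sample_constrained(logits, torch.tensor([5, 17, 42])))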

mikewarot 8 days ago | parent [-]

I know the actual output of the model is wider than a token... but I can't find the actual width (in values or bytes) anywhere in the source. Perhaps it's my very casual familiarity with Python that's limiting me, but I don't see any declarations of array sizes anywhere in the code.

I'm just trying to calculate the actual bandwidth required for the full output of the model, not just a token to be handed off to the user.

I need this so I can compute what bandwidth a fully FPGA-based (later ASIC) implementation of the model would require.
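
A minimal runtime check (assuming the Hugging Face transformers port of the model) makes the width visible, since it only ever exists as a tensor shape:

  # Sketch: the logits tensor is shaped at runtime, which is why no
  # static array-size declaration appears in the Python source.
  from transformers import AutoModelForCausalLM, AutoTokenizer

  model_id = "openai/gpt-oss-120b"
  tok = AutoTokenizer.from_pretrained(model_id)
  model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

  out = model(**tok("Hello", return_tensors="pt"))
  # (batch, sequence_length, vocab_size); the last dimension is the raw
  # per-token output width before sampling collapses it to one token id.
  print(out.logits.shape)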

Edit/Append: I asked GPT-5, and it estimated:

  Total bytes = 50,000 tokens × 4 bytes/token = 200,000 bytes
Which sounds about right to me. Gigabit Ethernet moves about 125,000,000 bytes/second, so that caps out around 625 full logit vectors per second.

The actual compute of the model is peanuts compared to just shuffling the data around.
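
Spelled out, with GPT-5's 50,000-entry vocabulary taken as an assumption and fp32 logits:

  # Back-of-envelope version of the estimate above. The 50,000-entry
  # vocabulary is GPT-5's guess, not a confirmed figure; fp32 assumed.
  vocab_size = 50_000
  bytes_per_vector = vocab_size * 4            # 200,000 bytes per token
  link_bytes_per_s = 1_000_000_000 / 8         # gigabit Ethernet, framing ignored
  print(link_bytes_per_s / bytes_per_vector)   # ~625 full logit vectors/second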

steeve 8 days ago | parent | prev [-]

According to https://huggingface.co/openai/gpt-oss-120b/blob/main/config....

That's 2880 values per token (the config's hidden_size), so multiply by the dtype width for bytes.
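
If that 2880 is the per-token hidden state, the byte math is small either way; a quick sketch assuming bfloat16:

  # Per-token traffic implied by the config value above: 2880 values,
  # times the dtype width (bfloat16 assumed here).
  hidden_size = 2880
  print(hidden_size * 2)   # 5,760 bytes per token's hidden state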