amluto 8 days ago

What’s the application where you want to stream out the logits for each consecutive token while still sampling each token according to the usual rule? Keep in mind that if you are doing the usual clever tricks, like restricting the next token to one that satisfies a grammar, you need to process the logits, sample a token, and feed it back before running the next round of inference.
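A minimal sketch of that loop (assuming PyTorch, a HuggingFace-style model whose forward pass returns `.logits`, and a hypothetical `grammar` object that tracks the parse state):

  import torch

  def constrained_generate(model, input_ids, grammar, max_new_tokens=32):
      for _ in range(max_new_tokens):
          logits = model(input_ids).logits[:, -1, :]    # next-token scores, one per vocab entry
          mask = torch.full_like(logits, float("-inf"))
          mask[:, grammar.allowed_token_ids()] = 0.0    # hypothetical grammar hook
          probs = torch.softmax(logits + mask, dim=-1)  # zero probability outside the grammar
          next_id = torch.multinomial(probs, num_samples=1)
          input_ids = torch.cat([input_ids, next_id], dim=-1)
          grammar.advance(next_id.item())               # grammar must see the token before the next step
      return input_ids

The point is the data dependency: the masked sampling step sits between every two forward passes, so the logits can’t simply be streamed out asynchronously.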

mikewarot 8 days ago | parent

I know the actual output of the model is wider than a token... but I can't find the actual width (or the number of bytes) in the source. Perhaps it's my very casual familiarity with Python that's limiting me, but I don't see any actual declarations of array sizes anywhere in the code.
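One way to read it off at runtime rather than hunting for a declaration: the width is the vocabulary size of the final projection layer. A sketch, assuming a HuggingFace GPT-2 checkpoint:

  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  tok = AutoTokenizer.from_pretrained("gpt2")
  model = AutoModelForCausalLM.from_pretrained("gpt2")

  print(model.lm_head.out_features)      # 50257: one logit per vocabulary entry

  ids = tok("hello", return_tensors="pt").input_ids
  logits = model(ids).logits             # shape: (batch, seq_len, vocab_size)
  print(logits.shape[-1], logits.dtype)  # 50257, torch.float32 -> 4 bytes per logit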

I'm just trying to calculate the actual bandwidth required for the full output of the model, not just a token to be handed off to the user.

I need this so I can compute what bandwidth a fully FPGA-based (later ASIC) implementation of the model would require.

Edit/Append: I asked GPT-5, and it estimated:

  Total bytes = 50,000 logits × 4 bytes/logit = 200,000 bytes
Which sounds about right to me. On Gigabit Ethernet (125 MB/s) that caps out at roughly 625 full logit vectors per second, before protocol overhead.
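Spelled out (assuming float32 logits and ignoring protocol overhead):

  vocab_size = 50_000                # ~50,257 for GPT-2's vocabulary
  bytes_per_vector = vocab_size * 4  # float32 -> 200,000 bytes per token step
  wire = 1e9 / 8                     # Gigabit Ethernet: 125,000,000 bytes/s
  print(wire / bytes_per_vector)     # 625.0 logit vectors per second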

The actual compute of the model is peanuts compared to just shuffling the data around.