lrvick | 5 days ago
Job one is to have every bit of software involved also be deterministic, which stagex takes care of. I had no problem getting deterministic LLM outputs when I experimented with this 6 months ago: run two of these with the same prompts and the same seed and you get the same results. Obviously, in GPU clusters with heterogeneous hardware, things get more complicated.
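A minimal sketch of the single-stream case being described, using a toy softmax-and-sample loop rather than any real inference engine: with identical logits and an identical RNG seed, sampling is bit-for-bit reproducible across runs.

```python
import numpy as np

def sample_tokens(logits, seed, n=5):
    """Softmax over a toy vocabulary, then sample n token ids with a seeded RNG."""
    rng = np.random.default_rng(seed)
    p = np.exp(logits - logits.max())  # numerically stable softmax
    p /= p.sum()
    return [int(rng.choice(len(p), p=p)) for _ in range(n)]

logits = np.array([2.0, 1.0, 0.5, -1.0])
run1 = sample_tokens(logits, seed=42)
run2 = sample_tokens(logits, seed=42)
assert run1 == run2  # same inputs, same seed, same math -> same tokens
```

This only holds because every floating-point operation happens in exactly the same order both times, which is precisely the property that falls apart under dynamic batching.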
spindump8930 | 5 days ago
That's not what this is about. "I had no problem getting deterministic LLM outputs when I experimented with this 6 months ago": it looks like you're using llama-cpp in that repo. This post is about vLLM serving many concurrent requests at long sequence lengths.

> As it turns out, our request’s output does depend on the parallel user requests. Not because we’re somehow leaking information across batches — instead, it’s because our forward pass lacks “batch invariance”, causing our request’s output to depend on the batch size of our forward pass.

Your situation isn't really comparable.
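The root cause the post describes is that floating-point addition is not associative, so a kernel that reduces in a different order at a different batch size produces different bits. A tiny, deterministic illustration (pure float32 arithmetic, no GPU needed): the same three values summed in two orders give two different answers.

```python
import numpy as np

# Three float32 values whose sum depends on grouping:
a = np.float32(1e8)
b = np.float32(1.0)
c = np.float32(-1e8)

# Order 1: the 1.0 is absorbed into 1e8 (below float32 precision there) and lost.
left = (a + b) + c   # -> 0.0

# Order 2: the large terms cancel first, so the 1.0 survives.
right = (a + c) + b  # -> 1.0

print(left, right)  # 0.0 1.0
```

A batched forward pass that changes its reduction tree with batch size is doing exactly this kind of reordering, which is why the same request can yield different tokens depending on who else is in the batch.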
saagarjha | 5 days ago
What’s stagex? | ||||||||