oktoberpaard 4 days ago

It gives weird results for me. I’m running Qwen3-32B at Q4_K_M with a 32K context length and an 8-bit KV cache, fully offloaded to 24 GB of VRAM. According to this calculator that should be impossible by a large margin, yet it works for me.

Edit: this might be because I’ve got flash attention enabled in Ollama.
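
For a rough sanity check, here is a back-of-the-envelope sketch of the two biggest VRAM consumers: the quantized weights and the KV cache. All the constants are assumptions on my part rather than numbers from the calculator: 64 layers, 8 KV heads, and a head dimension of 128 for Qwen3-32B, about 32.8B parameters, roughly 4.85 bits per weight on average for Q4_K_M, and exactly 1 byte per element for the 8-bit cache (real q8_0 storage carries a little extra for scales). The function names are made up for illustration.

```python
GIB = 1024 ** 3

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem):
    """Approximate size of the K and V caches for a single sequence."""
    # Factor of 2: one tensor for keys and one for values in every layer.
    return int(2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem)

def weight_bytes(n_params, bits_per_weight):
    """Approximate size of the quantized weights."""
    return int(n_params * bits_per_weight / 8)

# Assumed model shape and quantization figures (not taken from the calculator).
N_LAYERS = 64
N_KV_HEADS = 8          # grouped-query attention: fewer KV heads than query heads
HEAD_DIM = 128
N_CTX = 32 * 1024
N_PARAMS = 32.8e9
BITS_PER_WEIGHT = 4.85  # rough average for Q4_K_M

kv = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, N_CTX, bytes_per_elem=1.0)
weights = weight_bytes(N_PARAMS, BITS_PER_WEIGHT)

print(f"KV cache: {kv / GIB:.2f} GiB")
print(f"Weights:  {weights / GIB:.2f} GiB")
print(f"Total:    {(kv + weights) / GIB:.2f} GiB (excludes compute buffers)")
```

Under these assumptions the 8-bit KV cache at 32K context comes to 4 GiB. The estimate is very sensitive to the number of KV heads and the bytes per cache element, so a calculator that assumes an fp16 cache or counts every attention head would land much higher. That is a guess at the discrepancy, not something I have verified.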