Remix clone Hacker News

	▲	ggerganov 2 hours ago
		Here are the prefill speeds:

		Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes, VRAM: 32109 MiB

		| model | size | params | backend | fa | test | t/s |
		| ------------------------------ | ---------: | ---------: | -------- | --: | --------------: | -------------------: |
		| qwen35 27B Q4_K - Medium | 15.92 GiB | 27.32 B | CUDA | 1 | pp2048 @ d512 | 3714.02 ± 10.85 |
		| qwen35 27B Q4_K - Medium | 15.92 GiB | 27.32 B | CUDA | 1 | pp2048 @ d1024 | 3684.86 ± 15.21 |
		| qwen35 27B Q4_K - Medium | 15.92 GiB | 27.32 B | CUDA | 1 | pp2048 @ d2048 | 3650.80 ± 8.53 |
		| qwen35 27B Q4_K - Medium | 15.92 GiB | 27.32 B | CUDA | 1 | pp2048 @ d8192 | 3473.88 ± 0.97 |
		| qwen35 27B Q4_K - Medium | 15.92 GiB | 27.32 B | CUDA | 1 | pp2048 @ d32768 | 2754.69 ± 4.07 |

		ggml_metal_device_init: GPU name: MTL0 (Apple M2 Ultra)

		| model | size | params | backend | fa | test | t/s |
		| ------------------------------ | ---------: | ---------: | -------- | --: | --------------: | -------------------: |
		| qwen35 27B Q8_0 | 26.62 GiB | 26.90 B | MTL | 1 | pp2048 @ d512 | 379.75 ± 0.21 |
		| qwen35 27B Q8_0 | 26.62 GiB | 26.90 B | MTL | 1 | pp2048 @ d1024 | 377.15 ± 0.35 |
		| qwen35 27B Q8_0 | 26.62 GiB | 26.90 B | MTL | 1 | pp2048 @ d2048 | 371.46 ± 0.91 |
		| qwen35 27B Q8_0 | 26.62 GiB | 26.90 B | MTL | 1 | pp2048 @ d8192 | 344.84 ± 0.41 |
		| qwen35 27B Q8_0 | 26.62 GiB | 26.90 B | MTL | 1 | pp2048 @ d32768 | 222.42 ± 5.29 |

		Btw, based on your numbers, I think our use cases are quite different. I use the agent for very targeted sessions - basically things that are clear to me how to do, just want to automate them. My workflow is usually: new session -> read this, this and this -> do that. I.e. I don't let it wander at all in the codebase, so I rarely exceed the context window.
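For a sense of scale, the t/s figures in the pp2048 rows above can be converted into wall-clock time per 2048-token batch and into a relative slowdown between the shallowest and deepest context. The snippet below is only a back-of-the-envelope sketch: the dictionary and function names are illustrative, the numbers are copied from the rows above, and it assumes "pp2048 @ dN" means processing a 2048-token batch with N tokens already in context.

```python
# Prefill throughput in tokens/s, copied from the pp2048 rows above.
# Keys are the context depth (the N in "pp2048 @ dN").
RTX_5090_Q4_K_M = {512: 3714.02, 1024: 3684.86, 2048: 3650.80, 8192: 3473.88, 32768: 2754.69}
M2_ULTRA_Q8_0 = {512: 379.75, 1024: 377.15, 2048: 371.46, 8192: 344.84, 32768: 222.42}


def seconds_per_batch(tokens_per_second: float, batch_tokens: int = 2048) -> float:
    """Wall-clock seconds needed to prefill one batch at the given throughput."""
    return batch_tokens / tokens_per_second


def slowdown(results: dict, shallow: int = 512, deep: int = 32768) -> float:
    """Fraction of throughput lost when going from the shallow to the deep context."""
    return 1.0 - results[deep] / results[shallow]


for name, results in [("RTX 5090, Q4_K Medium", RTX_5090_Q4_K_M), ("M2 Ultra, Q8_0", M2_ULTRA_Q8_0)]:
    for depth, tps in results.items():
        print(f"{name}: depth {depth:>5} -> {seconds_per_batch(tps):.2f} s per 2048 tokens")
    print(f"{name}: throughput drop d512 -> d32768: {slowdown(results):.1%}")
```

With these inputs a 2048-token batch takes roughly 0.55 s at depth 512 and 0.74 s at depth 32768 on the RTX 5090, versus roughly 5.4 s and 9.2 s on the M2 Ultra. The two runs use different quantizations (Q4_K Medium vs Q8_0), so this is not a like-for-like hardware comparison.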
		Also, I get a lot of mileage from the ngram-based speculative decoding functionality [0], as it allows me to iterate on the implementation much faster.
		[0] https://github.com/ggml-org/llama.cpp/pull/19164
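For readers unfamiliar with the idea, here is a minimal sketch of n-gram lookup drafting in general. It is not the implementation from the linked PR: `draft_from_ngram`, `accept_prefix` and the toy `next_token` model are hypothetical names made up for illustration. The draft step looks for an earlier occurrence of the last `n` tokens and proposes whatever followed it; the verify step keeps draft tokens only while they match what the target model would have produced anyway. A real engine would verify the whole draft in one batched forward pass, whereas this sketch calls a stub once per token.

```python
from typing import Callable, List, Sequence


def draft_from_ngram(tokens: Sequence[int], n: int = 3, max_draft: int = 8) -> List[int]:
    """Propose draft tokens by reusing what followed the most recent earlier
    occurrence of the last n tokens. Returns [] if there is no earlier match."""
    if len(tokens) <= n:
        return []
    key = list(tokens[-n:])
    # Scan backwards over start positions that lie before the trailing n-gram.
    for start in range(len(tokens) - n - 1, -1, -1):
        if list(tokens[start:start + n]) == key:
            return list(tokens[start + n:start + n + max_draft])
    return []


def accept_prefix(draft: Sequence[int],
                  next_token: Callable[[List[int]], int],
                  context: Sequence[int]) -> List[int]:
    """Greedy verification: accept draft tokens while they equal the model's own
    next token; on the first mismatch emit the model's token and stop."""
    ctx = list(context)
    accepted: List[int] = []
    for tok in draft:
        expected = next_token(ctx)  # stand-in for the target model (assumption)
        if expected != tok:
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    return accepted


# Toy "model" that cycles 1, 2, 3, 4, 5, 1, 2, ... (purely illustrative).
def cyclic_model(ctx: List[int]) -> int:
    return ctx[-1] % 5 + 1


history = [1, 2, 3, 4, 5, 1, 2, 3]
draft = draft_from_ngram(history, n=3)
print(draft, accept_prefix(draft, cyclic_model, history))
```

Because every emitted token is one the model would have produced greedily, the output is unchanged and only the number of sequential decoding steps drops. That is why this kind of drafting tends to pay off when the model is rewriting code that is already in the context.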
	▲	kpw94 2 hours ago
		Thanks! Super helpful. I do use it the same way you're describing on personal projects at home, in a very crude manner (pasting code snippets into the llama server web UI prompt; next I'll try OpenCode).
		At work I use it in a similar manner with more mature tools, but the vast majority of token spend comes from a totally different workflow: "pretend the AI is a fleet of junior/intern engineers you're delegating work to", where the agent does the implementation on its own, commits the changes, etc. It does indeed spend a lot of tokens wandering the codebase, talking to MCPs, loading skills, etc.