ggerganov 2 hours ago
Here are the prefill speeds:
Btw, based on your numbers, I think our use cases are quite different. I use the agent for very targeted sessions - basically things I already know how to do and just want to automate. My workflow is usually: new session -> read this, this and this -> do that. I.e. I don't let it wander the codebase at all, so I rarely exceed the context window. Also, I get a lot of mileage from the ngram-based speculative decoding functionality [0], as it allows me to iterate on the implementation much faster.
kpw94 2 hours ago
Thanks! Super helpful. I do use it the same way you're describing on personal projects at home, in a very crude manner (pasting code snippets into the llama server web UI prompt; next I'll attempt OpenCode). At work I use it in a similar manner with more mature tools, but the vast majority of token spend comes from a totally different workflow: "pretend the AI is a fleet of junior/intern engineers you're delegating work to", where the agent does the implementation on its own, commits the changes, etc. It does indeed spend a lot of tokens wandering the codebase, talking to MCPs, loading skills, etc.