| ▲ | liuliu 4 hours ago | |
Realistically, you need to experiment with any user prompt + a good amount of system prompt (at least > 1000 tokens, but realistically, in the range of 3000 tokens probably good). llama.cpp includes tools for that, what you are looking at is to have a prefill before token generation to measure it properly. Increasingly also, measuring token generation speed at longer context (32k or 64k) is important too. | ||