112233 | 8 days ago
Are there any easy-to-use inference frontends that support rewriting/pruning the context? Ideally also masking out chunks of the KV cache (e.g. old think blocks)? I cannot find anything short of writing a custom fork/app on top of HF Transformers or llama.cpp.
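One half of what's asked for here (pruning old think blocks) needs no special frontend support, since it only touches the prompt text that gets re-sent. A minimal sketch, assuming an OpenAI-style message list and `<think>...</think>` delimiters (the function name is illustrative):

```python
import re

def strip_think_blocks(messages):
    """Remove <think>...</think> spans from prior assistant turns
    so they are not re-sent (and re-tokenized) on the next request."""
    cleaned = []
    for msg in messages:
        if msg["role"] == "assistant":
            content = re.sub(r"<think>.*?</think>\s*", "",
                             msg["content"], flags=re.DOTALL)
            cleaned.append({**msg, "content": content})
        else:
            cleaned.append(msg)
    return cleaned

history = [
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "<think>Trivial arithmetic.</think>4"},
]
print(strip_think_blocks(history)[1]["content"])  # -> 4
```

Note this only prunes the prompt, not the server-side KV cache: the backend's cache is invalidated from the first changed token onward, which is exactly why true KV-cache masking needs support in the inference engine itself.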
diggan | 8 days ago
I tend to use my own "prompt management CLI" (https://github.com/victorb/prompta) to set up somewhat reusable prompts, then paste the output into whatever UI/CLI I'm using at the moment. Rewriting/pruning is then a matter of changing the files on disk, re-running "prompta output", and creating a new conversation. I basically never go beyond one user message and one assistant message; it seems to degrade really quickly otherwise.
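The file-based workflow described above can be sketched roughly like this (not prompta's actual code, just the pattern: ordered fragment files concatenated into one prompt, so pruning context means editing or deleting files and re-assembling):

```python
import tempfile
from pathlib import Path

def assemble_prompt(prompt_dir):
    """Concatenate prompt fragment files in sorted filename order.
    Editing/removing a file on disk and re-running this function
    is the whole "context rewriting" step."""
    parts = [p.read_text().strip()
             for p in sorted(Path(prompt_dir).glob("*.md"))]
    return "\n\n".join(parts)

# Example: two numbered fragments assembled in order.
d = tempfile.mkdtemp()
Path(d, "01-system.md").write_text("You are terse.")
Path(d, "02-task.md").write_text("Summarize the diff.")
print(assemble_prompt(d))
```

Since each request starts from a freshly assembled prompt, there is no accumulated multi-turn history to degrade, which matches the one-user-message/one-assistant-message usage described above.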