112233 8 days ago

Are there any easy-to-use inference frontends that support rewriting/pruning the context? Also, ideally, ones that can mask out chunks of the KV cache (e.g. old think blocks)?

Because I cannot find anything short of writing a custom fork/app on top of hf transformers or llama.cpp.
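
For concreteness, the DIY route with transformers would look roughly like the sketch below: strip old <think> blocks out of the history and re-prefill from the rewritten messages. The model id and the <think> tag format are placeholders (reasoning models differ in how they mark think blocks), and this only rewrites the prompt text and saves tokens on the next prefill; selectively masking entries in an already-built KV cache would still need backend changes.

    import re
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder; any chat model with a template works
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForCausalLM.from_pretrained(MODEL)

    def prune_think_blocks(messages):
        # Drop <think>...</think> spans from assistant turns already in the history.
        return [
            {**m, "content": re.sub(r"<think>.*?</think>\s*", "", m["content"], flags=re.DOTALL)}
            if m["role"] == "assistant" else m
            for m in messages
        ]

    history = [
        {"role": "user", "content": "Summarize the bug report."},
        {"role": "assistant", "content": "<think>lots of reasoning...</think>Looks like a race condition."},
        {"role": "user", "content": "Suggest a fix."},
    ]

    # Re-prefill from the rewritten history and generate the next turn.
    inputs = tokenizer.apply_chat_template(
        prune_think_blocks(history), add_generation_prompt=True, return_tensors="pt"
    )
    out = model.generate(inputs, max_new_tokens=128)
    print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))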

diggan 8 days ago

I tend to use my own "prompt management CLI" (https://github.com/victorb/prompta) to set up somewhat reusable prompts, then paste the output into whatever UI/CLI I use at the moment.

Then rewriting/pruning is a matter of changing the files on disk, rerunning "prompta output", and creating a new conversation. I basically never go beyond one user message and one assistant message; quality seems to degrade really quickly otherwise.
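
Roughly, the loop amounts to the sketch below. The prompts/ layout and the edit are made up for illustration; only the "prompta output" command comes from the workflow above, assuming it prints the assembled prompt to stdout.

    import subprocess
    from pathlib import Path

    fragment = Path("prompts/context.md")  # hypothetical prompt fragment file
    text = fragment.read_text()
    # "Pruning" is just editing the fragment on disk before regenerating.
    fragment.write_text(text.replace("OLD CONSTRAINT", "NEW CONSTRAINT"))

    # Rebuild the combined prompt, ready to paste into a fresh conversation.
    assembled = subprocess.run(
        ["prompta", "output"], capture_output=True, text=True, check=True
    ).stdout
    print(assembled)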