akssassin907 2 hours ago
The pattern I found that works: use a small local model (Llama 3B via Ollama, only about 2GB) for heartbeat checks. It just needs to answer "is there anything urgent?", which is a yes/no classification task, not a frontier reasoning task. Reserve the expensive model for actual work. Done right, this can cut token spend by roughly 75% in practice without meaningfully degrading heartbeat quality. The tricky part is the routing logic: deciding which calls go to the cheap model and which actually need the real one. It can be a doozy. I've done this with three lobsters, so let me know if you have any questions.
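A minimal sketch of what the routing can look like, assuming Ollama's local /api/generate endpoint is running on the default port; the model tag, the urgency prompt, and the call_frontier_model() helper for the expensive path are all illustrative, not a prescription:

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"
CHEAP_MODEL = "llama3.2:3b"  # assumption: whichever small tag you pulled locally


def call_frontier_model(context: str) -> None:
    # Placeholder for the expensive path (your real API call goes here).
    raise NotImplementedError


def heartbeat_is_urgent(context: str) -> bool:
    """Ask the cheap local model a yes/no question; anything that
    doesn't start with 'yes' counts as not urgent."""
    resp = requests.post(
        OLLAMA_URL,
        json={
            "model": CHEAP_MODEL,
            "prompt": (
                "Answer with exactly 'yes' or 'no'. "
                "Is there anything urgent in the following?\n\n" + context
            ),
            "stream": False,  # return one JSON object instead of a token stream
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["response"].strip().lower().startswith("yes")


def heartbeat(context: str) -> None:
    # Route: cheap local classification first, frontier model only on a hit.
    if heartbeat_is_urgent(context):
        call_frontier_model(context)
    # else: the cheap check already answered the question; spend nothing more
```

The design choice that matters is failing closed on ambiguity: any answer that isn't a clear "yes" stays on the cheap path, so the frontier model is only invoked when the classifier commits to urgency.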