hazrmard 2 days ago
This reflects my experience. Yet I feel that getting reliability out of LLM calls with a while-loop harness is elusive. For example:

- how can I reliably have a decision block to end the loop (or keep it running)?
- how can I reliably call tools with the right schema?
- how can I reliably summarize context / excise noise from the conversation?

Perhaps, as the models get better, they'll approach some threshold where my worries just go away. However, I can't quantify that threshold myself, and that leaves a cloud of uncertainty hanging over any agentic loops I build. Perhaps I should accept that it's a feature and not a bug? :)
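For concreteness, here is a minimal sketch (Python, OpenAI SDK; the model name and the dispatch_tool helper are made up) of the kind of while-loop harness in question. The "decision block" here is nothing fancier than "stop when the model makes no tool calls", plus a hard iteration cap:

    import json
    from openai import OpenAI

    client = OpenAI()

    def dispatch_tool(call):
        # Hypothetical dispatcher: a real one would look up the named tool
        # and run it with the parsed arguments.
        args = json.loads(call.function.arguments)
        return f"(pretend result of {call.function.name} with {args})"

    def run_agent(messages, tools, max_turns=20):
        for _ in range(max_turns):  # hard cap so a bad decision can't loop forever
            response = client.chat.completions.create(
                model="gpt-4o",  # assumed model name
                messages=messages,
                tools=tools,
            )
            msg = response.choices[0].message
            messages.append(msg)

            # Decision block: no tool calls means the model considers itself done.
            if not msg.tool_calls:
                return msg.content

            for call in msg.tool_calls:
                messages.append({
                    "role": "tool",
                    "tool_call_id": call.id,
                    "content": dispatch_tool(call),
                })
        return None  # gave up after max_turns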
nyellin 2 days ago
Forgot to address the easiest part: > - how can I reliably call tools with the right schema? This is typically done by enabling strict mode for tool calling, which is a hermetic solution: it makes the LLM unable to generate tokens that would violate the schema. (I.e., the LLM samples only from the subset of tokens that lead to valid schema output.)
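A minimal sketch with the OpenAI Chat Completions API (the tool name and schema are made up; other providers expose the same idea under different flags). With strict mode, the schema must mark every property as required and set additionalProperties to false:

    from openai import OpenAI

    client = OpenAI()

    # Hypothetical tool whose arguments the model must emit exactly per the schema.
    tools = [{
        "type": "function",
        "function": {
            "name": "get_pod_logs",
            "description": "Fetch logs for a Kubernetes pod",
            "strict": True,  # constrained decoding: only schema-valid tokens are sampled
            "parameters": {
                "type": "object",
                "properties": {
                    "namespace": {"type": "string"},
                    "pod_name": {"type": "string"},
                },
                "required": ["namespace", "pod_name"],
                "additionalProperties": False,  # strict mode requires this
            },
        },
    }]

    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model name
        messages=[{"role": "user", "content": "Get logs for pod web-1 in the default namespace"}],
        tools=tools,
    )
    print(response.choices[0].message.tool_calls)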
nyellin 2 days ago
Re (1), use a TODOs system like Claude Code's. Re (2) is also fairly easy! It's just a summarization prompt. E.g., this is the one we use in our agent: https://github.com/HolmesGPT/holmesgpt/blob/62c3898e4efae69b... Or just use the Claude Code SDK, which does all of this for you! (You can also use various provider-specific features for (2), like automatic compaction on the OpenAI Responses endpoint.)
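The shape of the compaction step looks roughly like this (a sketch only; the prompt text, thresholds, and model name below are made up for illustration, the linked HolmesGPT prompt is the real thing):

    from openai import OpenAI

    client = OpenAI()

    SUMMARIZE_PROMPT = (
        "Summarize the conversation so far. Keep tool results, decisions made, and "
        "open questions. Be concise; this summary will replace the original messages."
    )

    def compact(messages, max_messages=40, keep_recent=10):
        """If the history is long, replace all but the last few messages with a summary."""
        if len(messages) <= max_messages:
            return messages
        head, tail = messages[:-keep_recent], messages[-keep_recent:]
        summary = client.chat.completions.create(
            model="gpt-4o-mini",  # assumed model name
            messages=head + [{"role": "user", "content": SUMMARIZE_PROMPT}],
        ).choices[0].message.content
        return [{"role": "system", "content": "Summary of earlier conversation: " + summary}] + tail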