atombender a day ago

I'm not sure I understand this argument. I create new tools all the time as part of my development work, and I have skills stored that tell agents how to use them. They use them flawlessly.

When I say "benchmark the query engine using the foobar dataset and compare it to run 431", the agents go and run my special benchmark tool and use the different subcommands to compare results and so on.

I'm sure a new VCS would be a little less smooth sailing, but not by much.

raincole 13 hours ago

> I'm not sure I understand this argument. I create new tools all the time as part of my development work, and I have skills stored that tell agents how to use them. They use them flawlessly.

I highly doubt that your tool is like this:

> git branch -vv | grep ': gone]'| grep -v "*" | awk '{ print $1; }' | xargs -r git branch -d

Or:

> ffmpeg -i main_course.mp4 -i reaction_cam.mov -filter_complex "[1:v]scale=480:270[pip_scaled]; [0:v][pip_scaled]overlay=W-w-20:20[pip_video]; [pip_video]drawtext=text='LIVE RECORDING':fontcolor=white:fontsize=24:box=1:boxcolor=black@0.6:x=30:y=30[final_video]; [0:a][1:a]amix=inputs=2:duration=first:dropout_transition=2[final_audio]" -map "[final_video]" -map "[final_audio]" -c:v libx264 -crf 21 -preset fast -c:a aac -b:a 192k output_production.mp4

LLMs generate these for breakfast.

cruffle_duffle 11 hours ago

It’s really wild watching LLMs construct those calls. They batch so many different checks and stuff into a single tool call, delimit them with markers, etc.

The crazy thing to me is that this kind of “composition of small tools to create something bigger” is the biggest vindication of the Unix philosophy I can think of.

I have to wonder how much of that behavior was trained into the model and how much is the secret herbs and spices they toss into the harness's own system prompts.

fireant 10 hours ago

Personally I really dislike when the agents generate super long composed shell commands because they are really hard to audit. ffmpeg I'd whitelist, but if it makes a mistake in some super long chained git command it can have pretty scary consequences.
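
One way to make a long composed command easier to audit is to list every program it starts before approving it. The following is a minimal sketch, assuming Python 3.9+ and a made-up allowlist; `ALLOWED_PROGRAMS`, `programs_in`, and `unapproved` are hypothetical names, not part of any agent harness. It only looks at the first word of each pipeline stage, so it does not see commands launched indirectly (for example the `git branch -d` that `xargs` runs), and a quoted argument that consists only of `|`, `&`, or `;` would be misread as a separator.

```python
import shlex

# Hypothetical allowlist, for illustration only.
ALLOWED_PROGRAMS = {"git", "grep", "awk", "ffmpeg"}

# Characters that make up shell control operators such as |, ||, &&, ; and &.
SEPARATOR_CHARS = set("|&;")


def programs_in(command: str) -> list[str]:
    """Return the first word of every stage in a composed shell command."""
    lexer = shlex.shlex(command, posix=True, punctuation_chars=True)
    lexer.whitespace_split = True
    programs = []
    expect_program = True  # the next ordinary token starts a new stage
    for token in lexer:
        if token and set(token) <= SEPARATOR_CHARS:
            expect_program = True
        elif expect_program:
            programs.append(token)
            expect_program = False
    return programs


def unapproved(command: str) -> list[str]:
    """Return the programs in the command that are not on the allowlist."""
    return [p for p in programs_in(command) if p not in ALLOWED_PROGRAMS]
```

Run against the chained git command quoted earlier in the thread, this flags `xargs`, which is exactly the stage that ends up deleting branches.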

yencabulator 4 hours ago

Totally breaks the permission model in Claude Code.

sdesol a day ago

I think the issue is that it goes against a very well-established pattern. I have a tool that wraps ripgrep so that search results always include context, and from time to time the agent will use ripgrep by itself. When I ask why, it says "yeah, I should have done that."

There are workarounds, though, and I am creating what I call knowledge triggers for Pi that are similar to Claude's "PreToolUse", so having the agent use oak all the time is not an issue in my opinion.

The challenge for oak is: why? I actually want to slow agents down so I can ensure they are doing the right thing, and the massive bottleneck is the LLMs themselves, so speed measured in milliseconds or even seconds will not concern many.

I thought oak's pitch was more along the lines of "we know how to inject context into the prompt based on the code stored in oak." Faster operations can help, but the use case is limited. The missing piece for better/correct code is context at the right time.

nextaccountic a day ago

> I think the issue is that it goes against a very well-established pattern. I have a tool that wraps ripgrep so that search results always include context, and from time to time the agent will use ripgrep by itself. When I ask why, it says "yeah, I should have done that."

There's a limit to how many simultaneous instructions an agent can follow (the exact number depends on the model, so instructions that are fine for one model may overwhelm another). If this keeps happening, consider trimming your instructions or, even better, solving it at the harness level (for example, intercepting ripgrep calls and rewriting them to use your tool, as rtk [0] does in agents that support this).
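
To illustrate what "intercepting and rewriting" could look like, here is a minimal sketch of the rewrite step on its own. It is not rtk's implementation and not any harness's real hook API; `WRAPPER` ("ctxsearch") is a hypothetical stand-in for the wrapper tool, and `rewrite_command` is an assumed function that a harness hook would call with the command string before running it.

```python
import shlex

# Hypothetical wrapper name; stands in for the tool that wraps ripgrep.
WRAPPER = "ctxsearch"

# Program names that should be redirected to the wrapper.
INTERCEPTED = {"rg", "ripgrep"}


def rewrite_command(command: str) -> str:
    """Swap a leading ripgrep call for the wrapper, keeping its arguments."""
    try:
        argv = shlex.split(command)
    except ValueError:
        # Unbalanced quotes or similar: leave the command untouched.
        return command
    if not argv or argv[0] not in INTERCEPTED:
        return command
    return shlex.join([WRAPPER, *argv[1:]])
```

With this in place the agent never has to remember the instruction: `rewrite_command("rg -n 'foo bar' src")` yields a `ctxsearch` call with the same arguments, and unrelated commands pass through unchanged. The sketch only handles a command that starts with ripgrep, not one buried inside a pipeline.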

Overall, never rely on an agent to follow an instruction that must be followed every time. For example, doing the work in a git hook beats asking the agent to run a multi-command workflow every time it commits.

Is this state of things forever? I don't think so. Very soon models will become so much better that this will be a non-problem.

[0] https://github.com/rtk-ai/rtk

dbt00 19 hours ago

I use a new VCS already (jj, highly recommended) and Claude forgets to use it all the time despite many obvious instructions in many places.