I've found that setting good guardrails, and running in a sandbox so that the agent doesn't keep asking tedious permission questions, makes things go a LOT smoother.

Generally, I spend anywhere between 15 mins and an hour setting things up (depending on how well the project is set up for AI work), and then set the agent going, coming back in a half-hour to an hour to check its progress. Generally, the tooling keeps it honest (for golang, forbidigo is AWESOME). 80% of the questions the agent asks me require a lot of thought. 20% of what it does needs correction.

The other thing to remember with LLMs is that they are NOT human, and won't react in a human way. So you'll see strokes of "brilliance" followed by the absolutely bizarre. But good guardrails keep that to a minimum.

elevation 5 hours ago | parent | next [-]

> sandbox so that the agent doesn't keep asking tedious permission questions

> 80% of the questions the agent asks me require a lot of thought. 20% of what it does needs correction.

I've found even the permissions questions give me veto power over fruitless lines of exploration, especially in planning mode. For instance, it wants to use tools I don't have installed to access information that I have made available elsewhere? I get a chance to override this decision by declining the permissions check and redirecting it. Feels tedious, but helps me understand what information sources are influencing it. I head off a lot of bugs this way.

	kstenerud 5 hours ago | parent | next [-]
		I never let it go into planning mode, other than to output a plan file that I can audit before giving it the go-ahead to implement. After that I don't want to be bothered, so --dangerously-skip-permissions keeps all but real questions out of the loop, and I can do something else while it works rather than babysit.
	kqr 2 hours ago | parent | prev [-]
		This is my impression too. Whenever it needs permissions outside of a small set of defaults I've allowed, it's often because it's trying to do something ridiculous that it doesn't need to do. I think the yoloist counter-argument is "So what? Let it. It'll take longer that way and consume more tokens, but you can work on something else in parallel instead of being hooked into this one session".

jayd16 4 hours ago | parent | prev | next [-]

How often are you going into new projects and spending up to an hour on set up? I'm really just asking to get a sense of what "Generally" means here.

	kstenerud 3 hours ago | parent [-]
		I do it with every project I go into. First step is setting up the documentation so that the agent can navigate it quickly, knows the idioms, knows the test gating procedure, design principles, coding standards, testing policies, etc. Once that's set up, I spend time laying out the planning for whatever feature or fix is being worked on. For fixes the agent is pretty quick and usually needs little guidance. For new features it's best to have more of a hand on the tiller.

epolanski 5 hours ago | parent | prev | next [-]

It doesn't change the premise.

AI should be assisting us, instead it's doing the job and it's us being an assistant to it. This is a monumental shift in how knowledge work is changing, one that people seem to be missing, and it goes beyond mere coding.

Guardrails, prompts, whatever: it's us helping it do the job, not the other way around.

Opus 4.6 was the last genuinely good assistant LLM, but since then it's quite clear that the training/reinforcement is focused on "given prompt -> do task", so its behavior is more and more about doing the task itself, not helping you. If you try to use it as an assistant it just sucks and is perma-wired into finding the solution. Many times I want it to help me investigate, and its answer will still be focused on the fix, not on answering my questions.

4.7 first, 4.8 later, and Fable are absolute disasters as assistants.

Fable in particular is so "intelligent" that it will push with very strong and intelligent takes even if it is completely wrong.

I have never disliked our job more.

kstenerud 5 hours ago | parent | next [-]

Wow... Our experiences have been very different, then. I've found each upgrade of Opus to be a noticeable improvement in its complex reasoning and delegation capabilities over its predecessor.

To me, this feels in many ways like a technical manager or team lead's job, where I guide the process along using my knowledge and experience, and then let the agent fill in the rest (to the best of its ability).

The agent can't really learn from its mistakes (at least, not without consuming precious context), so I apply a blameless postmortem process, updating the guardrails whenever it goes astray in the same way more than once.

And really, I'd rather be contemplating the more difficult and interesting questions of architecture, environment, ergonomics and market fit, so it suits me fine.

mwigdahl 5 hours ago | parent | next [-]

Same here. The power upgrade going to Fable in particular is quite impressive.

epolanski 4 hours ago | parent | prev [-]

> Wow... Our experiences have been very different, then. I've found each upgrade of Opus to be a noticeable improvement in its complex reasoning and delegation capabilities over its predecessor.

I didn't say that it's not more capable or more "intelligent"; quite the opposite.

I will try to expand on what I mean.

LLMs' "character/persona/tendencies" are less and less about acting as an assistant and more about finding the solution themselves.

I use AI in a specific way: it assists, investigates, and answers my questions. I do the coding. It is increasingly difficult to use it as such, because it quickly jumps into giving me solutions instead of answering my specific questions.

I'll give you a few examples.

I asked it to investigate how the DNS handling details in the Phoenix emailer module work. It did very little investigation and jumped into why I should've used magic links instead. Instead of assisting me in my research, it was hardwired to solve the problem (the wrong one, with a very wrong solution).

Today at work I had a problem with batching. I wanted to understand whether batching was even needed at all for our use case, and it kept circling around how to fix the batching bug instead. That's not what I asked it to do, yet it jumped to the "solution".

I am increasingly frustrated by these models' "personality" and tendencies, which are less about assisting me with the task at hand and more about the model doing it itself while I merely assist/supervise.

Sure, very detailed prompting on how it has to act helps, but wait a few turns and it drifts back to its default solution-vomiting state.

Which makes me think that these models are hardwired into this mode of operation by consistent training and reinforcement on jumping from prompt to code solution.

	kstenerud 4 hours ago | parent [-]
		Ah yes, the agents by default are very "implementation" oriented, which is why I instruct mine to never implement something without formulating a plan first for me to approve. Another thing they tend to do is rely on their own context -> memories -> training data. And if that's wrong then they'll continue with it until you instruct them to research, after which they usually get the right answer. I've noticed that the newer models keep track of what you type so as to anticipate what you're likely to say. For example, today Opus 4.8 said "You usually don't want me to commit until you've checked, so the change remains uncommitted."

taeric 5 hours ago | parent | prev | next [-]

I think this is just a misunderstanding of how most technology has always worked?

Consider what is happening in most construction sites. The heavy work is absolutely from the technology on site. But without people there to oversee it and keep it working, it would fail.

And that is almost certainly true at any industrial site. Indeed, look up videos of high-tech looms. A large portion of the technology added to them is there so that the operators can locate a fault and fix it.

senordevnyc 4 hours ago | parent | prev | next [-]

> AI should be assisting us, instead it's doing the job and it's us being an assistant to it.

If you're a manager and you ask a report to do something and they come back with a question, does that mean you're now their assistant?

I give agents the tasks, I answer their questions, I make choices about the tradeoffs in their plan, I supervise their implementation, I review their output, I have them walk me through things. In what way is this not delegating to them and managing their work, just like a more junior employee?

rmunn 5 hours ago | parent | prev | next [-]

The problem (okay, one of the problems) with renting other people's models is, as you mentioned, that they can and will change out the model without notifying you ahead of time, and you don't always get to control which model you use. (They might decide to retire it, and you won't be able to get it back if they do.)

Which is why (well, part of why) I think the long-term trend will be towards self-hosting models. Right now the frontier models are far enough ahead of the self-hosted ones that there are lots of people willing to pay by the token to rent someone else's model, because they get more value for money from that than from self-hosting models.

But the frontier companies won't be able to keep up their current levels of expenditure forever. At some point the investors are going to say "Hey, so, um, when am I going to see some return on my investment?" and then the current subsidized subscriptions (including the one my employer uses) are going to go away, much like what happened with Copilot this month.

And then the locally-hosted models are going to suddenly look like a more attractive picture. Because where you might have been willing to spend $100/month/employee to rent time on models in someone else's data center, you might suddenly balk at spending $500/month/employee. You might say "Hey, you know what? A $50,000 up-front capital investment is only, what, one month's worth of subscriptions for our 100 employees? Yeah, okay, I'll approve the hardware purchase. Get that self-hosted model set up and then we'll cancel the subscription and switch over."
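
The arithmetic in that scenario can be sketched as follows. The figures are the hypothetical ones from the paragraph above, `breakEvenMonths` is an illustrative name, and the sketch deliberately ignores power, maintenance, and staffing costs, which would lengthen the payback period.

```go
package main

import "fmt"

// breakEvenMonths (hypothetical helper) returns how many months of
// subscription spending add up to a one-time hardware cost.
// It assumes zero running costs for the self-hosted setup.
func breakEvenMonths(hardwareCost, perSeatMonthly float64, seats int) float64 {
	return hardwareCost / (perSeatMonthly * float64(seats))
}

func main() {
	const hardware = 50000.0 // up-front hardware cost in dollars
	const seats = 100        // number of employees

	// At $100/month/employee the hardware takes five months to pay off.
	fmt.Printf("at $100/seat: %.1f months\n", breakEvenMonths(hardware, 100, seats))
	// At $500/month/employee it pays off in a single month.
	fmt.Printf("at $500/seat: %.1f months\n", breakEvenMonths(hardware, 500, seats))
}
```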

Not everyone is going to do that. But once the locally-hosted models are good enough, the first few people who do so and report success are going to start a snowball effect. It will likely be driven by money first, but it will also have a side effect that people will slowly discover: you can better predict the behavior of the model you're using. It will continue to work the same way next year that it is working this year; or if it doesn't, it's because you chose to install the new version.

And when that happens (I'm saying "when", not "if" because although it might take some time, I think it's inevitable in the long run), the frontier-model rental companies are going to struggle to stay afloat. Except for the ones who saw this coming and transitioned to a non-subscription income source somehow (maybe by selling licenses to self-host their frontier models for $$BIGNUM), or who have some other revenue stream besides renting out models.

Applejinx 5 hours ago | parent | prev | next [-]

That sounds weirdly gendered even though there's no reason it should be.

Are you getting LLMsplained? :)

AnimalMuppet 5 hours ago | parent | prev [-]

Well... as a human software engineer, I've been the one with very strong, intelligent, completely wrong takes. The question is, are the LLMs improving faster than you can improve a junior dev? And is their ceiling as high?

smcleod 5 hours ago | parent | prev [-]

Your experience pretty much mirrors my own. I hate to be the 'they're holding it wrong' guy, but there are certainly a lot of people out there who have no real idea how to effectively leverage AI.

dawnerd 5 hours ago | parent [-]

That’s a problem with the tool, not the people. AI is marketed literally as writing one sentence and having some app perfectly output. Just check any of the landing pages for Claude Code or Codex or GitHub Copilot…

senordevnyc 4 hours ago | parent [-]

No, it literally isn't. I just looked at the landing pages for Claude Code, Codex, Cursor, and Copilot, and literally none of them have anything about "writing one sentence and having some app perfectly output", or anything remotely like that. In fact, just the opposite: they all make clear that they're built for ongoing collaboration with AI, and have detailed descriptions of what that looks like. No one is advertising the idea that you can one-shot perfect apps with these tools.
