jacquesm 15 hours ago

You're using it as a 'super compiler', effectively a code generator, and your .md file is the new abstraction level at which you code.

But there is a price to pay: the code you generate is not code you understand, and when things go pear-shaped you will find that the deterministic element that made compilers so successful is missing from code generated from specs dumped into an AI. If you one-shot it, you will find that the next time you do this your code may come out quite differently, unless it is a model that you maintain yourself. It may contain new bugs or revive old ones. It may eliminate chunks of the code without you ever knowing, and so on.

There is a reason that generated code has always had a bit of a smell to it, and AI-generated code is no different. How much time do you spend verifying that it actually does what it says on the tin?

Do you write your own tests? Do you let the AI write the tests and the code? Are you familiar with the degree to which AIs can be manipulated to do stuff that you thought they weren't supposed to? (A friend of mine just proved this to his boss by bribing an AI with a 'nice batch of pure random data' to put a piece of unreviewed code into production by giving itself the privileges required to do so...)

CharlieDigital 15 hours ago | parent | next [-]

We have human reviews on every PR.

Quality and consistency are going up, not down. Partially because the agents follow the guidance much more closely than humans do, and there is far less variance. Shortcuts that a human would take ("I'll just write a one-off here"), the agent does not, so long as our rules guide it properly ("Let me find existing patterns in the codebase.").

Part of it is the investment in docs we've made. Part of it is that we were already meticulous about commenting code. It turns out that when the agents stumble on this code, they can read the comments (we can tell because they also update them in PRs when making changes).

We are also delivering the bulk of our team-level capabilities via remote MCP over HTTP, so we have centralized telemetry via OTEL on tool activation, docs being read by the agents, and phantom docs the agents try to find (we then go and fill in those docs).
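To make the "phantom docs" idea concrete: a minimal sketch of a doc-serving tool that counts which docs agents read and which missing docs they request. This is purely illustrative; the actual setup described above uses an MCP server with OTEL metrics, while here a plain in-memory counter stands in, and the class and doc names are invented.

```python
# Hypothetical sketch: track doc reads and "phantom" doc requests.
# In the real setup this would be an MCP tool emitting OTEL metrics;
# a dict stands in for the metrics backend here.

class DocServer:
    def __init__(self, docs):
        self.docs = docs        # doc name -> content
        self.reads = {}         # doc name -> read count
        self.phantoms = {}      # missing doc name -> request count

    def get_doc(self, name):
        if name in self.docs:
            self.reads[name] = self.reads.get(name, 0) + 1
            return self.docs[name]
        # A miss is a signal: the agent expected a doc that doesn't
        # exist yet, so queue it up to be written.
        self.phantoms[name] = self.phantoms.get(name, 0) + 1
        return None

server = DocServer({"error-handling.md": "Always wrap external calls..."})
server.get_doc("error-handling.md")
server.get_doc("retry-policy.md")   # phantom: a doc worth writing next
print(sorted(server.phantoms))      # ['retry-policy.md']
```

The useful telemetry is the miss list: each phantom doc is a gap in the guidance that the agents themselves have surfaced.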

jacquesm 15 hours ago | parent | next [-]

> We have human reviews on every PR.

There are some studies about maintaining attention over long periods when no action is required. It will be difficult to keep that up forever, so beware of review fatigue and bake in some measures to ensure that attention does not diminish over time.

CharlieDigital 15 hours ago | parent [-]

The point of reviews is that the review process is a feedback cycle in which we can identify where our docs fall short. We then immediately update the docs to reflect the correction.

Over enough time, this gap closes and the need for reviews goes down. This is what I've noticed as we've continued to improve the docs: PRs have stabilized. Mid-level devs who just months ago were producing highly variable quality are now coalescing around a much higher, much more consistent level of output.

There were a lot of pieces that went into this. We created a local code review skill that encodes the exact heuristics our senior reviewers would use, and we ask the agent to run it in AGENTS.md. We have an MCP server over HTTP that delivers the docs, so we can monitor centralized telemetry.
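For illustration, an AGENTS.md entry wiring up such a review step might look like the sketch below. The skill name and heuristics are hypothetical; the thread doesn't show the actual file.

```markdown
<!-- Hypothetical AGENTS.md fragment; skill name and rules are invented. -->
## Code review

Before opening a PR, run the local `code-review` skill. It encodes our
senior reviewers' heuristics, for example:

- Search the codebase for an existing pattern before writing a one-off.
- Update any code comments your change makes stale.
- Flag new dependencies or privilege changes for human review.
```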

The objective is that at some point, there will be enough docs and improved models that the need for human reviews decreases while quality of code reaches a steady state that is more consistent than any human team of varying skill level could produce.

One thing we've done is to decouple the docs from the codebase to make it easier to update the docs and immediately distribute updates orthogonal to the lifecycle of a PR.

(I'll have a post at some point that goes into some of what we are doing and the methodology.)

g-b-r 13 hours ago | parent [-]

> The objective is that at some point, there will be enough docs and improved models that the need for human reviews decreases while quality of code reaches a steady state that is more consistent than any human team of varying skill level could produce

There will never be a point when human reviews are less needed. If you ever remove them, you're doomed to ship something horribly insecure at some point; please don't.

bmd1905 11 hours ago | parent [-]

[dead]

AnimalMuppet 15 hours ago | parent | prev [-]

> Partially because the agents follow the guidance much more closely than humans do and there is far less variance.

Ouch. Managing human coders has been described as herding cats (with some justice). Getting humans to follow standards is... challenging. And exhausting.

Getting AIs to do so... if you get the rules right, and if the tool doesn't ignore the rules, then you should be good. And if you're not, you still have human reviews. And the AI doesn't get offended if you reject the PR because it didn't follow the rules.

This is actually one of the best arguments for AIs that I have seen.

CharlieDigital 15 hours ago | parent [-]

Yes, as I mentioned in my other replies, what I've seen is that quality has gone up and coalesced around a much higher bar with far less variance than before as we've refined our docs and tooling.

In some cases it was "instant": a dev's MCP connection to our docs was down -> terrible PR. We fixed the connection and redid the PR -> it instantly followed the guides we have in place for best practices.

operatingthetan 15 hours ago | parent | prev [-]

>A friend of mine just proved this to his boss by bribing an AI with a 'nice batch of pure random data' to put a piece of unreviewed code into production by giving itself the privileges required to do so...

Okay that's pretty hilarious. Everyone has a vice!

jacquesm 15 hours ago | parent [-]

There is a chapter two to the story but I don't want to out my friend. You never know who reads HN.