bryanrasmussen 7 hours ago

Maybe there should be an LLM trained on a corpus of code deletions and cleanups.

ashdksnndck 44 minutes ago | parent | next [-]

I think this is in the training data since they use commit data from repos, but I imagine code deletions are rarer than they should be in the real data as well.

krackers 5 hours ago | parent | prev [-]

I'm guessing there's a very strong prior toward "just keep generating more tokens" rather than deleting code, and that prior needs to be overcome. Maybe this is done already, but since every git project comes with its own history, you could take a notable open-source project (like LLVM) and do RL training against each individual patch committed.

movedx01 43 minutes ago | parent [-]

Perhaps the problem is that RL on one patch at a time fails to capture the overarching long-term theme: an architecture change introduced gradually over many months, which exists in the maintainer's mental model but not really explicitly in the diffs.