storus 8 hours ago

Wasn't the best practice to run one model/coding agent that writes the code and another one that reviews it? E.g. Claude Code for writing the code, GPT Codex to review/critique it? Different reward functions.
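The writer/reviewer split can be sketched as a simple loop. The two model calls below are stubbed out with hypothetical placeholder functions (`generate_code`, `review_code` are not real Claude or Codex APIs); the premise being tested is that models trained with different rewards surface different failure modes:

```python
# Minimal sketch of the writer/reviewer split, with both model calls
# stubbed out. In practice each would hit a different provider's API
# (e.g. Claude to write, Codex to review).

def generate_code(prompt: str) -> str:
    # stand-in for the "writer" model
    return "def add(a, b):\n    return a + b\n"

def review_code(code: str) -> list[str]:
    # stand-in for the "reviewer" model; returns a list of issues
    issues = []
    if "try" not in code:
        issues.append("no error handling")
    return issues

draft = generate_code("write an add function")
issues = review_code(draft)
if issues:
    # feed the critique back to the writer for another pass
    draft = generate_code("write an add function; address: " + "; ".join(issues))
```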

8note 3 hours ago | parent | next [-]

Even in one agent, a different starting prompt will have you tracing a very different path through the model.

Maybe it still lands you in the same valley, but there are so many parameters and dimensions that I don't think that's very likely unless the answer is also correct.

xandrius 5 hours ago | parent | prev | next [-]

I think people are misunderstanding reward functions and LLMs.

LLMs don't actually have a reward system like some other ML models.

storus 4 hours ago | parent [-]

They are trained with one, and when you look at DPO you can say they contain an implicit one as well.
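To make the "implicit reward" point concrete: the DPO derivation shows the fine-tuned policy defines a reward r(x, y) = β·log(π_θ(y|x) / π_ref(y|x)), up to a prompt-dependent constant β·log Z(x). A minimal sketch in terms of sequence log-probabilities:

```python
def dpo_implicit_reward(logp_policy: float, logp_ref: float, beta: float = 0.1) -> float:
    # DPO's implicit reward, up to a prompt-dependent constant beta*log Z(x):
    #   r(x, y) = beta * log( pi_theta(y|x) / pi_ref(y|x) )
    # Sequences the fine-tuned policy favors over the reference model
    # get positive implicit reward; disfavored ones get negative reward.
    return beta * (logp_policy - logp_ref)
```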

throwatdem12311 3 hours ago | parent | prev [-]

It’s superstition that using one slop generator to “review” the slop from a different brand of slop generator somehow makes things better. It’s slop all the way down.

storus 2 hours ago | parent [-]

https://github.com/karpathy/llm-council

https://ui.adsabs.harvard.edu/abs/2025arXiv250214815C/abstra...

https://www.arxiv.org/abs/2509.23537

https://www.aristeidispanos.com/publication/panos2025multiag...

https://arxiv.org/abs/2305.14325

https://arxiv.org/abs/2306.05685

https://arxiv.org/abs/2310.19740v1