liamconnell 12 hours ago

> It is, but when a model/harness/tools/system prompts are the same/similar in the generator and reviewer fail in similar ways.

Is there empirical evidence for that? Where is it on an epistemic scale from (1) “it sounds good when I say it” to (10) “someone ran an evaluation and got significant support”?

“Vibes” (2 or 3 on that scale) are OK; I'm just honestly curious.