Remix clone Hacker News

	liamconnell 12 hours ago
		> It is, but when the model/harness/tools/system prompts are the same/similar in the generator and reviewer, they fail in similar ways.

		Is there empirical evidence for that? Where does it fall on an epistemic meter between (1) “it sounds good when I say it” and (10) “someone ran an evaluation and got significant support”? “Vibes” (2/3 on the scale) are OK; I'm just honestly curious.