Remix clone Hacker News

	NitpickLawyer 2 hours ago
		This might actually be the whole value prop of this benchmark. Forget their initial scores, take open models (so we can be sure the base doesn't change), and test different combinations of harness + prompts + strategies + whatever memthing is popular today. See if the scores improve. Repeat.
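		Roughly what I have in mind, as a sketch only: pin one open model, sweep every harness/prompt/strategy combination, and rank by score. Everything named here is hypothetical. `run_benchmark` stands in for whatever actually runs the benchmark and returns a score, and `fake_benchmark` is a toy placeholder so the loop runs end to end; it measures nothing.

```python
from itertools import product
from typing import Callable, List, Tuple

Config = Tuple[str, str, str]  # (harness, prompt, strategy)


def sweep(
    model: str,
    harnesses: List[str],
    prompts: List[str],
    strategies: List[str],
    run_benchmark: Callable[[str, str, str, str], float],
) -> List[Tuple[Config, float]]:
    """Score every combination against one fixed model, best first."""
    results = []
    for harness, prompt, strategy in product(harnesses, prompts, strategies):
        # The model never changes, so score differences come from the setup.
        score = run_benchmark(model, harness, prompt, strategy)
        results.append(((harness, prompt, strategy), score))
    results.sort(key=lambda item: item[1], reverse=True)
    return results


def fake_benchmark(model: str, harness: str, prompt: str, strategy: str) -> float:
    # Hypothetical stand-in: a deterministic toy score, NOT a real evaluation.
    return float(len(harness) + len(prompt) + len(strategy))


if __name__ == "__main__":
    ranked = sweep("some-open-model", ["a", "bb"], ["p"], ["s", "ttt"], fake_benchmark)
    for config, score in ranked:
        print(config, score)
```

		Swap `fake_benchmark` for a real runner and rerun the same sweep whenever a new harness or prompt trick shows up; with the weights pinned, the ranking is comparable across runs.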