Remix clone Hacker News


	▲	kostaj 2 hours ago
		Indeed. I prompted each model once, plus one retry on errors. Measuring inter-model disagreement is a very good point; I will add it in the next version. Section "4.2 Agreement w/ peer majority" shows how often each model agrees with the majority. And yes, I am planning to human-label the same corpus of 1,000 claims and publish a second study measuring the models' performance against the human labels, on a corpus that the models have not seen during training.
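
		For anyone curious what an agreement-with-peer-majority score can look like, here is a minimal sketch. It assumes each model gives exactly one label per claim and that ties among the peers are skipped; the function name, the model names, and the toy labels are all made up for illustration, and this is not necessarily how section 4.2 computes it.

```python
from collections import Counter


def peer_majority_agreement(labels):
    """Fraction of claims on which each model matches the majority of its peers.

    labels: dict mapping model name -> list of labels, one per claim,
            with every list in the same claim order (hypothetical layout).
    """
    models = list(labels)
    n_claims = len(next(iter(labels.values())))
    agreement = {}
    for model in models:
        peers = [m for m in models if m != model]
        matched = 0
        counted = 0
        for i in range(n_claims):
            # Two most common peer labels, to detect a tie for first place.
            votes = Counter(labels[p][i] for p in peers).most_common(2)
            if len(votes) > 1 and votes[0][1] == votes[1][1]:
                continue  # no clear peer majority on this claim, skip it
            counted += 1
            matched += labels[model][i] == votes[0][0]
        agreement[model] = matched / counted if counted else float("nan")
    return agreement


# Toy data: four hypothetical models, four claims.
example = {
    "model_a": ["true", "false", "true", "true"],
    "model_b": ["true", "false", "false", "true"],
    "model_c": ["true", "true", "false", "true"],
    "model_d": ["true", "false", "false", "false"],
}

for name, score in peer_majority_agreement(example).items():
    print(f"{name}: {score:.2f}")
```

		Leaving the model's own vote out of the majority avoids inflating its score. Skipping ties is only one possible choice; counting them as disagreement would be equally defensible.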