LoganDark 3 hours ago
According to the article, nearly half of the dataset is synthetic (8T out of 17T tokens). I don't know what constitutes "a breadth of state-of-the-art rephrasing approaches", but I have limited confidence in models trained on LLM output, so I hope that isn't what it means.