LoganDark 3 hours ago

According to the article, nearly 50% of the dataset is synthetic (8T out of 17T tokens). I don't know what constitutes "a breadth of state-of-the-art rephrasing approaches", but I don't have much confidence in models trained on LLM output, so I hope it wasn't that.