Remix clone Hacker News


	▲	janalsncm 6 hours ago
		It’s a data point. I could imagine that in a hardware-constrained setting we might not care about training on enormous token counts, and on smaller devices it’s great if we can simplify the architecture. I agree that this isn’t proof that it scales to trillions of tokens, but it does show that a scaled-up experiment would be worth a shot.
	▲	Philpax 5 hours ago
		The Chinchilla scaling laws give you a minimum for the number of tokens you should be using for a given size: if you can't meet what they suggest for that size, you should shrink the model, because otherwise its capacity is going to waste. I do agree that it is a data point, but GP's point is that this model was undertrained, so it's hard to draw the same conclusions from it that we would from other research.
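		For a rough sense of the numbers, here is a minimal sketch of that check. It assumes the commonly quoted rule of thumb of about 20 training tokens per parameter, which is an approximation of the Chinchilla result rather than an exact law, and the usual estimate of about 6 FLOPs per parameter per token for training compute. The function names are made up for illustration.

```python
# Back-of-the-envelope Chinchilla-style check.
# ASSUMPTION: ~20 tokens per parameter is a widely used rule of thumb
# derived from the Chinchilla results; the real fitted ratio varies with
# compute budget and dataset, so treat this as a rough guide only.
TOKENS_PER_PARAM = 20


def suggested_tokens(n_params: float, ratio: float = TOKENS_PER_PARAM) -> float:
    """Approximate token count suggested for a model with n_params parameters."""
    return n_params * ratio


def max_params_for_tokens(n_tokens: float, ratio: float = TOKENS_PER_PARAM) -> float:
    """Largest model size the rule of thumb supports for a given token budget."""
    return n_tokens / ratio


def is_undertrained(n_params: float, n_tokens: float,
                    ratio: float = TOKENS_PER_PARAM) -> bool:
    """True if the token budget falls short of the suggested amount."""
    return n_tokens < suggested_tokens(n_params, ratio)


def training_flops(n_params: float, n_tokens: float) -> float:
    """Rough training compute, using the common C ~= 6 * N * D estimate."""
    return 6 * n_params * n_tokens


if __name__ == "__main__":
    n_params = 1e9   # hypothetical 1B-parameter model
    n_tokens = 5e9   # hypothetical 5B-token training run
    print("suggested tokens:", suggested_tokens(n_params))
    print("undertrained:", is_undertrained(n_params, n_tokens))
    print("model size this budget supports:", max_params_for_tokens(n_tokens))
    print("approx. training FLOPs:", training_flops(n_params, n_tokens))
```

		With these hypothetical numbers, the 1B model falls well short of the suggested token count, so the rule of thumb would point to either more data or a smaller model.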