Remix clone Hacker News

	▲	himata4113 2 hours ago
		The concept is similar to how it works in inference: instead of performing regressive writes to the entire model, you run the whole model, but part of it can live in system memory and get swapped in and out on demand. So only XB parameters are active in training. Edit: I am not really sure it works like that. I haven't looked too deeply into DeepSeek V4 Pro specifically.
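
To make the swap-in/swap-out idea concrete, here is a toy sketch in plain Python. It is only an illustration of the general pattern described above, not how DeepSeek or any real training framework does it. The class name `SwappedBlocks`, the LRU eviction policy, and the "device" cache are all assumptions made up for this example; plain lists of floats stand in for parameter blocks.

```python
from collections import OrderedDict


class SwappedBlocks:
    """Toy model: all parameter blocks live in 'host' (system) memory and
    only a few are resident in a small 'device' cache at a time.
    Hypothetical names; LRU eviction is an assumption for illustration."""

    def __init__(self, host_blocks, device_capacity):
        self.host = dict(host_blocks)   # full model, kept in system memory
        self.device = OrderedDict()     # resident subset, oldest first
        self.capacity = device_capacity
        self.swap_ins = 0               # number of host -> device copies

    def fetch(self, name):
        """Return a block, swapping it in from host memory if needed."""
        if name in self.device:
            self.device.move_to_end(name)  # mark as most recently used
            return self.device[name]
        if len(self.device) >= self.capacity:
            # Evict the least recently used block and write it back to host.
            evicted, weights = self.device.popitem(last=False)
            self.host[evicted] = weights
        self.device[name] = self.host[name]
        self.swap_ins += 1
        return self.device[name]

    def update(self, name, delta):
        """Apply a stand-in 'training update' to one resident block."""
        block = self.fetch(name)
        self.device[name] = [w + delta for w in block]
```

With a capacity of two blocks and three blocks in total, updating the third block forces the first one out, and its updated values are written back to host memory at that point. Blocks still resident in the cache are not written back until they are evicted, so a real system would also need an explicit flush.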