DoctorOetker | 4 days ago
Is it not much simpler to parallelize by having different "readers" (using the same model parameters/weights) process different parts of the corpus in parallel? reader A is reading book A, while reader B is reading book B etc...? Is there a deeper reason why more complicated parallelization as in the OP or the article it references is more desirable? | ||||||||
jsharf | 3 days ago
If you have independent copies of the network each computing gradients, then you're effectively making the batch size smaller, unless you do an all-reduce to sync them, in which case there's a lot of communication overhead.

When you take a batch and calculate gradients, you're effectively calculating a direction the weights should move in, and then taking a step in that direction. You can do more steps at once by doing what you say, but they might not all be exactly in the right direction, so overall efficiency is hard to compare.

I am not an expert, but if I understand correctly, I think this is the answer.
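A toy numpy sketch of the trade-off described above (my illustration, not from either article): with equal-sized shards, averaging the per-worker gradients after a sync reproduces exactly the gradient of the full batch, so synced data parallelism is one big step, not several independent small ones.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))   # batch of 8 examples, 3 features
y = rng.normal(size=8)
w = rng.normal(size=3)

def grad(Xb, yb, w):
    # gradient of the mean squared error 0.5*mean((Xb@w - yb)**2) w.r.t. w
    return Xb.T @ (Xb @ w - yb) / len(yb)

# one "reader": gradient over the whole batch
g_full = grad(X, y, w)

# two "readers" on equal shards, gradients averaged (what an all-reduce does)
g_a = grad(X[:4], y[:4], w)
g_b = grad(X[4:], y[4:], w)
g_avg = (g_a + g_b) / 2

print(np.allclose(g_full, g_avg))  # True: same direction, only after syncing
```

Without the sync, each worker would take its own step from `g_a` or `g_b` alone, which is the "effectively smaller batch" case.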
zozbot234 | 3 days ago
AIUI, the thinking when developing transformers might have been that "reading text A vs. text B" just isn't parallel enough for truly large-scale training. The problem was to somehow also parallelize the learning of very long range dependencies within a single sequence, and transformers managed to do that. |
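To make the within-sequence point concrete, here is a minimal numpy sketch (my illustration, toy sizes): scaled dot-product attention relates every pair of positions in a sequence with a couple of matrix multiplies, whereas an RNN would need T sequential steps to connect position 0 to position T-1.

```python
import numpy as np

rng = np.random.default_rng(1)
T, d = 6, 4                    # sequence length, model dim (toy sizes)
Q = rng.normal(size=(T, d))    # queries, keys, values for one sequence
K = rng.normal(size=(T, d))
V = rng.normal(size=(T, d))

# every position attends to every other position at once -- no recurrence
scores = Q @ K.T / np.sqrt(d)                              # (T, T) all pairs
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)             # softmax rows
out = weights @ V                                          # (T, d) outputs

# all T outputs were computed in parallel; an RNN would need T serial steps
```

That all-pairs matmul is what makes the long-range dependency learning itself parallelizable across the sequence, on top of the per-document parallelism the question describes.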