That's fine. I've made no claim about any given training process. I've addressed the annoying, repetitive dismissal via the "but they're next token predictors" argument. The point is that being next token predictors places no bound on what they can compute in principle, so it's a meaningless argument.

wizzwizz4 2 years ago | parent [-]

The architecture of the model does place limits on how much computation can be performed per token generated, though. Combined with the window size, that's a hard bound on computational complexity that's significantly lower than a Turing machine – unless you do something clever with the program that drives the model.

	vidarh 2 years ago | parent [-]
		Hence the requirement for using the context for IO. A Turing machine requires two memory "slots" (the position of the read head, and the current state) + IO and a loop. That doesn't require much cleverness at all.
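To make that concrete, here is a minimal sketch of the setup described above, not a claim about how any real model or harness works. The names `step`, `run`, and `INCREMENT` are hypothetical. `step` stands in for one bounded-cost call (a single table lookup, analogous to one token's worth of fixed computation), the driver loop keeps only the two "slots" (state and head position), and the tape plays the role of the context used for IO.

```python
# Hypothetical rule table: increments a binary number written on the tape.
# Key: (state, symbol read) -> (symbol to write, head move, next state)
INCREMENT = {
    ("right", "0"): ("0", "R", "right"),
    ("right", "1"): ("1", "R", "right"),
    ("right", "_"): ("_", "L", "carry"),  # passed the last digit, start carrying
    ("carry", "1"): ("0", "L", "carry"),  # 1 + carry = 0, keep carrying
    ("carry", "0"): ("1", "L", "HALT"),   # 0 + carry = 1, done
    ("carry", "_"): ("1", "L", "HALT"),   # overflow into a new leading digit
}


def step(state, symbol, rules):
    """One bounded-cost transition: a single lookup, no loops, no hidden memory."""
    return rules[(state, symbol)]


def run(rules, tape_input, start="right", halt="HALT", blank="_", max_steps=10_000):
    """Driver loop. The tape is external IO; only state and head are kept here."""
    tape = dict(enumerate(tape_input))  # unbounded in principle: grows on demand
    state, head = start, 0              # the two memory "slots"
    for _ in range(max_steps):          # step cap is an assumption of this sketch
        if state == halt:
            break
        write, move, state = step(state, tape.get(head, blank), rules)
        tape[head] = write
        head += 1 if move == "R" else -1
    if not tape:
        return state, ""
    cells = (tape.get(i, blank) for i in range(min(tape), max(tape) + 1))
    return state, "".join(cells).strip(blank)


print(run(INCREMENT, "1011"))
```

The per-call work in `step` never grows with the input; all of the unbounded behaviour comes from the loop plus the external tape, which is the role the comment assigns to the context.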