| ▲ | wizzwizz4 3 days ago |
| No such evidence exists: we can construct such a model manually. I'd need some quite convincing evidence that any given training process is approximately equivalent to that, though. |
|
| ▲ | vidarh 3 days ago | parent [-] |
| That's fine. I've made no claim about any given training process. I've addressed the annoying, repetitive dismissal via the "but they're next token predictors" argument. The point is that being next token predictors does not constrain their theoretical capabilities, so it's a meaningless argument. |
| ▲ | wizzwizz4 3 days ago | parent [-] |
| The architecture of the model does place limits on how much computation can be performed per token generated, though. Combined with the window size, that's a hard bound on computational complexity that's significantly lower than a Turing machine – unless you do something clever with the program that drives the model. |
| ▲ | vidarh 3 days ago | parent [-] |
| Hence the requirement for using the context for IO. A Turing machine requires two memory "slots" (the position of the read head and the current state), plus IO and a loop. That doesn't require much cleverness at all. |
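The driver loop the comment describes can be sketched as follows. This is an illustrative toy (not from the thread): `predict_next_token` stands in for the model as a fixed function of the context, the context carries the tape, and the driver holds only the two "slots" mentioned above (head position and current state). The bit-flipping machine and all names here are hypothetical.

```python
# Hypothetical transition table for a tiny machine that flips bits
# until it reads a blank ("_"), then halts.
TRANSITIONS = {
    # (state, symbol) -> (new_state, symbol_to_write, head_move)
    ("flip", "0"): ("flip", "1", +1),
    ("flip", "1"): ("flip", "0", +1),
    ("flip", "_"): ("halt", "_", 0),
}

def predict_next_token(context):
    """Stand-in for the model: deterministically 'predicts' one token
    encoding the machine's next step, given the full context."""
    state, head, tape = context["state"], context["head"], context["tape"]
    symbol = tape[head] if head < len(tape) else "_"
    return TRANSITIONS[(state, symbol)]

def run(initial_tape):
    # The driver loop: feed the context to the predictor, apply the
    # predicted step back into the context, repeat until halt.
    context = {"state": "flip", "head": 0, "tape": list(initial_tape)}
    while context["state"] != "halt":
        new_state, write, move = predict_next_token(context)
        if context["head"] >= len(context["tape"]):
            context["tape"].append("_")      # extend the tape on demand
        context["tape"][context["head"]] = write
        context["state"] = new_state
        context["head"] += move
    return "".join(context["tape"])

print(run("0110_"))  # -> 1001_
```

The per-step computation stays bounded (one table lookup), matching the per-token compute bound raised upthread; unboundedness comes entirely from the loop and the externally growable tape.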
|
|