▲ | thesz 2 days ago |
The autoregressive models are not compute-matched, and this is the major drawback. There is evidence that training RNN models that compute several steps with the same input and coefficients (but a different state) leads to better performance. This was shown in a follow-up to [1] that performed an ablation study: they fixed the number of time steps instead of varying it and got better results. Unfortunately, I forgot the title of that ablation paper.

[1] https://arxiv.org/abs/1611.06188
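To make the idea concrete, here is a minimal sketch (my own illustration, not code from either paper) of an RNN cell applied a fixed number of times to the same input with the same weights, so that only the hidden state evolves between steps:

    import numpy as np

    # Illustrative sketch: an RNN cell applied K times to the same input x
    # with the same weights; only the hidden state h changes between steps.
    # All names (W_x, W_h, K) are assumptions for illustration.
    rng = np.random.default_rng(0)
    d_in, d_h, K = 8, 16, 4            # K = fixed number of extra "thinking" steps
    W_x = rng.normal(scale=0.1, size=(d_h, d_in))
    W_h = rng.normal(scale=0.1, size=(d_h, d_h))
    b = np.zeros(d_h)

    def step(h, x):
        # One recurrent update; the weights are shared across all K steps.
        return np.tanh(W_h @ h + W_x @ x + b)

    def encode(x):
        # Re-apply the same cell K times to the same input before reading out.
        h = np.zeros(d_h)
        for _ in range(K):
            h = step(h, x)
        return h

    h = encode(rng.normal(size=d_in))
    print(h.shape)  # (16,)

The point is that the extra compute per input comes from repeated application of the same cell, not from more parameters.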
▲ | kevinwang 2 days ago | parent | next [-]
Not sure if you meant this, since it doesn't cite the paper you mention, but it's similar work: "An Investigation of Model-Free Planning", Guez et al. (DeepMind), 2019: https://arxiv.org/abs/1901.03559
▲ | imtringued 2 days ago | parent | prev [-]
It has already been proven that deep equilibrium models with a single layer are equivalent to models with a finite number of layers, and the converse as well: you can get the performance of a DEQ using a finite number of layers. The fixed-point nature of DEQs means that they inherently have a concept of self-assessment of how close they are to the solution. If they are at the solution, they will simply stop changing it; if not, they will keep performing calculations.
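Roughly, the forward pass looks like the sketch below (my own toy illustration, not a specific published model; the layer f and the tolerance are assumptions). The residual between successive iterates is the built-in measure of how far the model is from the solution:

    import numpy as np

    # Illustrative sketch of a deep equilibrium layer: iterate z = f(z, x)
    # to a fixed point and stop once z stops changing.
    rng = np.random.default_rng(0)
    d = 16
    W_z = rng.normal(scale=0.05, size=(d, d))   # small weights keep f roughly contractive
    W_x = rng.normal(scale=0.1, size=(d, d))
    b = np.zeros(d)

    def f(z, x):
        return np.tanh(W_z @ z + W_x @ x + b)

    def deq_forward(x, tol=1e-5, max_iter=100):
        z = np.zeros_like(x)
        for i in range(max_iter):
            z_next = f(z, x)
            # The residual acts as the "self-assessment": distance from the fixed point.
            if np.linalg.norm(z_next - z) < tol:
                return z_next, i + 1
            z = z_next
        return z, max_iter

    x = rng.normal(size=d)
    z_star, iters = deq_forward(x)
    print(iters, np.linalg.norm(f(z_star, x) - z_star))

Easy inputs converge in few iterations and hard ones take more, which is exactly the adaptive-compute behavior being discussed.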