bob1029 13 hours ago

> To learn, agents must experience high-value states, which are hard (or impossible) for untrained agents to reach. The endgame-only envs were the final piece to crack 65k. The endgame requires tens of thousands of correct moves where a single mistake ends the game, but to practice, agents must first get there.

This seems really similar to the motivations around masked language modeling. By providing increasingly-masked targets over time, a smooth difficulty curve can be established. Randomly masking X% of the tokens/bytes is trivial to implement. MLM can take a small corpus and turn it into an astronomically large one.
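A minimal sketch of that masking step (illustrative only; MASK_ID, the token ids, and the mask fraction are placeholders I made up, not anything from the article):

  import random

  MASK_ID = 0  # hypothetical id reserved for the [MASK] token

  def mask_sequence(tokens, mask_fraction):
      """Return (masked_tokens, targets): targets hold the original ids
      at masked positions and None elsewhere."""
      masked, targets = [], []
      for tok in tokens:
          if random.random() < mask_fraction:
              masked.append(MASK_ID)
              targets.append(tok)   # the model must predict this token
          else:
              masked.append(tok)
              targets.append(None)  # this position is not scored
      return masked, targets

  # The same short corpus yields a different training example every call:
  seq = [17, 4, 93, 8, 55, 21]
  print(mask_sequence(seq, 0.15))
  print(mask_sequence(seq, 0.15))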

algo_trader 12 hours ago | parent | next [-]

This is less about masked modelling and more about reverse curriculum learning.

e.g. the DeepCubeA paper (2019!) on solving the Rubik's Cube.

Start with the solved state and train the network on successively harder states. This is so "obvious" and "unhelpful in real domains" that perhaps they haven't heard of this paper.
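A rough sketch of that reverse-curriculum state generation (illustrative: a list permutation stands in for cube states, and apply_random_move plus the scramble depths are made up, not taken from the paper):

  import random

  SOLVED = list(range(8))

  def apply_random_move(state):
      # Stand-in for a single cube move: swap two adjacent positions.
      s = state[:]
      i = random.randrange(len(s) - 1)
      s[i], s[i + 1] = s[i + 1], s[i]
      return s

  def sample_training_state(scramble_depth):
      """Walk backwards from the solved state by `scramble_depth` moves."""
      state = SOLVED[:]
      for _ in range(scramble_depth):
          state = apply_random_move(state)
      return state

  # Curriculum: early training sees states one move from solved; later it
  # sees deep scrambles the agent could never have reached on its own.
  for depth in (1, 3, 10):
      print(depth, sample_training_state(depth))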

larrydag 13 hours ago | parent | prev [-]

Perhaps I'm missing something. Why not start the learning at a later state?

LatencyKills 13 hours ago | parent | next [-]

If the goal is to achieve end-to-end learning, that would be cheating.

If you sat down to solve a problem you’d never seen before, you wouldn’t even know what a valid “later state” would look like.

taeric 2 hours ago | parent [-]

Why is it cheating? We literally teach sports this way: oftentimes you teach a sport through scaled-down scenarios. I see no reason this should be different.

bob1029 13 hours ago | parent | prev [-]

That's effectively what you get in either case. With MLM, on the first learning iteration you might mask exactly one token per sequence. This is equivalent to starting learning at a later state. The direction of the curriculum flows toward more and more tokens being masked over time, which is equivalent to starting from earlier and earlier states. Eventually, you mask 100% of the sequence and you are starting from zero.
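A sketch of that schedule direction (the sequence length, step counts, and linear ramp are illustrative, not anything prescribed above):

  def mask_fraction(step, total_steps, seq_len):
      """Linear ramp from 1/seq_len (one masked token) to 1.0 (fully masked)."""
      start = 1.0 / seq_len
      frac = start + (1.0 - start) * (step / total_steps)
      return min(frac, 1.0)

  seq_len, total_steps = 128, 10_000
  for step in (0, 5_000, 10_000):
      print(step, round(mask_fraction(step, total_steps, seq_len), 3))
  # 0 -> ~0.008 (one token), 5000 -> ~0.504, 10000 -> 1.0 (start from zero)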