moyix a day ago

Note that MuZero did better than AlphaGo, without access to preprogrammed rules: https://en.wikipedia.org/wiki/MuZero

smokel a day ago | parent | next [-]

Minor nitpick: it did not use preprogrammed rules when searching through the game tree, but it did use preprogrammed rules to ensure that no illegal moves were made during play.

hulium a day ago | parent [-]

During play, yes, obviously you need an implementation of the game to play it. But in its planning tree, no:

> MuZero only masks legal actions at the root of the search tree where the environment can be queried, but does not perform any masking within the search tree. This is possible because the network rapidly learns not to predict actions that never occur in the trajectories it is trained on.

https://arxiv.org/pdf/1911.08265
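
To make the distinction concrete: in an MCTS implementation this amounts to masking the policy prior only when expanding the root node. A minimal sketch (the Node class and function names are hypothetical, not DeepMind's code):

    import numpy as np

    class Node:
        def __init__(self, prior):
            self.prior = prior
            self.children = {}      # action -> Node
            self.visit_count = 0
            self.value_sum = 0.0

    def expand_root(node, policy_logits, legal_actions):
        # Root only: the real environment can be queried, so illegal
        # actions are dropped before the prior is normalized.
        logits = {a: policy_logits[a] for a in legal_actions}
        total = sum(np.exp(v) for v in logits.values())
        for a, v in logits.items():
            node.children[a] = Node(prior=np.exp(v) / total)

    def expand_interior(node, policy_logits):
        # Inside the tree: no legality check at all. Every action gets
        # a prior; the network is trusted to put ~0 mass on moves that
        # never occur in its training trajectories.
        total = sum(np.exp(v) for v in policy_logits.values())
        for a, v in policy_logits.items():
            node.children[a] = Node(prior=np.exp(v) / total)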

skywhopper 21 hours ago | parent [-]

That is exactly what the commenter was saying.

Zacharias030 16 hours ago | parent | next [-]

It is consistent with what the commenter was saying.

In any case, for Go - speaking with a mild amount of expert knowledge - this limitation is most likely irrelevant outside of very rare endgame situations or special superko setups, where a lack of legal moves or solutions pushes some probability onto moves that look like wishful thinking.

I think this is not a significant limitation of the work (not that any parent claimed otherwise). MuZero is acting in an environment with a prescribed action space; it is simply "planning with a learned model," without access to the simulation environment inside the search.
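
Concretely, "planning with a learned model" means the search unrolls three learned functions rather than the game engine. A rough schematic of the paper's representation/dynamics/prediction split (layer sizes and names here are placeholders, not the actual architecture):

    import torch
    import torch.nn as nn

    class MuZeroModel(nn.Module):
        def __init__(self, obs_dim, num_actions, hidden_dim=256):
            super().__init__()
            # h: observation -> hidden state (representation function)
            self.represent = nn.Linear(obs_dim, hidden_dim)
            # g: (hidden state, action) -> (next hidden state, reward)
            self.dynamics = nn.Linear(hidden_dim + num_actions, hidden_dim + 1)
            # f: hidden state -> (policy logits, value)
            self.predict = nn.Linear(hidden_dim, num_actions + 1)

        def recurrent_inference(self, hidden, action_onehot):
            # Unrolled at every tree node; the game engine is never
            # called here, so nothing can check move legality.
            out = self.dynamics(torch.cat([hidden, action_onehot], dim=-1))
            next_hidden, reward = out[..., :-1], out[..., -1]
            pv = self.predict(next_hidden)
            return next_hidden, reward, pv[..., :-1], pv[..., -1]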

---

What I am less convinced by is the claim that MuZero reaches higher performance than previous AlphaZero variants. What is the comparison based on: iso-FLOPs, iso-search-depth, iso-self-play-games, or iso-wall-clock-time? What would make sense here?

Each AlphaGo variant was trained on some sort of embarrassingly parallel compute cluster, yet every paper included a punchline for general audiences that some performance level was reached "in just 30 hours."

gnfargbl 20 hours ago | parent | prev [-]

The more detailed clarification of what "preprogrammed rules" actually means in this case made the entire discussion significantly clearer to me. I think it was helpful.

CGamesPlay 14 hours ago | parent | prev [-]

This is true, and MuZero's paper notes that it did better with less computation than AlphaZero. But it still used about 10x more computation to get there than AlphaGo, which was "bootstrapped" with human expert moves. I think this is very important context for anyone trying to implement an AI for their own game.
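
For anyone weighing that trade-off: the "bootstrapping" was plain supervised learning of a policy network on human (position, move) pairs before any self-play, as with AlphaGo's SL policy network. A minimal sketch (the dataset, network, and dimensions are placeholders):

    import torch
    import torch.nn as nn

    def pretrain_policy(policy_net, expert_loader, epochs=10, lr=1e-3):
        # Behavior cloning: predict the expert's move from the position.
        opt = torch.optim.Adam(policy_net.parameters(), lr=lr)
        loss_fn = nn.CrossEntropyLoss()
        for _ in range(epochs):
            for board_tensor, expert_move_idx in expert_loader:
                logits = policy_net(board_tensor)   # e.g. 361 logits for 19x19 Go
                loss = loss_fn(logits, expert_move_idx)
                opt.zero_grad()
                loss.backward()
                opt.step()
        return policy_net

A net pretrained this way gives the self-play phase a sensible starting policy, which is where the compute savings relative to learning from scratch come from.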