janalsncm 4 hours ago

There’s a ton of crossover between your method and RL. I guess instead of directly training on episodes and updating model weights, you just store episodes in RAM and sample from the most promising ones. It could be a neat way of getting out of the infamous RL cold start by collecting some early examples of rewards. Thanks for sharing.

Naulius 3 hours ago

Thanks! You're right that there's a resemblance to RL. The original approach was proposed by Antithesis, and in Part 1 we map it more directly to a mutation-based Genetic Algorithm: stored paths are the population, the x-position scoring is the fitness function, and bit-flip input generation is the mutation operator. There's no recombination and no learned policy, just evolutionary selection pressure on input sequences.
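In pseudocode the loop looks roughly like the Python sketch below. All names here are mine, not from the posts, and run_inputs stands in for whatever replays a button-input sequence in the game and reports the farthest x-position reached:

    import random

    POPULATION_SIZE = 50
    BITFLIP_RATE = 0.02

    def mutate(inputs):
        """Bit-flip mutation: each button bit of each frame's input byte may flip."""
        out = []
        for frame in inputs:
            for bit in range(8):
                if random.random() < BITFLIP_RATE:
                    frame ^= 1 << bit
            out.append(frame)
        # occasionally append fresh random frames so paths can grow longer
        out.extend(random.randrange(256) for _ in range(random.randint(0, 30)))
        return out

    def explore(run_inputs, generations=1000):
        # run_inputs(inputs) -> farthest x-position reached (the fitness function)
        seed = [random.randrange(256) for _ in range(60)]
        population = [(run_inputs(seed), seed)]        # stored paths are the population
        for _ in range(generations):
            _, parent = max(random.sample(population, min(5, len(population))))
            child = mutate(parent)                     # mutation operator
            population.append((run_inputs(child), child))
            population.sort(reverse=True)              # selection pressure on input sequences
            population = population[:POPULATION_SIZE]
        return population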

Interesting point about the RL cold start: one could definitely use the paths discovered through the evolutionary exploration to seed an RL agent's initial experience, which could help skip the early random-flailing phase.
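As a toy illustration of that seeding idea (the buffer format and progress-based reward are my assumptions, not anything from the posts), you could replay each discovered path once and drop the transitions into the agent's replay buffer before training starts:

    def seed_replay_buffer(paths, replay_game, buffer):
        # replay_game(inputs) is assumed to yield (state, action, next_state, x_pos)
        # for every frame of a replayed input sequence
        for inputs in paths:
            prev_x = 0.0
            for state, action, next_state, x_pos in replay_game(inputs):
                reward = x_pos - prev_x    # progress along x as a dense reward signal
                buffer.append((state, action, reward, next_state))
                prev_x = x_pos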

The key difference from RL is the goal. We're not trying to learn an optimal policy for playing the game; instead, we're trying to explore as much of the state space as possible to find bugs. In Part 2 we plug in a behavior model that validates correctness at every frame during exploration (velocity constraints, causal movement checks, collision invariants). The combination is where it gets interesting: autonomous exploration discovers the states, and the behavior model catches when the game violates its own rules. For testing, the main reason we even care about completing each level is that a completed path serves as the base for more extensive exploration at every point along it. If the exploration can't reach the end, we by definition miss a large part of the state space.
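To make the behavior-model idea concrete, here's a simplified frame-check sketch. The thresholds, state fields, and specific checks are illustrative stand-ins, not the actual model from Part 2:

    from dataclasses import dataclass

    @dataclass
    class FrameState:
        x: float
        y: float
        left_pressed: bool
        right_pressed: bool

    MAX_SPEED = 4.0  # assumed per-frame speed cap, not the real value

    def check_frame(prev, curr, solid_at):
        """Validate one frame transition; returns a list of violated invariants."""
        violations = []

        # velocity constraint: position can't change faster than the physics allows
        dx, dy = curr.x - prev.x, curr.y - prev.y
        if abs(dx) > MAX_SPEED or abs(dy) > MAX_SPEED:
            violations.append(f"teleport: moved ({dx:.1f}, {dy:.1f}) in one frame")

        # causal movement check (simplified: a real model would account for momentum)
        if dx > 0 and not curr.right_pressed:
            violations.append("moved right with no right input")
        if dx < 0 and not curr.left_pressed:
            violations.append("moved left with no left input")

        # collision invariant: the player should never end a frame inside solid tiles
        if solid_at(curr.x, curr.y):
            violations.append("player embedded in solid geometry")

        return violations

During exploration you'd run something like check_frame after every emulated frame; any non-empty result becomes a bug report, paired with the input path that reproduces it.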