ykhli 5 hours ago
My unvalidated theory is that this comes down to the coding model's training objective: Tetris is fundamentally an optimization problem with delayed rewards. Some models seem to aggressively over-optimize for near-term wins (clearing lines quickly), which looks good early but leads to brittle board states and catastrophic failures later. Others appear to learn more stable heuristics - board smoothness, height control, long-term survivability - even if that sacrifices short-term score. That difference in objective bias shows up very clearly in Tetris but is much harder to notice in typical coding benchmarks. Just a theory, though, based on reviewing game results and logs.
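To make the two biases concrete, here's a toy sketch (my own illustration, not from any model's actual objective): two board evaluators over a simplified Tetris state (per-column heights plus a hole count). The weights are made up for illustration; real heuristic weights would be tuned or learned.

```python
# Toy sketch of the two objective biases. Board state is simplified to
# per-column heights and a hole count; all weights are illustrative.

def greedy_score(lines_cleared, heights, holes):
    # Near-term bias: reward cleared lines, ignore board shape entirely.
    return 10 * lines_cleared

def stable_score(lines_cleared, heights, holes):
    # Survival bias: penalize aggregate height, holes, and unevenness
    # (bumpiness), even at the cost of immediate clears.
    bumpiness = sum(abs(a - b) for a, b in zip(heights, heights[1:]))
    return (0.76 * lines_cleared
            - 0.51 * sum(heights)
            - 0.36 * holes
            - 0.18 * bumpiness)

# A tall, jagged board right after a quick line clear: the greedy
# evaluator still rates it positively, the stable one strongly rejects it.
heights = [9, 2, 8, 1, 9, 3, 7, 2, 9, 4]
print(greedy_score(1, heights, holes=6))   # positive: likes the clear
print(stable_score(1, heights, holes=6))   # negative: sees the brittle board
```

The point is just that both evaluators agree a line was cleared, but only the second one notices the board is one bad piece away from topping out.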