fellowniusmonk 5 days ago
I have a very complex set of logic puzzles that I run through my own tests. My logic test, along with trying to get an agent to develop a certain type of ** implementation (one that is published, so the model has been trained on it to some limited extent), really stress-tests models, and 5.2 fails completely because of overfitting. Really, really bad, in an unrecoverable, infinite-loop way. The test is most revealing when you have existing working code that you know the model can't have been trained on. It doesn't actually evaluate the working code; it just assumes the code is wrong and starts trying to rewrite it as a different type of **. Even when I link it to the explanation and the git repo of the reference implementation, it still persists in trying to force a different **. This is the worst model since pre-o3. Just terrible.