| ▲ | uriegas 4 hours ago | |
I do agree with his identification of the problem: sometimes agents fail because of the tools around it and not because of the model's reasoning. However, for the failing tests I think he is not making the distinction between a failed test due to a harness failure or due to a reasoning failure. It would be nice if someone analyzed that from the data set. | ||