Game_Ender, 5 hours ago:
Why should he put effort into measuring a tool that the author has not? The point is that there are so many of these tools that an objective benchmark the creators themselves could compare against would be more useful. So the better question to ask is: do you have any ideas for an objective way to measure the performance of agentic coding tools, so we can truly determine what improves performance and what doesn't? I would hope that, internally, OpenAI and Anthropic use something similar to the harnesses and test cases they use for training their full models to determine whether changes to Claude Code result in better performance.
morkalork, 3 hours ago (reply):
Well, if I were Microsoft training Copilot, I would log all the <restore checkpoint> user actions and grade the agents on that. At scale, across all users, "resets per agent command" should be a useful signal. But then again, publishing the true numbers might be embarrassing...
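A minimal sketch of how that "resets per agent command" metric could be aggregated from interaction logs. The event names, fields, and data here are assumptions for illustration, not any real Copilot telemetry schema:

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical log event; field names are assumptions, not a real telemetry schema.
@dataclass
class Event:
    agent: str  # which agent/model handled the command
    kind: str   # e.g. "agent_command" or "restore_checkpoint"

def resets_per_command(events: list[Event]) -> dict[str, float]:
    """Compute 'resets per agent command' for each agent from an event stream."""
    commands = defaultdict(int)
    resets = defaultdict(int)
    for e in events:
        if e.kind == "agent_command":
            commands[e.agent] += 1
        elif e.kind == "restore_checkpoint":
            resets[e.agent] += 1
    # Ratio of user rollbacks to agent commands; lower suggests fewer rejected edits.
    return {a: resets[a] / commands[a] for a in commands if commands[a] > 0}

# Example usage with made-up data:
log = [
    Event("agent-a", "agent_command"), Event("agent-a", "restore_checkpoint"),
    Event("agent-a", "agent_command"),
    Event("agent-b", "agent_command"), Event("agent-b", "agent_command"),
]
print(resets_per_command(log))  # {'agent-a': 0.5, 'agent-b': 0.0}
```

The ratio only captures explicit rollbacks, so it would undercount cases where users quietly discard or rewrite the agent's output instead of restoring a checkpoint.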