| ▲ | crorella 4 hours ago | |
The variety of tasks they can do, and will be asked to do, is too wide and dissimilar. It will be very hard to have a single cross-cutting measurement; at most we will have area-specific consensus that model X or Y is better. It is like saying one person is the best coder at everything: that person does not exist. | ||
| ▲ | pixl97 3 hours ago | parent [-] | |
Yeah, we're going to need benchmarks that incorporate a series of development steps for a particular language and measure how good each model is at each one. Can the model take your plan and ask the right questions where there appear to be holes? How broad is its understanding of architecture and system design around your language? How well does it choose among the algorithms available in the language or its common libraries? How often does it hallucinate features or libraries that aren't there? How does it perform as the context gets larger? And that's for just one particular language. | ||