NitpickLawyer 5 hours ago
The real fun begins when you consider that with every new generation of models + harnesses they get better at this. "Better" can mean better at sorting good/bad repos, better at coming up with good scenarios, better at following instructions, better at navigating the repos, better at solving the actual bugs, better at proposing bugs, etc. So the next version is better still, because it got more data / better data. And so on. This is mainly why we're seeing so many improvements, so fast (month to month now, from every ~3 months six months ago, from every ~6 months a year ago). It becomes a literal "throw money at the problem" type of improvement. For anything that's "verifiable" this is going to continue. For anything that's not, things can also improve with concepts like "LLM as a judge" and "council of LLMs". Slower, but it can still improve.
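A minimal sketch of the "council of LLMs" idea mentioned above: several independent judge models score a candidate answer, and the majority verdict serves as a weak quality signal for non-verifiable tasks. The judge functions here are hypothetical stand-ins; in practice each would be a call to a different model.

```python
from collections import Counter

def council_verdict(answer, judges):
    """Return the majority label ('good'/'bad') across a list of judge functions."""
    votes = Counter(judge(answer) for judge in judges)
    return votes.most_common(1)[0][0]

# Stand-in judges (hypothetical heuristics, not real model calls):
judge_a = lambda a: "good" if len(a) > 10 else "bad"
judge_b = lambda a: "good" if "because" in a else "bad"
judge_c = lambda a: "good"

print(council_verdict("short", [judge_a, judge_b, judge_c]))  # majority says "bad"
```

Averaging over independent judges dampens any single judge's quirks, though as the sibling comment notes, it can't remove a bias the judges all share.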
alex43578 5 hours ago
Judgment-based problems are still tough. LLM-as-a-judge might just bake those earlier models' biases in even deeper. Imagine if ChatGPT judged photos: anything yellow would win.
losvedir 4 hours ago
Yeah, it's very interesting. Sort of like how you need microchips to design microchips these days.