jareds a day ago

I'll look at it when this shows up on https://aider.chat/docs/leaderboards/. Keeping up with all the models feels like a full-time job, so I just use this instead and hopefully get 90% of the benefit I would get by manually testing every model.

evantbyrne a day ago | parent | next [-]

Are these just leetcode exercises? What I would like to see is an independent benchmark based on real tasks in codebases of varying size.

rafram a day ago | parent | next [-]

Aider uses a dataset of 500 GitHub issues, so not LeetCode-style work.

evantbyrne a day ago | parent [-]

It says right on that linked page:

> Aider’s polyglot benchmark tests LLMs on 225 challenging Exercism coding exercises across C++, Go, Java, JavaScript, Python, and Rust.

I looked up Exercism and they appear to be story problems that you solve by coding on mostly/entirely blank slates, unless I'm missing something? That format would seem to explain why the models are reportedly performing so well, because they definitely aren't that reliable on mature codebases.

KaoruAoiShiho a day ago | parent | prev [-]

I don't think Aider is just LeetCode exercises? LiveCodeBench is LeetCode exercises, though.
