phtrivier 6 hours ago

What is the reference, unbiased, honest, reputable and trustworthy site that ranks and compares models on the couple of realistic metrics that matter ("Does it work for code", "no, I mean, for real", "how much does it cost", etc.)?

kccqzy 5 hours ago | parent | next [-]

It’s not really possible unless you try. Different people use models so differently. The whole model situation has exposed minute differences in people's personal preferences about how they code. Some people think carefully and strive to write code that’s as bug-free as humanly possible on the first try; others write something that is only approximately correct and then iterate afterwards. The former would align with a model that thinks for 40 minutes before producing flawless code; the latter would be driven mad by this excessive thinking. Some people like to interrupt the AI as soon as they see it making a mistake; others let it continue and tell it about the mistake afterwards.

girvo 6 hours ago | parent | prev | next [-]

Truthfully? There isn't one. They all have flaws. Your best bet is to look at all of them, and then run a suite of evals yourself. It's rough out here!

bel8 6 hours ago | parent | prev [-]

The only metric that has worked for me is running the same prompt 5x for each LLM on my projects.

I keep specific branches in a state where they are ready for developing new features.
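
For what it's worth, the "same prompt, N runs per model" loop could be sketched roughly like this. Everything here is an assumption for illustration: `compare_models`, the `generate` callables, and the `check` function are hypothetical names, and the stand-in models are stubs rather than real LLM clients. In practice `generate` would call whatever model or agent you use against a clean checkout of the prepared branch, and `check` would run your tests on the result.

```python
from typing import Callable, Dict


def compare_models(
    models: Dict[str, Callable[[str], str]],
    prompt: str,
    check: Callable[[str], bool],
    runs: int = 5,
) -> Dict[str, float]:
    """Run the same prompt `runs` times per model and return each pass rate.

    `models` maps a label to a hypothetical function that takes a prompt and
    returns the model's output. `check` decides whether one output counts as
    a success (for example, "the test suite passes").
    """
    rates: Dict[str, float] = {}
    for name, generate in models.items():
        passes = sum(1 for _ in range(runs) if check(generate(prompt)))
        rates[name] = passes / runs
    return rates


if __name__ == "__main__":
    # Stub "models" so the sketch runs without any API access.
    stub_models = {
        "model_a": lambda prompt: "ok",
        "model_b": lambda prompt: "fail",
    }
    rates = compare_models(stub_models, "add feature X", lambda out: out == "ok")
    for name, rate in rates.items():
        print(f"{name}: {rate:.0%} of runs passed")
```

Repeating each run matters because a single attempt mostly measures luck; the pass rate over several attempts is a slightly less noisy signal.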