sally_glance 7 hours ago
This is the hard part - especially with larger initiatives, it takes quite a bit of work to evaluate what the current combination of harness + LLM is good at. Running experiments yourself is cumbersome and expensive, and public benchmarks are flawed. I wish providers would release at least a set of blessed example trajectories alongside new models. As it is, we're stuck with "yeah, it seems this works well for bootstrapping a Next.js UI"...