the_harpia_io 3 hours ago

honestly the harness thing is way more important than people realize - I've been working on code security tools, and the gap between what a model generates raw vs. with better structure around it is massive, way bigger than the difference between model versions. like, of the security bugs I see in AI code, half are just because the prompt didn't include enough context or the edit format was wonky

the benchmark overselling isn't the point though - it's that we're barely using these things right. most people still chat with them like it's 2023. what happens when you combine this with actual review flows, not just 'beat SWE-bench'?

idk, I think everyone's too focused on the model when tooling matters more, and it's the part you can actually control