| ▲ | noupdates 6 hours ago | |
I was just looking at the SWE-bench docs and it seems like they use almost an arbitrary form of context engineering (loading in some arbitrary amount of files to saturate context). So in a way, the bench suites test how good a model is with little to no context engineering (I know ... it doesn't need to be said). We may not actually know which models are sensitive to good context-engineering, we're simply assuming all models are. I absolutely agree with you on one thing, there is definitely a ton of low hanging fruit. | ||