Arech 6 hours ago
That's what I thought of too. Given their task formulation (they basically said, "check these binaries with these tools at your disposal" — and that's it!), their results are already super impressive. With proper guidance and professional oversight, it's a tremendous force multiplier.
selridge 5 hours ago | parent
We are in this super weird space where the comparable tasks are one-shot, e.g. "make me a to-do app" or "check these binaries", but any real work is multi-turn and dynamically structured. Yet when we're trying to share results, "a talented engineer sat with the thread and wrote tests/docs/harnesses to guide the model" is less impressive than "we asked it and it figured it out," even though the former is how real work will happen. It creates this perverse scenario (which is no one's fault!) where we talk about one-shot performance, but one-shot performance is useful in exactly 0 interesting cases.