forgotpwd16 4 hours ago
Unless I misread, the 2 hours isn't the time limit for the candidate but the time Claude eventually needed to outperform the best submitted solution. The best candidate could have taken anywhere from 6 hours to 2 days to achieve that result.
fhd2 4 hours ago
Their Readme.md is weirdly obsessed with "2 hours":

"before Claude Opus 4.5 started doing better than humans given only 2 hours"

"Claude Opus 4.5 in a casual Claude Code session, approximately matching the best human performance in 2 hours"

"Claude Opus 4.5 after 2 hours in our test-time compute harness"

"Claude Sonnet 4.5 after many more than 2 hours of test-time compute"

So that does make one wonder where this comes from. Could just be LLM generated with a talking point of "2 hours"; models can fall in love with that kind of stuff. "after many more than 2 hours" is a bit of a tell. Would be quite curious to know, though.

How I usually design take-home assignments:

1. The candidate has several _days_ to complete it (usually around a week).

2. I design the task to only _take_ 2-4 hours and tell the candidate as much, but that doesn't mean they can't take longer.

The subsequent interview usually reveals whether they went overboard or struggled more than expected. But I can easily picture some places sending a candidate the assignment and asking them to hand in their work within two hours. Similar to good old coding competitions.
alcasa 3 hours ago
No, the 2 hours is their time limit for candidates. The thing is, you're allowed to use any non-human help on their take-homes (open book), so if AI can solve it in under 2 hours, it's not very good at assessing the human.