a good benchmark would probably be porting a selected repo to another language, then clearing the context and any notes, and having the model port it back.
as long as there's a test framework, you could gauge success deterministically: run the original test suite against the round-tripped code and see whether it still passes.