I heard it's because the labs fine-tune their models for their own harnesses. Same reason Claude does better in Claude Code than in Cursor.