a_bonobo 20 hours ago
Could the exclusion of CoT be because of this recent Anthropic paper? https://assets.anthropic.com/m/71876fabef0f0ed4/original/rea...

> We evaluate CoT faithfulness of state-of-the-art reasoning models across 6 reasoning hints presented in the prompts and find: (1) for most settings and models tested, CoTs reveal their usage of hints in at least 1% of examples where they use the hint, but the reveal rate is often below 20%, (2) outcome-based reinforcement learning initially improves faithfulness but plateaus without saturating, and (3) when reinforcement learning increases how frequently hints are used (reward hacking), the propensity to verbalize them does not increase, even without training against a CoT monitor. These results suggest that CoT monitoring is a promising way of noticing undesired behaviors during training and evaluations, but that it is not sufficient to rule them out.

I.e., chain of thought may be a confabulation by the model, too. So perhaps there's somebody at Anthropic who doesn't want to mislead their customers. Perhaps they'll come back once this problem is solved.
whimsicalism 17 hours ago
I think it is almost certainly to prevent distillation.
andrepd 13 hours ago
I have no idea what this means. Can someone give the ELI5?