ethan_l_shen 3 hours ago
Hey! These are great observations. So first, while TTS can improve performance, we wanted to evaluate the raw capability of our model. That meant generating only one rollout per evaluation instance, which follows other papers in the space like SWE-smith and BugPilot. TTS also adds extra inference cost and is sensitive to how rollouts are ranked, two confounding factors for deployable models where memory and inference speed are extremely important.

Following that line of reasoning, context length is another large confounding factor. Longer contexts improve performance, but they also cause enormous increases in KV cache size and memory requirements. We decided to control for this in our paper and focus on a 32K context length for 32B-size models, a context length that already pushes the bounds of what can be "deployable" locally. Still, we evaluate at 64K context using YaRN and are able to outperform CWM's 54% (non-TTS), which it achieves using a 128K context, a substantial increase over what we use. This is also pretty significant because we only ever train at 32K context, while CWM trains at a full 128K.
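
To make the KV-cache point concrete, here's a rough back-of-the-envelope sizing sketch in Python. The architecture numbers are assumptions for illustration (a Qwen2.5-32B-style GQA layout: 64 layers, 8 KV heads, head dim 128, 2-byte fp16/bf16 entries), not necessarily the model's actual config:

    # Hypothetical 32B-class GQA layout; swap in your model's real numbers.
    def kv_cache_bytes(seq_len, num_layers=64, num_kv_heads=8,
                       head_dim=128, bytes_per_elem=2):
        # 2x for the separate key and value tensors at every layer
        return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

    for ctx in (32_768, 65_536, 131_072):
        print(f"{ctx:>7} tokens -> {kv_cache_bytes(ctx) / 2**30:5.1f} GiB per sequence")

Under those assumed dimensions the cache alone is ~8 GiB per sequence at 32K and ~32 GiB at 128K: KV memory grows linearly with context, so quadrupling the window quadruples the cache on top of the model weights, which is the memory cost being traded away by capping at 32K.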