| ▲ | embedding-shape 2 hours ago | |
> I guess the goal is to test the models and not the harness

Less important than the harness are the system/user prompts themselves (which, of course, are put in the harness), and those are effectively what this study seems to be testing. With a better prompt, I'm sure the models would look much more alike, as the biggest/best models have more or less identical, strong prompt adherence in my experience. | ||