Workaccount2 | 4 days ago
Maybe I am not understanding the paper correctly, but it seems they tested "state of the art models" that are almost entirely open-source models under 27B parameters, mostly 8B and 3B. This is kind of like giving algebra problems to 7-year-olds to "test human algebra ability." If you are holding up a 3B-parameter model as evidence that "LLMs can't reason," I'm not sure whether the authors are confused or out of touch. They do test 4o and o1-preview, but those results are notably absent from the paper's conclusion.
dartos | 4 days ago | parent
It’s difficult to reproducibly test OpenAI models, since they can change out from under you and you don’t have control over every hyperparameter. It would’ve been nice to see one of the larger Llama models, though.
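For what it's worth, you can pin down some of it. Here's a minimal sketch assuming the current OpenAI Python SDK (v1.x); the snapshot name and prompt are just illustrative:

    # Pinning what can be pinned when testing an OpenAI model.
    # Assumes the OpenAI Python SDK (v1.x); snapshot name and
    # prompt are illustrative, not from the paper.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    resp = client.chat.completions.create(
        model="gpt-4o-2024-08-06",  # pinned snapshot, not the floating "gpt-4o" alias
        temperature=0,              # reduce sampling variance
        seed=42,                    # best-effort determinism
        messages=[{"role": "user", "content": "If 3x + 5 = 20, what is x?"}],
    )
    print(resp.choices[0].message.content)
    # If system_fingerprint differs between runs, the serving backend
    # changed, and the same seed no longer guarantees the same output.
    print(resp.system_fingerprint)

Even then the seed is only best-effort, which is sort of the point: you can't fully freeze a hosted model the way you can a local checkpoint.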