simianwords 3 days ago

There’s no way this is right. I checked complicated ones with the latest thinking model. Can someone come up with a counterexample?

Edit: here’s what I tried: https://chatgpt.com/share/69cebb52-56a8-838f-969c-c47308262a...
stratos123 3 days ago

Did you use the exact API call shown in the paper? I can't replicate the paper's counterexamples via the chat UI either, but that's not very surprising: if the LLM already fails only a few cases out of thousands, the small differences in context between the API and the chat UI might be enough to fix them.
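For concreteness, here's roughly what I mean by the context difference — a hypothetical sketch (the system message and prompt are made up for illustration, not taken from the paper):

```python
# Hypothetical sketch of why chat-UI and API results can diverge:
# a chat UI usually prepends a system message (and possibly other
# context), while a bare API call sends only the user turn.

def bare_api_messages(prompt: str) -> list[dict]:
    # What a direct API call would send: just the user message.
    return [{"role": "user", "content": prompt}]

def chat_ui_messages(prompt: str) -> list[dict]:
    # What a chat UI effectively sends: extra system context first.
    system = "You are a helpful assistant."  # placeholder system prompt
    return [{"role": "system", "content": system},
            {"role": "user", "content": prompt}]

prompt = "Does the string 'abab' match the regex (ab)*?"
print(bare_api_messages(prompt))   # one message
print(chat_ui_messages(prompt))    # two messages: system + user
```

Even a small extra system prompt like this changes the model's input distribution, which could plausibly flip the handful of borderline failures the paper reports.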
pton_xd 3 days ago

"in this paper we primarily evaluate the LLM itself without external tool calls." Maybe this is a factor?