Art9681 | 3 days ago
It can be summarized as "Did you RTFM?" You shouldn't expect optimal results if you haven't invested the time and effort to learn the tool, any tool, and LLMs are no different. GPT-5 isn't one model, it's a family: gpt-5, gpt-5-mini, and gpt-5-nano, each configurable with high|medium|low reasoning effort. Anyone serious about measuring model capability would go for the best configuration, especially in medicine. I skimmed the paper and didn't see any mention of what parameters they used, other than that they called gpt-5 via the API. What was the reasoning_effort? The verbosity? The temperature? These things matter.
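To make it concrete, here's a minimal sketch of the kind of request payload a paper like this should pin down and report. The field names follow the OpenAI Responses API as documented around the GPT-5 launch, but treat the exact structure as an assumption and check current docs:

```python
# Sketch of an evaluation request config for GPT-5 via the API.
# Field names assume the OpenAI Responses API; verify before relying on them.
payload = {
    "model": "gpt-5",                 # vs. gpt-5-mini / gpt-5-nano
    "reasoning": {"effort": "high"},  # the reasoning_effort knob
    "text": {"verbosity": "high"},    # the verbosity knob
    "input": "clinical question goes here",  # placeholder prompt
}

# A real run would send this with an API client, e.g.:
#   client = openai.OpenAI()
#   resp = client.responses.create(**payload)
# Reporting `payload` verbatim is what makes the benchmark reproducible.
print(payload["model"], payload["reasoning"]["effort"])
```

The point isn't this exact snippet; it's that every one of these fields changes measured capability, so a capability claim without them is underspecified.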