Why didn't OpenAI finetune the model to use the python tool it has for these tasks?
They do, in the paper they mention they evaluate the LLM without tools