Benchmark it against a fast Python interpreter optimized for AI tool calling, like Monty: https://github.com/pydantic/monty
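
One rough way to set up that comparison is to time the same snippet end to end through each runner and compare medians. The sketch below uses only the Python standard library. `run_with_cpython_exec`, `benchmark`, and `SNIPPET` are names made up for illustration, and the Monty entry is left as a commented-out placeholder because its API is not shown here; that adapter would need to be written against the project's own documentation.

```python
import statistics
import time
from typing import Callable, Dict


def run_with_cpython_exec(source: str) -> None:
    """Baseline runner: execute the snippet in-process with CPython's exec()."""
    exec(compile(source, "<snippet>", "exec"), {})


def benchmark(
    runners: Dict[str, Callable[[str], None]],
    source: str,
    repeats: int = 20,
    warmup: int = 3,
) -> Dict[str, Dict[str, float]]:
    """Time each runner end to end on the same snippet and summarise the samples."""
    results: Dict[str, Dict[str, float]] = {}
    for name, run in runners.items():
        # Warm-up runs are discarded so one-off setup costs do not skew the samples.
        for _ in range(warmup):
            run(source)
        samples = []
        for _ in range(repeats):
            start = time.perf_counter()
            run(source)
            samples.append(time.perf_counter() - start)
        results[name] = {
            "median_s": statistics.median(samples),
            "min_s": min(samples),
            "max_s": max(samples),
        }
    return results


# Hypothetical stand-in for the kind of short script a tool-calling agent might emit.
SNIPPET = "total = sum(i * i for i in range(1000))"

if __name__ == "__main__":
    runners = {
        "cpython_exec": run_with_cpython_exec,
        # "monty": run_with_monty,  # hypothetical adapter, not defined here
    }
    for name, stats in benchmark(runners, SNIPPET).items():
        print(f"{name}: median {stats['median_s'] * 1e6:.1f} us")
```

Tool calls tend to be short scripts, so it may also be worth timing a cold start per call (for example, a fresh process or interpreter instance each time) alongside the warm numbers above.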