CSMastermind | 2 days ago
Okay, I decided to benchmark a bunch of AI models with GeoGuessr. One round each on Diverse World; here's how they did, out of 25,000:

Claude 3.7 Sonnet: 22,759
Qwen2.5-Max: 22,666
o3-mini-high: 22,159
Gemini 2.5 Pro: 18,479
Llama 4 Maverick: 14,316
mistral-large-latest: 10,405
Grok 3: 5,218
Deepseek R1: 0
command-a-03-2025: 0
Nova Pro: 0
nemo1618 | 2 days ago
Neat, thanks for doing this! | |||||||||||||||||
msephton | 2 days ago
How does Google Lens compare? | |||||||||||||||||
bn-l | 2 days ago
What about o4-mini-high?