jmathai (a day ago):
You may also be getting a worse result for a higher cost. For a medical use case, we tested multiple Anthropic and OpenAI models as well as MedGemma, and we were pleasantly surprised when the LLM-as-judge evaluation scored gpt5-mini as the clear winner. I don't think I would have considered it for these specific use cases on my own, assuming higher reasoning was necessary. We're still waiting on human evaluation to confirm the LLM judge was correct.
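As a rough sketch of the kind of pairwise LLM-as-judge comparison described above: the code below uses the OpenAI Python SDK, but the judge model, rubric, question, and candidate answers are all hypothetical placeholders, not the commenter's actual setup.

    # Rough sketch of a pairwise LLM-as-judge comparison.
    # Assumes the OpenAI Python SDK (pip install openai) and an
    # OPENAI_API_KEY in the environment; all names are placeholders.
    from openai import OpenAI

    client = OpenAI()

    JUDGE_MODEL = "gpt-4o"  # hypothetical judge; any capable model works

    RUBRIC = (
        "You are judging two answers to a medical question. Score each "
        "answer from 1-10 for clinical accuracy, completeness, and safety. "
        "Reply with exactly: A=<score> B=<score>"
    )

    def judge_pair(question: str, answer_a: str, answer_b: str) -> str:
        """Ask the judge model to score two candidate answers."""
        response = client.chat.completions.create(
            model=JUDGE_MODEL,
            temperature=0,  # keep scoring as deterministic as possible
            messages=[
                {"role": "system", "content": RUBRIC},
                {
                    "role": "user",
                    "content": (
                        f"Question:\n{question}\n\n"
                        f"Answer A:\n{answer_a}\n\n"
                        f"Answer B:\n{answer_b}"
                    ),
                },
            ],
        )
        return response.choices[0].message.content

    # Placeholder usage: answers would come from the candidate models.
    print(judge_pair(
        "What are first-line treatments for uncomplicated hypertension?",
        answer_a="<output from gpt5-mini>",
        answer_b="<output from a larger model>",
    ))

In practice you would also want to run each pair twice with the A/B positions swapped and average the scores, since judge models are known to exhibit position bias.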
lorey (a day ago):
That's interesting. Similarly, we found that for very simple tasks the older Haiku models are worth a look: they're cheaper than the latest Haiku models and often perform equally well.
andy99 (a day ago):
You obviously know what you're looking for better than I do, but personally I'd want to see a narrative that makes sense before accepting that a smaller model somehow just performs better, even if the benchmarks say so. There may be such an explanation, but without one it feels very dicey.